labscript-suite / runmanager

runmanager is an intuitive graphical interface for controlling labscript suite experiments. It includes multi-dimensional parameter scans and a remote programming interface for automation.
http://labscriptsuite.org

Make runmanager remote timeout configurable #96

Closed: zakv closed this issue 4 years ago

zakv commented 4 years ago

Issue

In the past (see here) the runmanager remote timeout had to be increased to account for the fact that updating globals can take a while. However, as mentioned here, I've still seen timeouts on my machine even after that PR bumped the period up to 15 seconds.

Updated Proposal:

Add support for a communication_timeout option in the [timeouts] section of the labconfig.

Original proposal:

This PR proposes simply bumping up the timeout period to 60 seconds. I haven't seen that fail yet on my machine.

philipstarkey commented 4 years ago

I think that this will ultimately just be a stopgap measure. My experience with runmanager (not using the remote interface) has been that the time needed to update a global is related to both the number of globals and the size of the globals h5 file. The globals h5 file grows in size as globals are changed (I'm not certain if this is every time a global changes or only under certain conditions) due to the way h5 files allocate new space rather than reusing existing allocations. I've had globals files approach 100 times their real size, and runmanager starts getting painfully slow at that point!

So I'm not certain that there is an appropriate upper limit for this timeout...

Still worth increasing to 60s, but it could be worth investigating why it's so slow?

@zakv How big is your globals h5 file at the moment (on disk) and how big is it if you copy the globals to a new h5 file (an easy way to check this is to compile a shot with almost no hardware instructions I think)?

zakv commented 4 years ago

I have globals split across different files, but they're typically 10 kB to 100 kB after running them through h5repack to remove the bloat. Unfortunately I did that recently and don't remember exactly how big the files were before, but I've seen them grow to over 1 GB. When running M-LOOP optimizations many globals change frequently so the hdf5 files can expand pretty quickly.

I've also seen that the global updating time can change based on system load, so I agree that there isn't really a good strict upper bound for how long this could take. Maybe a more robust approach would be to have time-consuming remote commands provide two responses, one to say "command received" and another to say "command completed"? There could potentially be a separate timeout period for each response.
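As a very rough illustration of that idea, here is a minimal client-side sketch using plain pyzmq. The endpoint, message contents, and the assumption of a ROUTER-style server that sends a b'received' message before the real reply are all hypothetical; this is not runmanager's actual protocol, just the shape of the two-response pattern:

    import zmq

    def request_with_ack(endpoint, request, ack_timeout=5, completion_timeout=60):
        # Hypothetical two-phase request: wait briefly for a "command received"
        # message, then wait much longer for the "command completed" reply.
        ctx = zmq.Context.instance()
        sock = ctx.socket(zmq.DEALER)
        sock.connect(endpoint)
        poller = zmq.Poller()
        poller.register(sock, zmq.POLLIN)
        try:
            sock.send(request)
            # Short timeout: the server should acknowledge receipt almost immediately.
            if not poller.poll(ack_timeout * 1000):
                raise TimeoutError('No "command received" acknowledgement')
            assert sock.recv() == b'received'
            # Long timeout: actually updating the globals may take a while.
            if not poller.poll(completion_timeout * 1000):
                raise TimeoutError('Command acknowledged but never completed')
            return sock.recv()
        finally:
            sock.close(linger=0)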

I haven't looked too much into why it's slow sometimes, but I'd think that the execution time of runmanager's code should scale with the number of updated globals but not with the size of the hdf5 file on disk. If the bloat slows things down, I'd expect that to occur at a lower level, somewhere in h5py or below.

If you think it'd be worthwhile I could run some tests where I just repeatedly change some dummy globals and plot how long each global update takes as a function of file size to see how it grows. I might do that anyway just out of curiosity. I could also do that using h5py directly to bypass any runmanager code and verify that the issue is at a lower level.
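For concreteness, this is roughly the kind of quick-and-dirty script I have in mind. The file path, iteration count, and global names are placeholders, and it assumes runmanager.remote.set_globals() accepts a dict mapping global names to new values:

    import os
    import time

    from runmanager import remote

    # Repeatedly set 10 dummy globals through the remote interface and record
    # how long each call takes alongside the globals file's size on disk.
    GLOBALS_FILE = '/path/to/dummy_globals.h5'  # placeholder path
    N_ITERATIONS = 5000

    durations, sizes = [], []
    for i in range(N_ITERATIONS):
        new_values = {f'dummy_global_{j}': float(i) for j in range(10)}
        start = time.perf_counter()
        remote.set_globals(new_values)  # assumed to take a dict of name -> value
        durations.append(time.perf_counter() - start)
        sizes.append(os.path.getsize(GLOBALS_FILE))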

zakv commented 4 years ago

I ended up trying out some of the tests I mentioned and the results are a bit surprising. I made a quick script to repeatedly set the values of 10 dummy globals, then record how long it took to do so and the size of the globals file. The results for the case where both the remote requests and the globals file are on the computer running runmanager are below.

[plots: set_globals() call duration and globals file size vs. iteration, globals file on the local machine]

Interestingly, although the file size grows linearly to ~600 MB, the time to set the globals doesn't really increase. The bottom plot is a parametric plot of the duration of the call to set_globals() as a function of the globals file size, which ended up looking exactly like the top plot because the file size is proportional to the number of iterations.

I did this again but with the globals file stored on a network drive, which is where we normally store them. The actual durations were a bit longer, but otherwise the results were similar.

[plots: same test with the globals file stored on a network drive]

Notably none of those calls took anywhere close to 15 seconds. Maybe repeatedly setting the values keeps the file in the system's cache and makes things quicker than they usually would be?

I only use set_globals() during M-LOOP optimizations, so during the remote requests there's usually some intensive calculations going on, e.g. training neural nets. Maybe that system load can slow things down more than I initially expected. I'll see if the set_globals() calls typically take longer with that stuff running.

philipstarkey commented 4 years ago

Interesting! Not what I was expecting given my prior experience, but pretty conclusive! Given it seems likely the timeout is actually related to how heavily the PC is being used, perhaps a better long-term solution would be to make the timeout user-configurable. That's a bit messy to do given there is a default Client object. Might need the timeout to use a module-level variable that can be updated by the user at any time...?
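Something like this minimal sketch, where all the names are purely illustrative and nothing is assumed about the real constructor signature:

    # Hypothetical contents of runmanager/remote.py (illustrative names only).
    default_timeout = 60  # seconds; users can change this at runtime

    class Client:
        def __init__(self, host='localhost', port=12345, timeout=None):
            # The host/port defaults above are placeholders, not the real ones.
            self.host = host
            self.port = port
            # None means "use the module-level default at call time", so changing
            # default_timeout later also affects the pre-made default Client.
            self._timeout = timeout

        @property
        def timeout(self):
            return self._timeout if self._timeout is not None else default_timeout

A user could then do something like runmanager.remote.default_timeout = 120 at any point to loosen the timeout for subsequent calls made through the default Client.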

If you'd rather just merge this for now and log an enhancement proposal in the issue tracker for the more complex solution, I would be fine with that too. 60s is probably a decent default for the timeout whatever the ultimate solution is.

zakv commented 4 years ago

A user-configurable timeout sounds reasonable to me! I wouldn't mind implementing your module-level-global idea for this PR if you think that's the way to go. Let me know and I'll get started.

Also, we very recently moved our network drive to a new computer, and all of the above tests were done with this new machine. The timeouts I've seen occurred a while ago when we were using our old network drive on a somewhat outdated machine. As a check, I tried the same test as above (still without M-LOOP running) with the globals on our old network drive. To speed things along I only did 1000 iterations this time, which also meant that the file didn't bloat as much. It usually still ran pretty quickly, but there were some outliers that took much longer than usual and nearly hit the 15-second timeout.

[plots: same test with the globals file on the old network drive, 1000 iterations]

I also did a quick check to see if it was a RAM issue. I just opened a bunch of Chrome tabs to limit the available memory, but the results were surprisingly similar. I actually briefly thought I was just plotting the old data again accidentally.

[plots: same test repeated with limited available memory]

So it seems that set_globals() can randomly take much longer than usual, but so far I've only seen that happen with the globals stored on a lower-performance computer.

philipstarkey commented 4 years ago

> A user-configurable timeout sounds reasonable to me! I wouldn't mind implementing your module-level-global idea for this PR if you think that's the way to go. Let me know and I'll get started.

@chrisjbillington Any thoughts on my proposal?

chrisjbillington commented 4 years ago

Merging this as-is seems fine. As for a user-configurable timeout, perhaps there ought to be a more general "communication timeout" setting in labconfig that applies to more than just runmanager. But I wouldn't mind considering that low-ish priority and seeing if 60 seconds is sufficient.

Agreed that it would be nice to distinguish a connection timeout from a response timeout, but the issue with implementing this in e.g. the runmanager remote communication protocol is that if the response is slow due to high system load, then any initial acknowledgement may be just as slow. Zmq connections do occur in a non-GIL-requiring thread running native code though, so there are mechanisms for zmq clients to be notified of a new connection and to time out based on that. So that ought to be implemented in zprocess, I think.

So I would support:

  1. Merging this
  2. A labconfig option for network/communication timeouts in this general category, that would replace hard-coded values throughout the suite.
  3. Adding the ability to zprocess to accept a different timeout value and raise a different exception for a connection timeout as opposed to a response timeout (something like this exists already, since monitoring connection success was required to raise sensible errors for security misconfigurations, but the exception raised and timeout value used are not distinguishable from a response timeout at present). Then a connection timeout value can perhaps be used that is quite short (e.g. 15 seconds), but the default response timeout can remain large (e.g. 60 seconds). This will ensure errors about misconfigured networks or apps not running are raised quickly, whereas we want to be very forgiving of actual response delays if we're pretty sure an app is receiving a client's messages!
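To illustrate point 3, here is a rough conceptual sketch using pyzmq's socket monitor directly (not zprocess's actual API; the endpoint and message are placeholders), with a short connection timeout and a much longer response timeout:

    import zmq
    from zmq.utils.monitor import recv_monitor_message

    CONNECT_TIMEOUT_MS = 15_000   # fail fast if the app isn't running / reachable
    RESPONSE_TIMEOUT_MS = 60_000  # be forgiving once we know the app got the message

    ctx = zmq.Context.instance()
    sock = ctx.socket(zmq.REQ)
    monitor = sock.get_monitor_socket()  # PAIR socket reporting connection events
    sock.connect('tcp://localhost:5555')  # placeholder endpoint

    # Phase 1: wait (briefly) for the underlying connection to be established.
    connected = False
    while monitor.poll(CONNECT_TIMEOUT_MS):
        if recv_monitor_message(monitor)['event'] == zmq.EVENT_CONNECTED:
            connected = True
            break
    if not connected:
        raise TimeoutError('Connection timeout: is the app running and the network configured?')

    # Phase 2: wait (patiently) for the actual response.
    sock.send(b'hello')
    if not sock.poll(RESPONSE_TIMEOUT_MS):
        raise TimeoutError('Response timeout: the app is reachable but slow to reply')
    reply = sock.recv()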

chrisjbillington commented 4 years ago

Oh and yes, a module-level global ensures this is hackable in the meantime.

@zakv, adding a labconfig setting defaulting to 60s is not that much harder than defining a global. It would look like this (maybe in the __init__() of runmanager.remote.Client() where some other labconfig settings are read - maybe add timeout as a keyword argument to the constructor as well, like the other settings):

    timeout = LabConfig().getfloat('timeouts', 'communication_timeout', fallback=60)

and then adding a section (I can't think of another section for it to go in) to the default labconfig in labscript_utils:

    [timeouts]
    communication_timeout = 60
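Putting the two together, the constructor change might look roughly like this (the other arguments and their handling are illustrative, not the actual signature):

    from labscript_utils.labconfig import LabConfig

    class Client:
        def __init__(self, host=None, port=None, timeout=None):
            config = LabConfig()
            self.host = host
            self.port = port
            if timeout is None:
                # Fall back to the new [timeouts] section, defaulting to 60 s if
                # the option (or the whole section) is absent from the labconfig.
                timeout = config.getfloat('timeouts', 'communication_timeout', fallback=60)
            self.timeout = timeout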

zakv commented 4 years ago

Sounds good, will do.

I'll also make a PR in labscript_utils to add the [timeouts] section to example.ini. Is there anything else I should edit?

zakv commented 4 years ago

Implemented @chrisjbillington's suggestions in 8cfb5e5. I ran the file through black with --skip-string-normalization, which seemed to be the consensus on the "general" - "code formatting" Zulip stream/topic. I left the line length at the default value since there seemed to be less of a consensus there, so there's some wrapping. Happy to reformat if desired.

chrisjbillington commented 4 years ago

Great! Looks good to me. The bit of extra wrapping is no big deal because it's so trivial, though I'm unsure how to approach more complex files in general. I've configured a keyboard shortcut in my text editor to blacken only the current block of code, which lets me write blackened code without otherwise messing with a file. That would be ideal, but it isn't really practical to expect contributors to set this up. Fine to play it by ear for now.

> I'll also make a PR in labscript_utils to add the [timeouts] section to example.ini. Is there anything else I should edit?

Nope, that's all you would need to do!