
BLACS, part of the labscript suite, provides an interface to hardware used to control a buffered experiment. It manages a queue of shots to be run as well as providing manual control over devices between shots.

Ways of speeding up cycle time. Pipelining, readahead, removing redundant steps. #53

Open · philipstarkey opened this issue 4 years ago

philipstarkey commented 4 years ago

Original report (archived issue) by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington).


There are a few ways we might speed up the overall rate at which shots are run. Some are pretty invasive, so it's not a minor change, but the basic idea is to split up transition_to_buffered() and transition_to_manual() into multiple steps, and then a) only call the steps that are necessary, and b) run the steps that don't depend on previous steps simultaneously.

So, for example, transition_to_manual() could be split into a sequence of such smaller steps, and transition_to_buffered() likewise.
Running as many of these steps as possible simultaneously, and skipping the unnecessary ones, could go some way towards speeding up BLACS's cycle time. In the ideal case, devices that are retriggerable with the same data will not need any reconfiguration between shots, and will contribute no overhead.
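A rough sketch of that structure, assuming hypothetical step names (instructions_differ, read_instructions, stop_manual_output, program_buffered, arm_triggers) rather than the real BLACS/driver API:

```python
# Sketch only: hypothetical step names, not the actual BLACS device API.
from concurrent.futures import ThreadPoolExecutor

class BufferedTransition:
    def __init__(self, worker, shot_file, previous_shot_file=None):
        self.worker = worker
        self.shot_file = shot_file
        self.previous = previous_shot_file

    def run(self):
        # Steps with no mutual dependency run at the same time.
        with ThreadPoolExecutor() as pool:
            read = pool.submit(self.worker.read_instructions, self.shot_file)
            stop = pool.submit(self.worker.stop_manual_output)
            for future in (read, stop):
                future.result()  # propagate any exceptions
        # Skip reprogramming entirely if a retriggerable device is getting
        # identical data to the previous shot.
        if self.worker.instructions_differ(self.shot_file, self.previous):
            self.worker.program_buffered()
        self.worker.arm_triggers()  # last: depends on everything above
```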

Profiling will reveal what the actual overhead is. If, after fixing the above sources of overhead (assuming they are what's dominating), it turns out that opening and closing HDF5 files is the slow part, then we could have some kind of intelligent "readahead" done in one hit, in a single process, as soon as the shot arrives in BLACS. Knowing from previous shots which groups and datasets a particular driver opened, all the data can be read ahead of time, and the worker process would see a proxy HDF5 file object that requires no zlock to open and already has all the data available, only opening the actual shot file if the driver attempts to read something that was not read in advance. This would consume more RAM, so it should of course be possible to disable.
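A minimal sketch of that proxy, assuming a hypothetical record of which dataset paths a driver read on previous shots (the names here are illustrative, not an existing labscript API):

```python
import h5py

class ReadaheadFile:
    """Serve previously-seen datasets from RAM; fall back to the real file."""

    def __init__(self, path, known_dataset_paths, open_real_file):
        # open_real_file stands in for the usual zlock-protected open.
        self.path = path
        self.open_real_file = open_real_file
        self.cache = {}
        with h5py.File(path, 'r') as f:
            for dataset_path in known_dataset_paths:
                if dataset_path in f:
                    self.cache[dataset_path] = f[dataset_path][...]

    def __getitem__(self, dataset_path):
        if dataset_path in self.cache:
            return self.cache[dataset_path]
        # Cache miss: the driver asked for something we didn't anticipate,
        # so open the actual shot file after all.
        with self.open_real_file(self.path, 'r') as f:
            return f[dataset_path][...]
```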

These are the sorts of optimisations we could do, but before doing any of it I would want to profile: mark particular functions and when they were called, and get some statistics to see where the bottlenecks actually are.
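Something as simple as the following (purely illustrative, not part of BLACS) would be enough to tally where the time per shot goes:

```python
import functools
import time
from collections import defaultdict

# Maps qualified function name -> list of (start_time, duration) records.
timings = defaultdict(list)

def profiled(func):
    """Record when each marked function was called and how long it took."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            timings[func.__qualname__].append(
                (start, time.perf_counter() - start))
    return wrapper

def report():
    for name, records in sorted(timings.items(),
                                key=lambda item: -sum(d for _, d in item[1])):
        total = sum(duration for _, duration in records)
        print(f'{name}: {len(records)} calls, {total:.3f}s total')
```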

philipstarkey commented 4 years ago

Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington).


philipstarkey commented 4 years ago

Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington).


philipstarkey commented 4 years ago

Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington).


philipstarkey commented 4 years ago

Original comment by Philip Starkey (Bitbucket: philipstarkey, GitHub: philipstarkey).


I suspect another slow point is the NI cards with multiple worker processes (which is most of them, I think).

There is no particular reason why communication with each worker process needs to be serialised, other than the fact that it’s a bit more complicated to implement.

To change this we would need to rewrite the mainloop in the tab base class (maybe taking advantage of some Python 3 coroutine features?). The yield calls in GUI methods would need to (optionally, for backwards compatibility) return "promises" (a concept from JavaScript, I think: effectively an object you query later for the result of the work, equivalent to what we do with inmain_later). That way, all worker processes can do work simultaneously, speeding up the transitions.

This will be particularly effective if we cache the HDF5 file.
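A sketch of what that might look like, with a hypothetical dispatch() standing in for the real cross-process plumbing (a thread pool plays the role of posting jobs to workers and returning a Future as the "promise"):

```python
from concurrent.futures import ThreadPoolExecutor, wait

_pool = ThreadPoolExecutor()

def dispatch(worker, funcname, *args):
    """Hypothetical: post a job to one worker and return a Future (promise)."""
    return _pool.submit(getattr(worker, funcname), *args)

def transition_to_buffered_all(workers, h5_filepath):
    # Kick off every worker's job before waiting on any of them, so all
    # workers are busy at the same time.
    promises = {name: dispatch(worker, 'transition_to_buffered', h5_filepath)
                for name, worker in workers.items()}
    wait(promises.values())
    return {name: promise.result() for name, promise in promises.items()}
```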

philipstarkey commented 4 years ago

Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington).


Ah, that's a good point. I think that by itself could be accommodated in the current framework by having the coroutine yield a dict of jobs, one per worker, instead of just one; the mainloop can then wait on them all simultaneously with a poll or select call. Since each worker is still only running one function at a time, that's not so different to what we have now. Writing data to HDF5 files at the same time as setting up the next shot, though, would mean workers doing two things at once, so that would be more involved (and not really solved by async/await, since it's cross-process, and since we would want the multiple worker tasks to execute in true parallel, not just the single-threaded concurrency that coroutines get you). Speaking of which, if h5py still holds the GIL during I/O these days, then we won't get a speedup by running it in a separate thread; we'd need to pipe the data to another process...hm...maybe we'd have to write in bulk in a separate process as well as read. Yikes.
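A sketch of the dict-of-jobs idea (illustrative only; the per-worker Connection objects are assumed to already exist in the tab's mainloop):

```python
from multiprocessing.connection import wait

def run_jobs(connections, jobs):
    """Send every worker its job, then wait on all the pipes at once.

    `jobs` maps worker name -> (function_name, args); `connections` maps
    worker name -> multiprocessing Connection to that worker process.
    """
    for name, job in jobs.items():
        connections[name].send(job)
    pending = {connections[name]: name for name in jobs}
    results = {}
    while pending:
        # wait() uses select/poll under the hood and returns whichever
        # connections have a reply ready.
        for conn in wait(list(pending)):
            results[pending.pop(conn)] = conn.recv()
    return results

# In a tab's GUI generator this might then look like:
#     results = yield {'main_worker': ('transition_to_buffered', (h5_filepath,)),
#                      'acquisition_worker': ('transition_to_buffered', (h5_filepath,))}
```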

I'm currently a bit averse to async/await when threads suffice. I investigated it for the new zlock server and it was a) overkill in terms of complexity and b) not very performant. What we're doing with the yield-based generators in the GUI is exactly the kind of thing async/await is intended to cover, but our needs are modest and I suspect we will still want control over our own mainloop as we do now. We could switch to the new syntax though - I believe you can use your own event loop and have async/await syntax be a drop-in replacement for how our coroutines are defined presently. There are also more performant 3rd-party event loops available. Worth thinking about, though; I can't say my aversion is well-justified.
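For what it's worth, here's a minimal sketch of driving an async-def coroutine from our own loop the same way we currently drive yield-based generators with send() (names are hypothetical; this is not the real BLACS mainloop):

```python
class WorkerJob:
    """Awaitable wrapper around a (function_name, args) request to a worker."""

    def __init__(self, funcname, *args):
        self.request = (funcname, args)

    def __await__(self):
        # Yield the request out to the mainloop; whatever the mainloop sends
        # back in becomes the value of the await expression.
        result = yield self.request
        return result

async def example_transition(h5_filepath):
    # Each await hands a job to the mainloop and suspends until it replies.
    return await WorkerJob('transition_to_buffered', h5_filepath)

def run(coro, do_job):
    """A toy mainloop: execute each yielded request and feed the result back."""
    try:
        request = coro.send(None)
        while True:
            request = coro.send(do_job(request))
    except StopIteration as stop:
        return stop.value

print(run(example_transition('shot.h5'), lambda req: f'done: {req[0]}'))
```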

philipstarkey commented 4 years ago

Original comment by Chris Billington (Bitbucket: cbillington, GitHub: chrisjbillington).


Progress on the h5py GIL front!

https://github.com/h5py/h5py/pull/1412

https://github.com/h5py/h5py/pull/1453

It sounds like soon threading will be sufficient to do HDF5 I/O without blocking other threads. Good. Writing a server just to do HDF5 writes would be so far from what HDF5 is supposed to be that we might as well be using a traditional database at that point...

Edit: Just tested with development h5py from github, and indeed threads can run during IO! This is great. It will be in the next release, which might be early to mid 2020 judging by their past releases.
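A crude way to check this (assuming an h5py build new enough to release the GIL during reads and writes): start a large write in a background thread and see whether pure-Python work in the main thread keeps making progress in the meantime:

```python
import threading
import time

import h5py
import numpy as np

data = np.random.random((4000, 4000))  # ~122 MB, generated up front

def slow_write(path):
    with h5py.File(path, 'w') as f:
        f.create_dataset('data', data=data)

writer = threading.Thread(target=slow_write, args=('gil_test.h5',))
start = time.perf_counter()
writer.start()
ticks = 0
while writer.is_alive():
    ticks += 1  # only makes real progress if h5py releases the GIL during I/O
writer.join()
print(f'{ticks} main-thread iterations during a '
      f'{time.perf_counter() - start:.2f} s write')
```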