python-data-acquisition / meta

For laying organizational groundwork

Common API: How should we provide/format output data (specifically from cameras)? #6

Open campagnola opened 4 years ago

campagnola commented 4 years ago

@David-Baddeley writes:

How should we provide/format output data (specifically from cameras)? - This is pretty hard to get right, and my experience with running sCMOS cameras at full frame rate is that a single memcpy() (or, e.g., numpy transpose) matters in terms of final performance. If we write a number of different camera drivers before working out what the abstraction is going to look like we're probably going to end up with various bits of fudge code (and likely copies) in between the drivers and the abstraction. This might kill us when we try and stream at full frame rate.
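To put rough numbers on that concern, here is a short numpy sketch (frame size assumed for a typical 4-megapixel sCMOS sensor; only an illustration of where hidden copies come from):

```python
import numpy as np

# A full sCMOS frame: 2048 x 2048 pixels, 16-bit -> 8 MiB per frame.
# At 100 fps that is ~800 MiB/s, so even one extra copy per frame
# consumes significant memory bandwidth.
frame = np.zeros((2048, 2048), dtype=np.uint16)

# A transpose is only a strided view -- no data is moved:
view = frame.T
assert view.base is frame

# Forcing the transposed layout into contiguous memory *does* copy
# all 8 MiB -- the kind of hidden cost that adds up at full frame rate:
copied = np.ascontiguousarray(frame.T)
assert copied.base is not frame
```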

campagnola commented 4 years ago

My thoughts:

aquilesC commented 4 years ago

Just wanted to link to this repo; they are struggling to come up with a solution that is both efficient and easy for newcomers to pick up.

campagnola commented 4 years ago

@aquilesC I see object proxying in that link; can you clarify how that relates to camera data formatting?

aquilesC commented 4 years ago

Perhaps I should have linked to the proper line. Part of the discussion is in the docstrings of the classes. They struggle with (1) getting fast data transfer rates out of cameras, for which they implemented shared numpy arrays, and (2) teaching new people in the lab how to use shared memory, so they are trying to find a sort of API for it.

campagnola commented 4 years ago

In general I love this approach of using object proxying and shared memory; it makes for very clean multiprocessing in Python. I helped write a similar approach for pyacq several years ago, where we implemented @samuelgarcia's idea of streams.

Still, I think this belongs in a layer above the device drivers. So long as you can instruct the device driver to write directly into your shared memory buffer, you should be able to achieve good separation of concerns without a performance hit.
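A minimal sketch of that separation of concerns, with a hypothetical driver wrapper (`driver_get_frame` and the buffer shapes are invented for illustration):

```python
import numpy as np

def driver_get_frame(out):
    """Hypothetical driver wrapper: a real vendor SDK call would receive
    the raw pointer (out.ctypes.data) and fill the buffer directly."""
    out.fill(1)  # stand-in for the SDK writing pixel data

# The layer above the driver decides what the buffer is -- an ordinary
# array, an mmap, or a slot in a shared-memory ring. The driver only
# needs a contiguous, caller-owned destination to write into.
frame = np.empty((512, 512), dtype=np.uint16)
assert frame.flags['C_CONTIGUOUS']
driver_get_frame(frame)
assert frame[0, 0] == 1
```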

samuelgarcia commented 4 years ago

Hi all. I don't know exactly the purpose of what you are discussing here, but I would be happy to give a presentation of pyacq and the choices we made.

In short, we use proxies (same or different machines) + multiprocessing. For streams we need a very flexible concept: shared memory is one possible scenario, and Python 3.8 adds it to the standard library, so it should be easier now. But in pyacq we also have socket-based (zmq) streams that copy data, which is very useful across different machines. An important point is also to keep the memory layout flexible (transposed or not), because depending on the needs, every approach must be possible within the same framework.

For instance, with multichannel signals, (channel, time) vs (time, channel) is always a big debate between devs. In pyacq both are possible; numpy strides are very helpful for that.

If you are building (or about to build) a package for Python acquisition, pyacq already exists for that. I would be happy to improve/break/refactor everything if it would make other devs happy, to avoid duplicated effort on data grabbing in the Python community.

I am in France but would be happy to have a video call.

David-Baddeley commented 4 years ago

I'd think that shared memory, object proxies, streaming, etc. reside at a higher level. What we should be aiming for is a structure which doesn't preclude them downstream.

As to duplication of effort, that ship might have already sailed: we already have quite mature streaming support in python-microscopy, but baking that into a device driver seems a bit unnecessary.

A FrameData object with support for transformations might not be unreasonable, as long as it was lightweight, didn't perform any transformations by default, and permitted simple access to the underlying frame memory (e.g. as a numpy array) without copying. If you are potentially running at several kHz (entirely possible with an ROI on sCMOS), I'd worry a little about the construction/allocation overhead of a FrameData object.

It's also a little hard to know how to manage things completely without copying whilst still maintaining expected/sane behaviour for anyone using the data downstream (i.e. if you just supply a reference to a slot in the camera's circular buffer, it has an implicit expiry time after which it gets filled with new data and is no longer valid). Putting such a frame on, e.g., a queue to be spooled to disk has obvious potential issues if you run into anything that temporarily slows your spooling. I'll post a brief description of what we currently use and its strengths and weaknesses in the hope that it's useful for stimulating discussion.

David-Baddeley commented 4 years ago

Will also note that we've also played with shared memory arrays (https://github.com/python-microscopy/python-microscopy/blob/master/PYME/util/shmarray/shmarray.py). We've just used them for data analysis, not streaming, and there are a bunch of restrictions on how they can be used, especially on Windows: they are rather sensitive to how processes are forked by multiprocessing and need to be pre-allocated before anything forks. I'm not sure how useful shared memory is in a data-acquisition sense, though. For me, multiprocessing with shared memory is really good for compute-intensive tasks, but not that helpful when things are limited by IO and memory bandwidth (threading is usually better for IO concurrency, and there is not a lot you can do about memory bandwidth).
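For reference, the Python 3.8 stdlib shared memory mentioned earlier in the thread attaches blocks by name, which sidesteps the pre-fork allocation restriction. A minimal in-process sketch (a real consumer would attach from another process):

```python
import numpy as np
from multiprocessing import shared_memory

# multiprocessing.shared_memory (Python 3.8+) attaches blocks by *name*,
# so a worker can map the array after it has been spawned, rather than
# needing everything pre-allocated before the fork.
shm = shared_memory.SharedMemory(create=True, size=16 * 4)
a = np.ndarray((16,), dtype=np.float32, buffer=shm.buf)
a[:] = np.arange(16)

# A second process would attach with SharedMemory(name=shm.name);
# simulated here in the same process:
other = shared_memory.SharedMemory(name=shm.name)
b = np.ndarray((16,), dtype=np.float32, buffer=other.buf)
val = float(b[5])  # reads the value written via `a`

# numpy views must be released before the blocks can be closed
del a, b
other.close()
shm.close()
shm.unlink()
assert val == 5.0
```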

David-Baddeley commented 4 years ago

Current PYME camera API (https://github.com/python-microscopy/python-microscopy/blob/master/PYME/Acquire/Hardware/Camera.py). Forgive the horrible method names, which are a fairly gross throwback to legacy code. Anyway, the camera data handling is implemented in two methods: ExpReady(), which can be polled to see if there is data waiting, and ExtractColour(output), which copies the oldest frame from the camera buffer into a numpy array, output, provided by the user. ExtractColour() would be better named something like get_frame_data(output) (it used to do de-Bayering as well as just getting data).

How this is handled under the hood varies between cameras, depending on how the underlying API is written. The AndorIxon camera class, for example, passes a pointer to the numpy array (output.ctypes.data) directly to an Andor API function, which copies the data from an API-internal circular buffer into the numpy array. The APIs for the sCMOS cameras (Andor and Hamamatsu), however, offload frame-buffer handling to the calling code, so the PYME adapters for these cameras implement their own circular buffers in Python. In this case, the ExtractColour method does a memcpy (not a numpy copy; we call memcpy from the C standard library via ctypes, which is a lot faster, if super gross) between the camera class's circular buffer and the provided output buffer. In both cases, you need to be a bit careful about how you allocate the 'output' array (it needs to be contiguous, with the right byte order and alignment).
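The flat byte-copy described above can be sketched with ctypes.memmove, the stdlib's safe equivalent of memcpy (buffer sizes invented for illustration):

```python
import ctypes
import numpy as np

# One filled slot of a driver-side ring buffer, and the caller-provided
# output frame (both contiguous uint16, as the API requires).
ring_slot = np.arange(1024, dtype=np.uint16)
out = np.empty_like(ring_slot)

# A single flat byte copy via libc -- no numpy iteration, broadcasting,
# or dtype checking on the hot path.
ctypes.memmove(out.ctypes.data, ring_slot.ctypes.data, ring_slot.nbytes)
assert np.array_equal(out, ring_slot)
```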

The good

The bad

The ugly (these are mostly implementation details which are fixable)

David-Baddeley commented 4 years ago

Just a note to the above: the current PYME architecture would allow you to pass, e.g., a shared-memory array to receive data if you really wanted to.

campagnola commented 4 years ago

It's also a little hard to know how to manage things completely without copying whilst still maintaining expected/sane behaviour for anyone using the data downstream (i.e. if you just supply a reference to a slot in the camera's circular buffer, it has an implicit expiry time after which it gets filled with new data and is no longer valid).

@David-Baddeley in our current prototype, the FrameData class can point to any array-like (numpy, shared memory, mmap) and would raise an exception if its data had been overwritten before access. There are a couple of tweaks that could make it better, but I think it could cover any of the cases you described above with good performance. If you have time, I'd love to hear whether you see anything problematic in that architecture.
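For concreteness, one way such an expiry check could look (all names hypothetical; a single-process toy with a sequence counter, not the actual prototype):

```python
import numpy as np

class FrameData:
    """Hypothetical sketch: a zero-copy view into a ring-buffer slot that
    raises if the slot has been recycled before the data was read."""
    def __init__(self, buf, slot, seq, seq_counter):
        self._buf = buf                  # the whole ring buffer
        self._slot = slot                # slot this frame occupies
        self._seq = seq                  # sequence number at write time
        self._seq_counter = seq_counter  # mutable [count] shared with writer

    def data(self):
        # If the writer has lapped this slot, the bytes are no longer ours.
        if self._seq_counter[0] - self._seq >= self._buf.shape[0]:
            raise RuntimeError("frame overwritten before it was read")
        return self._buf[self._slot]

ring = np.zeros((4, 8), dtype=np.uint16)  # 4-slot ring of 8-pixel frames
counter = [0]

def write_frame(values):
    slot = counter[0] % ring.shape[0]
    ring[slot] = values
    fd = FrameData(ring, slot, counter[0], counter)
    counter[0] += 1
    return fd

fd = write_frame(np.arange(8))
assert fd.data()[3] == 3          # still valid

for i in range(4):                # writer laps the ring
    write_frame(np.full(8, i))
try:
    fd.data()
    raise AssertionError("expected expiry")
except RuntimeError:
    pass                          # slot was recycled, access refused
```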