micro-manager / mmCoreAndDevices

Micro-Manager's device control layer, written in C++

New circular buffer/memory model #244

henrypinkard opened this issue 2 years ago

henrypinkard commented 2 years ago

I opened this issue to solicit ideas and begin planning for what is going to replace the circular buffer. @nicost had you already opened an issue related to this? I did a quick search but nothing turned up.

@marktsuchida @nicost @nenada @jondaniels @bls337 @fadero @mattersoflight @ieivanov @edyoshikun @dpshepherd @carlkesselman (+ anyone else interested) would love to hear your thoughts and ideas.

A disorganized, partial list of current limitations and desired features for improvement:

Multi-camera support

Currently we use the same buffer for multiple cameras. Determining which image came from which camera is a nightmare as a result, and the image sizes/types of the cameras need to be the same. It may be possible in some cases to run multiple instances of the core, but this comes with other (device-specific) pitfalls and requires hacks in higher-level code (e.g. in the acquisition engine).

Forced copies

Currently camera adapters copy into the buffer with InsertImage. Higher-level code (Java/Python) then copies the data again to get access to it. A new buffer should ideally be able to handle pointers to image data, so that code can be written in Java/Python that writes to disk (in the C++ layer) without unnecessary copies. This would make the API more complex and require memory management, which would ideally be hidden from most users by being abstracted away by the acquisition engine.

Metadata handling

Metadata is not always necessary for every image in a stream, and may impact performance in some situations (e.g. when image size relative to metadata size is small). Flexibility to turn off metadata in certain situations would be a welcome feature.

Streaming directly to special locations

I can imagine scenarios where one might want to stream data directly to a GPU for real time processing, or stream directly to file/memory-mapped area with no intermediate processing. How feasible would it be to set up a general mechanism for doing this type of thing? What are the pitfalls we might run into?

Exotic camera types

There are a variety of new types of cameras. For example:

How can we make our memory model generic enough to support things like this, without making it overly complex? Is it possible to also make something that would be compatible with a point-scanning system in the future? We have the concepts of multi-component and multi-channel cameras currently. There may be ways to generalize and improve these.

marktsuchida commented 1 year ago

I agree with the need for most of these features.

Regarding elimination of copies, streaming to disk, and streaming to other destinations: I think the best way is to allow the application to do the allocation of buffers. Then a Java direct byte buffer or NumPy array can be used as the destination when the image is copied from the camera. This means that file saving can be written in Java or Python, which seems like a much more practical and flexible approach than having C++ code write to a file (which would be a maintenance burden; we should avoid coding in C++ when it's an option).

In this scheme, popping an image from the Core just means returning a reference, and the Core releasing its reference to the buffer. When the application is done processing the image, it can re-queue the buffer to the Core for next use. (The core should always have a stock of buffers to use; exhausting those is a "sequence buffer overflow".)
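
To make the lifecycle concrete, here is a minimal sketch of that scheme, with all names hypothetical (this is not an actual MMCore API): the application owns the buffers, and the Core only shuffles references between a free pool and a ready queue.

```cpp
// Sketch only: application-owned buffers cycling through a free pool and a
// ready queue. Exhausting the free pool is the new "sequence buffer overflow".
#include <cstdint>
#include <deque>
#include <mutex>
#include <vector>

using Frame = std::vector<uint8_t>;

class SequenceBufferPool {
    std::mutex mutex_;
    std::deque<Frame*> free_;   // buffers the Core/device may write into
    std::deque<Frame*> ready_;  // filled frames waiting for the application
public:
    // Application: hand a new or recycled buffer to the Core.
    void Requeue(Frame* f) {
        std::lock_guard<std::mutex> lock(mutex_);
        free_.push_back(f);
    }

    // Core/device: claim a destination buffer for the next frame.
    Frame* ClaimForWrite() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (free_.empty()) return nullptr;  // overflow: app did not keep up
        Frame* f = free_.front();
        free_.pop_front();
        return f;
    }

    // Core: mark a claimed buffer as filled and ready to pop.
    void MarkReady(Frame* f) {
        std::lock_guard<std::mutex> lock(mutex_);
        ready_.push_back(f);
    }

    // Application: "popping" just returns a reference; no copy is made.
    // The application calls Requeue(f) when it is done processing/displaying.
    Frame* PopNext() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (ready_.empty()) return nullptr;
        Frame* f = ready_.front();
        ready_.pop_front();
        return f;
    }
};
```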

Such buffers could also be (slices of) memory-mapped files (provided that the Core preserves the order of buffers; this can be an optional setting because reusing the most recently used buffer first may offer better performance in some situations). Buffers wouldn't (normally) be re-queued in this case.

If also providing a live display, the buffer for the currently displayed image can be temporarily retained by the GUI, and returned to circulation once that image is no longer being displayed. This can work even with memory-mapped file buffers without making an extra copy.

In the case of small images at high frame rate, it might be necessary to bunch multiple images into each buffer to have high enough throughput (but this seems like an optional enhancement that can be added later).

Letting the application allocate the buffers also means that the application can decide when to release the memory and how much memory to allocate for each camera in each acquisition, and whether to use a single pool of buffers for multiple cameras or separate pools. This all seems much better than trying to teach the Core to handle all possible arrangements.

Regarding InsertImage: if there is going to be a new alternative to this function anyway, it might be worth considering allowing the device, instead of the Core, to perform the copying (perhaps you already meant this?). This is because some camera APIs can copy images to user-provided buffers but require an extra copy to work with the current InsertImage because they don't expose a direct pointer to the driver's internal buffer. It can also eliminate an extra copy for all RGB cameras, which currently need to add a (useless) 4th component. (But then again, the Core should just support regular RGB images in 24- and 48-bit format.)

I don't know what the full constraints would be for streaming to a GPU. Presumably the most theoretically efficient way (aside from the camera talking directly to the GPU, which is probably outside of the scope of MMCore) is to copy from the camera driver's DMA buffer directly to GPU memory, but I don't think this is possible with most camera SDKs. If I'm right about that (definitely not an expert here), then the above scheme of application-allocated buffers should be the next best thing, because there will be only one extra copy (which may be impossible to eliminate anyway) and GPU-accelerated Java and Python libraries can be used. And it won't tie the Core to any specific GPU interface du jour (or worse, multiple interfaces).

Another important thing for high performance is to eliminate polling. Ideally there should be no calls to Sleep() in the device, Core, and application, when using the new interface.

henrypinkard commented 1 year ago

Many helpful design principles for this in: https://github.com/micro-manager/mmCoreAndDevices/issues/323#issuecomment-1433635766

artmeln commented 1 year ago

I'm continuing here the discussion of circular buffers from the perspective of the Data Streamer device. The kinds of physical devices I had in mind while implementing the streamer continuously produce large amounts of data and stream it to the PC. Some hardware produces data that is easy to understand and visualize (like a stream of values from an analog-to-digital converter), while other data will be impossible to make sense of without a significant amount of processing (for example, time-tagged time-resolved photon counters that produce records of photon arrivals). In order to accommodate the broadest range of hardware, I think it is best for the data streamer to rely on a simple circular buffer that does not make any assumptions about what is contained in it.

Another feature of this data is that transfers may not all be of the same size. Therefore, on the PC side, the device adapter should first ask the hardware for the expected size of the data transfer, allocate that amount of memory, request the transfer of data, and note if it received less data than expected. Accordingly, my implementation is very simple: it is the circular buffer provided in the boost libraries, where each entry in the buffer is a pointer to the structure {int expectedDataSize, int actualDataSize, byte[] dataBuffer}; here dataBuffer has length expectedDataSize but only actualDataSize entries are filled.

While I'm not immediately aware of any devices that pass metadata along with each transfer, it is certainly a possibility, so it may be good to add metadata to the circular buffer record of a data streamer. Given that the data streamer is a very general device concept, I'm not sure what specific fields would be needed though.
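
For concreteness, a sketch of that record layout might look like the following (the names are illustrative, not the actual Data Streamer code):

```cpp
// Each circular-buffer slot points to one raw, uninterpreted transfer plus its
// expected and actual sizes, as described above.
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>
#include <boost/circular_buffer.hpp>

struct StreamRecord {
    std::size_t expectedDataSize;      // size requested from the hardware
    std::size_t actualDataSize;        // bytes actually transferred (<= expected)
    std::vector<uint8_t> dataBuffer;   // allocated to expectedDataSize
    // per-transfer metadata could be added here if a device ever needs it
};

// Each buffer entry is a pointer to one record.
boost::circular_buffer<std::shared_ptr<StreamRecord>> streamBuffer(256);
```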

Here are some additional thoughts based on the discussion above:

  1. One circular buffer per camera makes a lot of sense.
  2. Avoiding forced copies of the image data would be very nice. I don’t have any experience in these matters though.
  3. When it comes to exotic camera types, would it make sense to break up the problem into data acquisition and data visualization? A circular acquisition buffer would store acquired frames without really knowing what they are; a processing step would then use this data to construct entries for a circular display buffer. The upper software layer would look at the display buffer and visualize it as soon as data is available. There need not be a 1:1 correspondence between acquisition and display buffers; for example, multiple displayed images could be constructed from a single acquired frame. This model would also support a point-scanning system: a device other than a camera could put data into the display buffer. (A rough sketch of this split follows below.)
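
A rough sketch of the acquisition/display split described in point 3, under the assumption of two independent circular buffers bridged by a device-specific processing step (all names hypothetical):

```cpp
// Acquired frames are opaque to the acquisition buffer; a processing step
// turns each one into zero or more display entries, so the two buffers need
// not have a 1:1 correspondence.
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>
#include <boost/circular_buffer.hpp>

struct AcquiredFrame { std::vector<uint8_t> raw; };                       // uninterpreted data
struct DisplayImage  { std::vector<uint16_t> pixels; int width, height; };

boost::circular_buffer<std::shared_ptr<AcquiredFrame>> acqBuffer(128);
boost::circular_buffer<std::shared_ptr<DisplayImage>>  displayBuffer(32);

// Placeholder for device-specific processing (e.g. unpacking photon records
// or de-interleaving channels); one frame may yield several display images.
std::vector<DisplayImage> DecodeToImages(const AcquiredFrame& f) {
    return { DisplayImage{ std::vector<uint16_t>(f.raw.size() / 2), 0, 0 } };
}

void ProcessLatestFrame() {
    if (acqBuffer.empty()) return;
    std::shared_ptr<AcquiredFrame> frame = acqBuffer.front();
    acqBuffer.pop_front();
    for (auto& img : DecodeToImages(*frame))
        displayBuffer.push_back(std::make_shared<DisplayImage>(std::move(img)));
}
```
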
henrypinkard commented 1 year ago

Regarding elimination of copies, streaming to disk, and streaming to other destinations: I think the best way is to allow the application to do the allocation of buffers. Then a Java direct byte buffer or NumPy array can be used as the destination when the image is copied from the camera.

This makes sense. Though I think this only really matters for Java, because in Python you can use ctypes to wrap existing native memory in the same process (whereas I don't think this is true in Java, though I may be wrong about this).

This means that file saving can be written in Java or Python, which seems like a much more practical and flexible way than having C++ code writing to a file (which will be a maintenance burden; we should avoid coding in C++ when it's an option).

Agree.

In this scheme, popping an image from the Core just means returning a reference, and the Core releasing its reference to the buffer. When the application is done processing the image, it can re-queue the buffer to the Core for next use. (The core should always have a stock of buffers to use; exhausting those is a "sequence buffer overflow".)

Yes.

Such buffers could also be (slices of) memory-mapped files

Related to https://github.com/micro-manager/mmCoreAndDevices/issues/323#issuecomment-1435046759, I'm not clear what, if any, use case there is for using memory mapping, since it doesn't appear to speed up file saving.

(provided that the Core preserves the order of buffers; this can be an optional setting because reusing the most recently used buffer first may offer better performance in some situations). Buffers wouldn't (normally) be re-queued in this case.

Can you say more about this? When might reusing recent buffers boost performance?

Regarding InsertImage: if there is going to be a new alternative to this function anyway, it might be worth considering allowing the device, instead of the Core, to perform the copying (perhaps you already meant this?).

Yes, essentially an alternative to InsertImage, something like GetBuffer(nBytes).

Another important thing for high performance is to eliminate polling. Ideally there should be no calls to Sleep() in the device, Core, and application, when using the new interface.

Yes, good to keep this in mind

henrypinkard commented 1 year ago

Thanks for the many good suggestions @artmeln !

I think it is best for the data streamer to rely on a simple circular buffer that does not make any assumptions about what is contained in it

I think this is a good principle for this buffer in general. Or, for whatever assumptions do exist, have them be configurable enough that they can work with use cases we haven't yet thought of.

Therefore, on the PC side, the device adapter should first ask the hardware for the expected size of the data transfer, allocate that amount of memory, request the transfer of data, and note if it received less data than expected.

One way this could work, consistent with the camera SDKs I've looked at, would be for the camera device adapter to request a buffer of a certain size and then signal when it is done with the block of memory. It sounds like you're advocating for making that "done" call take an optional parameter describing how much of the buffer was actually written to, which seems like a nice (optional) feature.
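
For illustration, a hedged sketch of what such an alternative to InsertImage might look like, building on the GetBuffer(nBytes) idea above (none of these names exist in MMDevice today):

```cpp
// Hypothetical addition to the Core callback interface; sketch only.
#include <cstddef>

class BufferClaimCallback {
public:
    virtual ~BufferClaimCallback() = default;

    // Device adapter: claim a destination block of at least nBytes. The camera
    // SDK (or streaming hardware) can then copy directly into it, avoiding the
    // extra copy that InsertImage currently forces.
    virtual void* GetBuffer(std::size_t nBytes) = 0;

    // Device adapter: signal that the block is filled. bytesWritten is the
    // optional "how much was actually written" parameter, for transfers that
    // come up shorter than expected.
    virtual void FinalizeBuffer(void* buffer, std::size_t bytesWritten) = 0;
};
```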

While I’m not immediately aware of any devices that pass metadata along with each transfer, it is certainly a possibility, so it may be good to add metadata to the circular buffer record of a data streamer. Given that the data streamer is a very general device concept, I’m not sure what specific fields would be needed though.

I think this would have to be left flexible, though maybe with some special standardized keywords for the metadata of particular device types.

When it comes to exotic camera types, would it make sense to break up the problem into data acquisition and data visualization? A circular acquisition buffer would store acquired frames without really knowing what they are; a processing step would then use this data to construct entries for a circular display buffer. The upper software layer would look at the display buffer and visualize it as soon as data is available. There need not be a 1:1 correspondence between acquisition and display buffers; for example, multiple displayed images could be constructed from a single acquired frame. This model would also support a point-scanning system: a device other than a camera could put data into the display buffer.

This is an interesting idea. @marktsuchida suggests something similar in https://github.com/micro-manager/mmCoreAndDevices/issues/323#issuecomment-1457044288. @marktsuchida, as I read it, your comment above:

If also providing a live display, the buffer for the currently displayed image can be temporarily retained by the GUI, and returned to circulation once that image is no longer being displayed.

is maybe advocating for something different here? It would be good to figure out the details of what configuration makes most sense here.

I don't think this should be explicitly hard-coded in any kind of required way though, because the vast majority of users who write their own processing/display code will not want to do it from C++. But it would be interesting to explore whether buffers can be kept general enough that one could create a second generic buffer and use it for display, without needing anything special.

marktsuchida commented 1 year ago

Such buffers could also be (slices of) memory-mapped files

Related to #323 (comment), I'm not clear what, if any, use case there is for using memory mapping, since it doesn't appear to speed up file saving.

I don't remember seeing any evidence that memory-mapped files do not speed up saving. They may not be as fast as unbuffered writing (#323, but have we seen a comparison?), but they could still be useful in some situations (e.g., particular file formats implemented in Python). Also, a special case of memory-mapped files (so to speak) is shared memory.
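
For illustration only, a slice of a memory-mapped file could serve as one of these application-allocated buffers; a POSIX-style sketch follows (shared memory would work the same way via shm_open). This is not part of any proposed MMCore API.

```cpp
// Map a writable slice of a file; offset must be a multiple of the page size.
// The returned pointer could be handed to the Core as a destination buffer,
// so a filled frame is already "in" the file when the device is done with it.
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

void* MapFileSlice(const char* path, off_t offset, std::size_t length) {
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) return nullptr;
    // Ensure the file is large enough to back the requested slice.
    if (ftruncate(fd, offset + static_cast<off_t>(length)) != 0) {
        close(fd);
        return nullptr;
    }
    void* p = mmap(nullptr, length, PROT_READ | PROT_WRITE, MAP_SHARED, fd, offset);
    close(fd);  // the mapping remains valid after the descriptor is closed
    return p == MAP_FAILED ? nullptr : p;
}
```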

The general point is that allowing the application, not the Core, to allocate destination buffers can eliminate extra copies in several potential situations, including Java direct byte buffers. And allocation-by-Core can still be an optional mode under such a design.

(You are right that Python can handle memory allocated by C++ directly and view it as an array, but there are still cases where allowing allocation on the Python side will offer greater flexibility, because it means Python code can use anything to allocate and requires nothing special to manage the buffer lifetime once handed back from the C++ side.)

However, this scheme may also make the buffer design significantly more complex, and given that a separate mechanism (#323) may be available for raw performance, it may not be an absolute requirement. It depends on what use cases you want to support (and BTW writing out fairly concrete usage scenarios might be a good way to determine what constraints need to be satisfied by the design; without this, it is hard to argue for or against a particular design).

(provided that the Core preserves the order of buffers; this can be an optional setting because reusing the most recently used buffer first may offer better performance in some situations). Buffers wouldn't (normally) be re-queued in this case.

Can you say more about this? When might reusing recent buffers boost performance?

If they are small enough, they may fit in L2 or L3 cache. Even if they are not that small, reusing the same address range may decrease TLB misses. Whether this makes a meaningful difference in our use cases I do not know (and can only be known by extensive testing). It is also entirely possible that prefetching may eliminate the difference. I wouldn't worry about this too much, especially at this early stage. When it is easy to do so, it is best to maximize spatial and temporal locality.

marktsuchida commented 1 year ago

@marktsuchida, as I read it, your comment above:

If also providing a live display, the buffer for the currently displayed image can be temporarily retained by the GUI, and returned to circulation once that image is no longer being displayed.

is maybe advocating for something different here? It would be good to figure out the details of what configuration makes most sense here.

Yes, different solution for different problem. When using the unbuffered writing (#323; which I view as a special enhancement not used by everyone every time), the C++ layer needs to dictate memory allocation and realistically the images for display (which should be a small fraction of total data) will probably need to be copied once to decouple from saving. (Theoretically no copy is required, but I suspect it will be complicated to make that work.) In this case, we need at least 2 buffers, one for saving and one for sending images to the app. The latter need only contain a sampling of the images, and the app can choose not to enable it (e.g., if not displaying images in real time).

In contrast, here in this issue we are talking about conventional buffering, without saving in the Core. In this case, (in a typical use case) every single image is forwarded to the Python/Java app and saved there; some may be displayed. To do this without extra copies, the app (Python/Java) needs to hold onto each image buffer for the duration of saving/display, so that it is not overwritten or deallocated, and then release it when done.

This "releasing" is done by returning the finished buffer back to the Core for reuse or deallocation. This is mandatory if the buffer is allocated by the Core, and optional if allocated by the app. But in most cases reusing the buffers is desirable (unless they are special unique buffers, such as memory-mapped files) because allocation is usually very time-consuming.

In this scenario, you could look at it as the second buffer being implemented on the app side. This is similar to how things currently work: every image is popped from the Core and placed in a Datastore (playing the role of the second buffer); a subset is then displayed.

And to tie the two scenarios together: the "second" buffer in the unbuffered writing (#323) case can be the same thing as the regular sequence buffer used for the conventional case.

henrypinkard commented 1 year ago

I don't remember seeing any evidence that memory-mapped files do not speed up saving. They may not be as fast as unbuffered writing (#323, but have we seen a comparison?),

I'm inferring this from 1) no one who has done really fast saving in C++ used memory mapping, and 2) when I was making OME-Tiff/NDTiff I initially tried memory mapping, and the performance was significantly worse than just allocating direct byte buffers.

but they could still be useful in some situations (e.g., particular file formats implemented in Python). Also, a special case of memory-mapped files (so to speak) is shared memory.

I'm definitely not against it, just unfamiliar with what these situations are. If you could provide more detail about this it would be helpful.

The general point is that allowing the application, not the Core, to allocate destination buffers can eliminate extra copies in several potential situations, including Java direct byte buffers. And allocation-by-Core can still be an optional mode under such a design.

(You are right that Python can handle memory allocated by C++ directly and view it as an array, but there are still cases where allowing allocation on the Python side will offer greater flexibility, because it means Python code can use anything to allocate and requires nothing special to manage the buffer lifetime once handed back from the C++ side.)

However, this scheme may also make the buffer design significantly more complex, and given that a separate mechanism (#323) may be available for raw performance, it may not be an absolute requirement. It depends on what use cases you want to support (and BTW writing out fairly concrete usage scenarios might be a good way to determine what constraints need to be satisfied by the design; without this, it is hard to argue for or against a particular design).

Yes, this all makes sense. I agree that it would be helpful to list specific key use cases.

This "releasing" is done by returning the finished buffer back to the Core for reuse or deallocation. This is mandatory if the buffer is allocated by the Core, and optional if allocated by the app.

I'm confused as to how it would be optional if allocated by the app. If a camera device adapter is requesting a buffer to write memory into, how does the Core know whether it still has an application-allocated buffer to write into?

And to tie the two scenarios together: the "second" buffer in the unbuffered writing (#323) case can be the same thing as the regular sequence buffer used for the conventional case.

This and everything else in the comment make sense. It seems appealing to me that display buffers, buffers for unbuffered writing (#323), and buffers that mirror the current setup of copying into the app memory space could all be instances of the same class with different configuration options, unless there is a reason why this cannot be the case.