micro-manager / mmCoreAndDevices

Micro-Manager's device control layer, written in C++

Plan for fast C++ layer file saving #323

Open henrypinkard opened 1 year ago

henrypinkard commented 1 year ago

Somewhat related to https://github.com/micro-manager/mmCoreAndDevices/issues/244

Saving data to disk as fast as possible is an important feature for MM to keep up with the latest scientific cameras, support many cameras in parallel, and stay future-proof against new types of microscopy.

NDTiff has currently been clocked at write speeds of multiple GB/s for hours at a time (on @dpshepherd's NVMe RAID0 setup).

Using some benchmarking code, a Python script for running it, and a Jupyter notebook for analyzing the performance of the various buffers and queues along the way in NDTiff + AcqEngJ, it seems that the main bottleneck is actually writing data to disk, not any of the intermediate serialization/deserialization steps. It also seems that performance maxes out well below the drive's theoretical maximum write speed.

Presumably, this is in part because data is written to disk inefficiently. @edyoshikun has written and shared C++ code that can save ~6.5 GB/s (again on an NVMe RAID setup). It works by concatenating many blocks of image data together and issuing a smaller number of large writes to disk, with a block size optimized for the underlying OS.

This code can fairly easily be adapted into a file-saving class in the C++ layer. The format can be adapted to fit the NDTiff spec, minus the TIFF-specific headers--hence the proposed name "NDBin". This would allow reuse of the existing Java and Python readers and a suite of unit tests.

Once able to write at these speeds, it is possible that in-memory bottlenecks will appear upstream, related to how data is copied into buffers etc. @tomhanak do you have any insights/advice to share?

dpshepherd commented 1 year ago

Last time we did these tests, we did measure NDTIFF successfully writing in the 1-3 GB/s range for hours. I know NDTIFF has seen a lot of changes since then (although it's unclear to me how much changed on the actual file writing side), so we probably should test again.

We recently installed a new NVME RAID0 setup on a PCIe 4.0 card, and can do some new testing using a real camera once the current set of experimental runs is finished next week.

marktsuchida commented 1 year ago

concatenating many blocks of image data together and issuing a smaller number of large writes to disk, with a block size optimized for the underlying OS

Is this with or without the use of memory-mapped files? That seems like an important detail. Intuitively I would imagine that the fastest way would be to map relatively large regions of the destination file (at least megabytes at a time) to memory and copy images in, ideally in sequential order.

Something like this can be tested using pymmcore and numpy.mmap, although one extra copy (and allocation) at the C++ -> Python boundary cannot currently be avoided (this could be eliminated by enhancing the SWIG wrapper). This might also be useful in evaluating whether it is truly necessary to do file saving in the Core (which I don't think is really ideal from a modularity standpoint, although I won't oppose it if shown to be necessary).

Regarding bottlenecks from metadata serialization-deserialization: at least in theory, if you have enough CPU cores and do the writing in a separate thread, with some queuing, then the writing speed should not be affected by ser-des overhead unless the latter takes more time (per image) than the former. In fact, if there is buffering just before the writing step, then everything else will only matter indirectly, potentially through saturation of CPU and memory bandwidth, provided that the writing thread does not synchronize with other threads too frequently. (In the case of writing from Java using anything other than direct byte buffers, there may be additional overhead on the writing thread.)
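
To illustrate (a hypothetical sketch, not existing MMCore code): producers push frames into a bounded queue, and a single writer thread drains it and performs the large writes, so ser-des cost only matters if the producers cannot keep the queue from running empty.

// Hypothetical sketch: a bounded queue decoupling ser-des/producer threads from the writer thread.
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <mutex>
#include <string>
#include <vector>

struct Frame {
    std::vector<uint8_t> pixels;
    std::string metadataJson;
};

class BoundedFrameQueue {
public:
    explicit BoundedFrameQueue(size_t capacity) : capacity_(capacity) {}

    void push(Frame f) {                      // called by acquisition / ser-des threads
        std::unique_lock<std::mutex> lk(mutex_);
        notFull_.wait(lk, [&] { return queue_.size() < capacity_; });
        queue_.push_back(std::move(f));
        notEmpty_.notify_one();
    }

    Frame pop() {                             // called by the single writer thread
        std::unique_lock<std::mutex> lk(mutex_);
        notEmpty_.wait(lk, [&] { return !queue_.empty(); });
        Frame f = std::move(queue_.front());
        queue_.pop_front();
        notFull_.notify_one();
        return f;
    }

private:
    std::mutex mutex_;
    std::condition_variable notFull_, notEmpty_;
    std::deque<Frame> queue_;
    size_t capacity_;
};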

tomhanak commented 1 year ago

We have spent a lot of time on this topic and on validating various approaches with different disk setups. It is nearly impossible to summarize recommendations in one GitHub comment; everything depends on the goals you want to achieve with the new implementation.

Hardware

We have achieved the best results with a HighPoint SSD7101A-1 RAID adapter with 4x Samsung 970/980 Pro NVMe disks in RAID 0. The main point here is the highest sustainable write speed; there are plenty of benchmarks on Tom's Hardware and other portals. On a Dell 5820 (with PCIe 3.0 slots only) we were able to achieve around 9000 MB/s on Windows and Linux. There are PCIe 4.0 RAID adapters for 8 disks that perform even better.

Software

The key from the SW point of view is to avoid all kinds of buffering and copying the application or OS can do. On Windows, open the file with a CreateFile() call and give it the FILE_FLAG_NO_BUFFERING option; on Linux, pass the O_DIRECT flag to the open() call. A platform-specific API must be used, because the standard C/C++ libraries don't expose such options. Without those flags the write throughput is much lower. However, in order to use these flags there are strict requirements on the buffer being written: the buffer start address as well as its size must be aligned to the disk sector size. We use 4 kB alignment, which seems to fit most of the sector sizes in use.
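
For illustration, a minimal Linux sketch of the same idea (not production code; the path and sizes are placeholders, and 4 kB alignment is assumed):

// Illustrative only: open with O_DIRECT and write a sector-aligned buffer.
#include <fcntl.h>     // open(), O_DIRECT (may require _GNU_SOURCE on some toolchains)
#include <unistd.h>    // write(), close()
#include <cstdlib>     // posix_memalign(), free()
#include <cstring>     // memset()

int main() {
    const size_t kAlign = 4096;                 // assumed multiple of the disk sector size
    const size_t kChunk = 64 * 1024 * 1024;     // 64 MB per write, a multiple of kAlign

    // O_DIRECT bypasses the page cache, analogous to FILE_FLAG_NO_BUFFERING on Windows.
    int fd = open("/mnt/raid/stream.bin", O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
    if (fd < 0) return 1;

    void* buf = nullptr;                        // both address and length must be aligned
    if (posix_memalign(&buf, kAlign, kChunk) != 0) { close(fd); return 1; }
    std::memset(buf, 0, kChunk);                // stand-in for concatenated image data

    ssize_t n = write(fd, buf, kChunk);         // fails with EINVAL if misaligned
    (void)n;

    std::free(buf);
    close(fd);
    return 0;
}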

Ideally, only one copy of each image should be made - from the acquisition buffer (e.g. the circular buffer PVCAM fills via DMA) to the well-aligned buffer used for streaming to disk. The copy should be parallelized, at least on Windows. The optimal number of threads should match the number of memory channels supported by the CPU and motherboard, assuming the right number of RAM modules is installed...
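
A rough sketch of such a parallelized copy (illustrative only; the thread count is hard-coded here and would be tuned to the machine):

// Illustrative parallel copy from the acquisition buffer into the aligned staging buffer.
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <thread>
#include <vector>

void parallelCopy(uint8_t* dst, const uint8_t* src, size_t bytes, unsigned nThreads = 4) {
    std::vector<std::thread> workers;
    const size_t chunk = bytes / nThreads;
    for (unsigned i = 0; i < nThreads; ++i) {
        const size_t offset = i * chunk;
        const size_t len = (i + 1 == nThreads) ? bytes - offset : chunk;  // last thread takes the tail
        workers.emplace_back([=] { std::memcpy(dst + offset, src + offset, len); });
    }
    for (auto& t : workers) t.join();
}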

Other observations

Unfortunately, using the right HW with the right API calls and flags isn't sufficient.

henrypinkard commented 1 year ago

Last time we did these tests, we did measure NDTIFF successfully writing in the 1-3 GB/s range for hours. I know NDTIFF has seen a lot of changes since then (although it's unclear to me how much changed on the actual file writing side), so we probably should test again.

We recently installed a new NVME RAID0 setup on a PCIe 4.0 card, and can do some new testing using a real camera once the current set of experimental runs is finished next week.

Thanks @dpshepherd, that would be great! I don't think there have been any substantive changes that would affect performance since then, but it would be nice to get a hard number with the script. It might also be a good idea to run CrystalDiskMark to get an idea of how the performance compares to the maximum achievable by the drive.

henrypinkard commented 1 year ago

Is this with or without the use of memory-mapped files? That seems like an important detail. Intuitively I would imagine that the fastest way would be to map relatively large regions of the destination file (at least megabytes at a time) to memory and copy images in, ideally in sequential order.

@marktsuchida this is without memory mapping. The relevant calls on windows are:

HANDLE hFile = CreateFile(FileName,  // name of the file to write
        GENERIC_WRITE,               // open for writing
        0,                           // do not share
        NULL,                        // default security
        CREATE_ALWAYS,               // always create (overwrite if it exists)
        FILE_FLAG_NO_BUFFERING,      // bypass the OS file cache
        NULL);                       // no attribute template

DWORD dwBytesWritten = 0;
BOOL bErrorFlag = WriteFile(
        hFile,                       // open file handle
        pImageBuffer,                // start of the (sector-aligned) data to write
        dwBytesToWrite,              // number of bytes to write (multiple of sector size)
        &dwBytesWritten,             // number of bytes actually written
        NULL);                       // not using overlapped I/O

To me this endeavor seems like it would be a lot easier to just attempt directly in C++, since:

1) we already have example code;
2) there are a variety of OS-specific calls that would need to be figured out in both Java and Python;
3) it seems from @tomhanak's comment that a lot of the complexity will be handled in the buffers (https://github.com/micro-manager/mmCoreAndDevices/issues/244) anyway.

Regarding bottlenecks from metadata serialization-deserialization: at least in theory, if you have enough CPU cores and do the writing in a separate thread, with some queuing, then the writing speed should not be affected by ser-des overhead unless the latter takes more time (per image) than the former. In fact, if there is buffering just before the writing step, then everything else will only matter indirectly, potentially through saturation of CPU and memory bandwidth, provided that the writing thread does not synchronize with other threads too frequently. (In the case of writing from Java using anything other than direct byte buffers, there may be additional overhead on the writing thread.)

I agree. I got a speedup in NDTiff a while back when I switched serialization to another thread, and I've yet to see evidence that this ever causes a bottleneck (even with the particularly slow serialization of mmcorej.json).

henrypinkard commented 1 year ago

Thank you @tomhanak! This is all extremely helpful. I'm sure more questions will arise as we continue to move forward on this.

henrypinkard commented 1 year ago

More insights from @edyoshikun's implementation:

Summary: The acquisition engine was implemented to simultaneously capture images from all 25 cameras at >100 FPS, which is about 5.76 GB/s across all cameras. The main idea is to write raw image data directly into a RAM swap buffer and then write that RAM buffer directly to disk as a binary file. Once the acquisition is completed, we convert from binary to a .raw file per camera by chunking the binary. The engine creates individual threads per camera, one for reading and one for writing.

We use the Windows SDK file-handling functions for rapid storage onto the SSD, partitioned with 512 B sectors. We bypass the default read/write buffering by writing from pre-allocated memory aligned to the file-system offsets (buffers aligned to the 512 B sector offsets). The buffer size is the number of cameras times the image size, plus 512 - (image size % 512) of padding, so that the pointer arithmetic and memcpy into the pre-allocated memory locations stay aligned. For simultaneous reading and writing, we use ping-pong (swap) buffers that swap every time a buffer fills. A caveat of these swap buffers is that you must tell them when to stop: you need a counter tracking the number of frames written, and once that count is reached we signal the acquisition threads to halt.

Need to disable contents indexed:

Another setting we had to change in the Windows filesystem to make this work was to disable 'contents indexed in addition to file properties'. We found this indexing added unnecessary overhead to the file-saving speed.

Buffers are created with _aligned_malloc

// Their NVMe uses 512 B sectors
uint8_t* buff1 = (uint8_t*)_aligned_malloc(buff_size, 512);
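
To make the sector padding and ping-pong swap concrete, here is a rough sketch with hypothetical names (not the actual implementation; the padding rule is applied per image here):

// Hypothetical sketch: sector-padded buffer sizing plus a fill/flush buffer pair.
#include <cstddef>
#include <cstdint>
#include <malloc.h>    // _aligned_malloc / _aligned_free (Windows CRT)
#include <utility>     // std::swap

constexpr size_t kSector = 512;

size_t paddedSize(size_t imageBytes, size_t nCameras) {
    // Round each image up to the next 512 B boundary so memcpy targets stay aligned.
    size_t perImage = imageBytes + (kSector - imageBytes % kSector) % kSector;
    return perImage * nCameras;
}

int main() {
    size_t buffSize = paddedSize(2048 * 2048 * 2, 25);   // e.g. 25 cameras, 16-bit 2k x 2k frames
    uint8_t* fillBuf = (uint8_t*)_aligned_malloc(buffSize, kSector);
    uint8_t* flushBuf = (uint8_t*)_aligned_malloc(buffSize, kSector);

    // Cameras memcpy frames into fillBuf; when it is full, swap the roles and hand
    // flushBuf to the writer thread, which issues one large unbuffered WriteFile().
    std::swap(fillBuf, flushBuf);

    _aligned_free(fillBuf);
    _aligned_free(flushBuf);
    return 0;
}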

When you pass each buffer to be written to disk, is it appending to a single file, or is it creating a new file each time? Was that a choice that mattered for performance?

We chose to write multiple files in case something got corrupted or the acquisition halted. This ties back to the 1 second of data per file (based on FPS), for simplicity and to avoid calling the writer multiple times.

edyoshikun commented 1 year ago

Yes, I was getting about 7.5-8 GB/s (CrystalDiskMark) using RAID 0 of 3 Samsung 970 Pro NVMe drives that were directly on the motherboard.

Seems like @tomhanak's approach is very similar to ours considering this is probably the best one can do in terms of transferring and writing data.

Key notes:

marktsuchida commented 1 year ago

Ah, it makes sense that writing with no buffering would be better than memory mapping. That is a crucial detail!

Since you are proposing to store raw data without TIFF format (something I missed earlier since NDTiff is mentioned so many times :), I have no problems with this living in MMCore.

In fact, it should live in MMCore for buffering reasons. Presumably the writing will happen at its own pace, and images for display (if desired) will be sampled at a configurable (or feedback-regulated) interval and placed in a second sequence buffer for the application to retrieve at its leisure.

It will be good to look up the corresponding system calls on Linux (O_DIRECT?) and make sure to design for eventual Linux support. Probably not that hard, just need to work with potentially different write block sizes, etc.

Also, the nice thing about raw array files is that they can become Zarr datasets just by adding the appropriate metadata files. It would be nice if the MMCore API that performs this saving is designed so that it is convenient for the caller to do this (not sure if anything special is needed; just bringing up the possible use case). Similarly, adding the metadata files to construct an "NDTiff" dataset can also be left to Java or Python client code, where it will be much more comfortable to do than in C++ (not sure if extending NDTiff to do non-TIFF files has any advantage over using something Zarr-based).

Perhaps not a priority for the first iteration, but if saving per-frame metadata (for example, a timestamp produced by camera hardware), it might make sense to have MMCore save it, also to raw array files (which can also become part of, say, a Zarr dataset). Given that many of our camera adapters produce per-frame metadata that is not so useful to save in a high-data-rate scenario, it probably makes sense for the application code to select the metadata keys to record in such a way. Alternatively, streaming metadata to Python/Java may also work.

henrypinkard commented 1 year ago

Also, the nice thing about raw array files is that they can become Zarr datasets just by adding the appropriate metadata files. It would be nice if the MMCore API that performs this saving is designed so that it is convenient for the caller to do this (not sure if anything special is needed; just bringing up the possible use case). Similarly, adding the metadata files to construct an "NDTiff" dataset can also be left to Java or Python client code, where it will be much more comfortable to do than in C++ (not sure if extending NDTiff to do non-TIFF files has any advantage over using something Zarr-based).

I think Zarr and the variant of NDTiff I'm proposing (which doesn't actually have anything to do with Tiff -- maybe NDRaw is a better name?) are essentially the same thing: blocks of data in files, and some index file describing where everything is. The advantages of using the ND* library, as I see it, are that all of the NDTiff-compatible code already works with it, so it is integrated into tests, data-loading classes, etc. This would be more a contraction than an expansion of the format/library--just removing the Tiff metadata, which is not used by any of our codebase for accessing data anyway. Though I'm not familiar with all the details of the latest version of zarr, in theory it probably wouldn't be too difficult to have the data be accessible in either format (just write a different or second index file).

marktsuchida commented 1 year ago

From a different viewpoint, Zarr is used in so many other places and by so many other people that it will open up far more opportunities for interoperability than ND* will in the short term. I'm generally against reinventing the wheel where possible (and wonder if your NDRaw could not be implemented as a specific case of Zarr). But Zarr may not always be applicable either.

The major point I was trying to make is that file metadata formats (whether Zarr or NDRaw) are a higher-level concern and can be handled separately and outside of the Core. It seems prudent to avoid hard-coding a specific format in the Core (imagine how inconvenient it would have been if we had hard-coded TIFF into the Core 10 years ago -- at that time we would have had a good performance argument for doing so). Roughly speaking, the Core API can allow user code to choose the size, chunking, etc., and file naming pattern (for the raw array files), and provide feedback on what was actually saved (in case acquisition was interrupted). Then the app can add the necessary NDRaw or Zarr metadata files to complete the dataset.

henrypinkard commented 1 year ago

From a different viewpoint, Zarr is used in so many other places and by so many other people that it will open up far more opportunities for interoperability than ND* will in the short term. I'm generally against reinventing the wheel where possible (and wonder if your NDRaw could not be implemented as a specific case of Zarr).

I think it probably could be (or maybe already is?) a specific case of Zarr. @cgohlke posted code the other day showing how to open it with zarr, so maybe this is a moot point.

The major point I was trying to make is that file metadata formats (whether Zarr or NDRaw) are a higher-level concern and can be handled separately and outside of the Core. It seems prudent to avoid hard-coding a specific format in the Core (imagine how inconvenient it would have been if we had hard-coded TIFF into the Core 10 years ago -- at that time we would have had a good performance argument for doing so). Roughly speaking, the Core API can allow user code to choose the size, chunking, etc., and file naming pattern (for the raw array files), and provide feedback on what was actually saved (in case acquisition was interrupted). Then the app can add the necessary NDRaw or Zarr metadata files to complete the dataset.

I see what you mean, and I agree on avoiding hard-coded assumptions at the lowest level. However, one issue I see with this is that the module is incomplete without a component from another language. I think there is a major advantage to having a default indexing mechanism, both for programmer ease of use, and to prevent datasets being rendered unusable if a user messes up something in a higher level language that loses track of which images are where.

Hard-coding TIFF is not a perfect analogy here, because you cannot "un-TIFF" a TIFF file: it has a more restricted format, and there is metadata interspersed throughout the file along with the image data. In contrast, what I'm describing here is just having a default index file that is written by the C++ layer alongside the images/metadata. This would be a pretty small amount of code (so little to no maintenance burden), and it could easily be turned off and handled by a higher-level language if desired, without affecting how the images and metadata are written.

marktsuchida commented 1 year ago

In that case I would vote for Zarr, because there are libraries in many languages to read it, and we don't need to do anything to maintain it. But I'm more worried about the metadata format we adopt on top of Zarr: it would be best if the Core only saves the bare minimum (mostly just array shape and datatype). It should be the app that adds any acquisition information (or whatever else NDTiff adds), whether by modifying the Zarr JSON afterwards or by providing extra metadata to the Core (which will only know that they are Zarr attributes). Putting knowledge of this stuff in the Core would probably make it cumbersome to evolve. I think this makes sense anyway, because it's the app, not the Core, that has this information to begin with.

If that doesn't sound sufficient, then I think I need more details on what you mean by "default index".

marktsuchida commented 1 year ago

A slightly separate topic: Given that buffers for saving need to be sector-aligned, one problem is that if the size of a single image is not an exact multiple of the sector size, it will either need to be padded (incompatible with Zarr?), or else the smaller-than-sector remainder will need to be copied to a new buffer and saved together with the next image. I don't see any problem with the latter approach (especially since multiple images may be combined before saving anyway); just something to be aware of.
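
A minimal sketch of that carry-over approach (hypothetical helper; writeAligned stands in for the actual unbuffered write call):

// Hypothetical sketch of carrying the smaller-than-sector remainder over to the next write.
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>

// Writes the largest sector-aligned prefix of the staging buffer, then moves the
// unaligned tail to the front so it is saved together with the next image.
// Returns the new fill level of the staging buffer.
size_t flushAligned(uint8_t* staging, size_t filled, size_t sectorSize,
                    const std::function<void(const uint8_t*, size_t)>& writeAligned) {
    const size_t aligned = (filled / sectorSize) * sectorSize;
    if (aligned > 0)
        writeAligned(staging, aligned);
    const size_t remainder = filled - aligned;
    std::memmove(staging, staging + aligned, remainder);   // keep the tail for the next flush
    return remainder;
}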