ome / bioformats

Bio-Formats is a Java library for reading and writing data in life sciences image file formats. It is developed by the Open Microscopy Environment. Bio-Formats is released under the GNU General Public License (GPL); commercial licenses are available from Glencoe Software.
https://www.openmicroscopy.org/bio-formats
GNU General Public License v2.0

ZeissCZIReader performance issue with Lattice Light Sheet #3839

Closed NicoKiaru closed 1 year ago

NicoKiaru commented 2 years ago

Hello everybody,

We recently acquired a Lattice Light Sheet (LLS) system from Zeiss, and a single CZI file can reach a TB without much difficulty (for instance: 2 channels, 1024 x 450 pixels, 2000 slices, 300 timepoints).

That's big, but normally these two gigantic stacks can be opened virtually in ImageJ without much trouble.

The problem comes from the initialisation of the ZeissCZIReader: it just takes forever. I don't know exactly how long it took for a TB file. The only thing I know is that it is less than a night, because I opened the file in the evening and it was open the next morning...

I started to monitor the performance of the reader, and the first bottleneck is in readSegments(id). Maybe it is the only bottleneck; I can't tell, because I was not patient enough to let it finish. A quick estimate of the readSegments execution time for my TB file is ~3 hours.

Do you know if there's a way to avoid this long step at the beginning? Do you know how the reader in Zen works? Is there a possibility to mimic it?
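For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of a timing helper. The class and method names (`InitTimingSketch`, `timeMillis`) are made up for illustration, and the sleeping workload is only a placeholder: in a real test the timed step would wrap the reader initialisation (e.g. a call to `IFormatReader.setId(String)`, with its checked exceptions handled inside the lambda).

```java
import java.util.concurrent.TimeUnit;

public class InitTimingSketch {

    /** Runs the given step once and returns the elapsed wall-clock time in milliseconds. */
    static long timeMillis(Runnable step) {
        long start = System.nanoTime();
        step.run();
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        long elapsed = timeMillis(() -> {
            // Placeholder workload (assumption). Replace with the step to measure,
            // for example the reader initialisation on a CZI file.
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        System.out.println("Step took " + elapsed + " ms");
    }
}
```

Wrapping individual stages this way (rather than the whole open call) is what lets one stage, such as readSegments, be singled out as the slow one.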

ping @sebi06 @dgault

Opening the file in Zen takes around 5 seconds.

NicoKiaru commented 2 years ago

All the time is spent in NIOByteBufferProvider.allocate

[image: profiler screenshot]

dgault commented 2 years ago

Was that screenshot of the profile for a smaller file which completed or just a sub section of the init time for the larger file? We would probably need to do some thorough profiling to confirm if that is the only bottleneck.

At the moment the Bio-Formats reader saves the startPosition for each segment and, when reading each one, opens a stream and seeks to that position. It may be possible to avoid a lot of that seeking, but I would need to check to confirm, and it is probably not a small change.

NicoKiaru commented 2 years ago

Hello @dgault ,

> Was that screenshot of the profile for a smaller file which completed or just a sub section of the init time for the larger file?

The profiling was done on a sub-section of the init time of a large file, which did not complete. Happy to help if possible. I have a gigantic TB file, but I can make a manageable subset of it and share it. I also have a branch with a few logs here and there to test the opening speed.

NicoKiaru commented 1 year ago

I think the problem comes from the reader, which reads all blocks of the file linearly. When going through TB-size files, that's not efficient. I'm digging through the current reader implementation and through the specs at https://zeiss.github.io/libczi/index.html, and I'll try to come up with a more efficient indexing by using the information located in the sub-block directory. That's probably going to be painful given the size of this reader and the number of use cases it covers, but I'll give it a shot.
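To illustrate the idea (not the actual CZI layout, and not the ZeissCZIReader code), here is a minimal sketch on a made-up container format: an 8-byte directory offset, then length-prefixed blocks, then a directory listing every block offset. All names and the layout itself are assumptions for the example. A linear scan has to visit every block header to build the index, while a directory-based scan jumps once and reads the offsets contiguously.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class DirectoryIndexSketch {

    /** Block offsets found by a scan, plus how many separate jumps it needed. */
    static final class Index {
        final List<Long> offsets = new ArrayList<>();
        int seeks = 0;
    }

    // Hypothetical layout (NOT the CZI format):
    //   [8 bytes: directory offset]
    //   blocks:    [8 bytes: payload length][payload]
    //   directory: [4 bytes: block count][count x 8 bytes: block offsets]
    static ByteBuffer buildFile(int blockCount, int payloadSize) {
        int dirOffset = 8 + blockCount * (8 + payloadSize);
        ByteBuffer file = ByteBuffer.allocate(dirOffset + 4 + 8 * blockCount);
        file.putLong(dirOffset);
        long[] offsets = new long[blockCount];
        for (int i = 0; i < blockCount; i++) {
            offsets[i] = file.position();
            file.putLong(payloadSize);
            file.position(file.position() + payloadSize); // payload left as zeros
        }
        file.putInt(blockCount);
        for (long offset : offsets) {
            file.putLong(offset);
        }
        return file;
    }

    /** Walks the file block by block: one jump per block header. */
    static Index linearScan(ByteBuffer file) {
        Index index = new Index();
        long dirOffset = file.getLong(0);
        int pos = 8;
        while (pos < dirOffset) {
            index.seeks++;
            long payloadLength = file.getLong(pos);
            index.offsets.add((long) pos);
            pos += 8 + (int) payloadLength;
        }
        return index;
    }

    /** Jumps once to the directory and reads all offsets contiguously. */
    static Index directoryScan(ByteBuffer file) {
        Index index = new Index();
        int pos = (int) file.getLong(0);
        index.seeks++;
        int count = file.getInt(pos);
        pos += 4;
        for (int i = 0; i < count; i++) {
            index.offsets.add(file.getLong(pos + 8 * i));
        }
        return index;
    }

    public static void main(String[] args) {
        ByteBuffer file = buildFile(1000, 64);
        System.out.println("linear scan seeks:    " + linearScan(file).seeks);
        System.out.println("directory scan seeks: " + directoryScan(file).seeks);
    }
}
```

On an in-memory buffer both approaches are fast; the difference matters when each jump is a disk seek spread across a TB-size file.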

dgault commented 1 year ago

Thanks @NicoKiaru. If you think you have a potential solution, feel free to open a PR and we can get it tested against the existing datasets that we have.

NicoKiaru commented 1 year ago

Very complicated this reader... Pretty sure I won't be able to do something as general as what it currently supports.

Do you think one of you could detail the logic behind the openBytes method a little bit? I'm really having a hard time understanding what's happening.

https://github.com/ome/bioformats/blob/b660fdc08ced3a6b7307d0ea6bd8885e405c82e8/components/formats-gpl/src/loci/formats/in/ZeissCZIReader.java#L340-L498

melissalinkert commented 1 year ago

Most of the scary-looking logic in openBytes is for handling whole slide data and/or the results of tile stitching. In these cases there will be a bunch of "extra" subblocks that openBytes shouldn't care about - the subblocks stored in the files are not just the pixel data tiles alone, there will be additional ones that define (but don't have pixel data for) each pyramid resolution etc. openBytes is also only reading pixel data from subblocks that represent tiles within the requested region, but for whole slide/tiled data the number of tiles to read isn't predictable for a fixed area. Tiles often overlap unpredictably, which is another part of why openBytes iterates over every SubBlock.
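As a rough illustration of that selection step, here is a simplified sketch. It is not the actual openBytes code; the `Block` class, its fields, and `blocksToRead` are hypothetical stand-ins. It walks every block, skips the ones that carry no pixel data, and keeps only those whose bounds intersect the requested region. Because tiles can overlap, more than one block may contribute to the same area.

```java
import java.util.ArrayList;
import java.util.List;

public class TileSelectionSketch {

    /** Hypothetical stand-in for a subblock: a bounding box plus a pixel-data flag. */
    static final class Block {
        final int x, y, w, h;
        final boolean hasPixels;

        Block(int x, int y, int w, int h, boolean hasPixels) {
            this.x = x;
            this.y = y;
            this.w = w;
            this.h = h;
            this.hasPixels = hasPixels;
        }

        /** True if this block overlaps the rectangle (rx, ry, rw, rh). */
        boolean intersects(int rx, int ry, int rw, int rh) {
            return x < rx + rw && rx < x + w && y < ry + rh && ry < y + h;
        }
    }

    /** Iterates over every block and keeps the pixel-bearing ones that touch the region. */
    static List<Block> blocksToRead(List<Block> all, int rx, int ry, int rw, int rh) {
        List<Block> selected = new ArrayList<>();
        for (Block block : all) {
            if (!block.hasPixels) {
                continue; // e.g. a block that only describes a pyramid resolution
            }
            if (block.intersects(rx, ry, rw, rh)) {
                selected.add(block);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<Block> blocks = new ArrayList<>();
        blocks.add(new Block(0, 0, 100, 100, true));     // tile
        blocks.add(new Block(90, 0, 100, 100, true));    // overlapping neighbour
        blocks.add(new Block(200, 200, 100, 100, true)); // far-away tile
        blocks.add(new Block(0, 0, 400, 400, false));    // no pixel data
        System.out.println(blocksToRead(blocks, 80, 10, 30, 30).size() + " blocks to read");
    }
}
```

For a regular, non-overlapping acquisition the set of blocks for a plane could be looked up directly instead of filtered, which is the kind of special case suggested in the next paragraph.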

All of this obviously varies quite a bit based on the imaging modality and actual acquisition settings. You'll note that in various points throughout the reader, there are some special cases for specific imaging types (isPALM, scanDim/validScanDim for some tiled data, etc.). If there is a reliable way to detect lightsheet data from the metadata, a similar special case in initFile/openBytes might make things easier for now.

NicoKiaru commented 1 year ago

Quick update on this issue: there is (will be) an alternative reader which should parse the metadata faster. Its implementation is in https://github.com/BIOP/quick-start-czi-reader . It will be added as an external reader as soon as this PR is accepted.

Long-term goal: if the new reader works well and can open CZI files as well as, and in a similar enough way to, the old reader, the new one may replace the old one.

NicoKiaru commented 1 year ago

I do not think it's necessary to keep this issue open; see the message above.