mehta-lab / recOrder

3D quantitative label-free imaging with phase and polarization
BSD 3-Clause "New" or "Revised" License

[FEATURE] io: output data in tiff format. #276

Closed mattersoflight closed 1 year ago

mattersoflight commented 1 year ago

Problem @talonchandler alerted me that the data in zarr format is often not usable, because it is not accepted as input by the widely used image analysis pipelines that consume recOrder's output.

Proposed solution

A) Provide an option to write the data in ome.tiff format, while avoiding massive ome.tiff datasets that are split across many files. Our original reason for avoiding the tiff format was that partial loading of large ome.tiff data is inherently slow.

OR

B) Continue to invest effort in the zarr format and write thin layers that make the data usable in other image analysis pipelines (a minimal sketch of such a layer follows). Document on the wiki how to work with the zarr format for different use cases as we encounter them, but encourage users to switch to a more modern file format.
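For illustration, here is a minimal sketch of what a "thin layer" under option B could look like, assuming a hypothetical store path (`dataset.zarr`) and group/array names (`Pos_000/arr_0` are placeholders, not recOrder's actual layout):

```python
import zarr

# A minimal sketch, assuming a hypothetical store layout; it exposes the
# zarr hierarchy so a tiff-only pipeline can pull out plain numpy arrays.
root = zarr.open("dataset.zarr", mode="r")  # lazy: no pixel data read yet
print(root.tree())                          # browse positions/channels

# Hypothetical group/array names; slicing reads only the requested chunks.
volume = root["Pos_000/arr_0"][0, 0]        # one (Z, Y, X) volume as numpy
```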

Additional context @talonchandler can you add links to some datasets we can browse to assess whether option B is a viable path?

talonchandler commented 1 year ago

I will nominate this ~1 TB dataset for benchmarking and testing; it contains both ome.tiff and zarr formats: /hpc/projects/compmicro/rawdata/hummingbird/Janie/2022_03_15_orgs_nuc_mem_63x_04NA/

As a variant to (A): does the lab have any experience with the BigTiff format? If you use tifffile.imwrite to write a .tif file larger than 4 GB, it will automatically write the file in BigTiff format (keeping the .tif suffix), as in the sketch below. I've had good read/write experiences with this format on files as large as 100 GB, since tifffile plays so nicely with it.
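A minimal sketch of that write path (the shape and the file name `position_000.tif` are placeholders for a real reconstructed position):

```python
import numpy as np
import tifffile

# Hypothetical single-position stack; recOrder's real output would be a
# reconstructed (T, C, Z, Y, X) array.
stack = np.zeros((2, 4, 16, 512, 512), dtype=np.uint16)

# tifffile switches to BigTIFF automatically when the data cannot fit in a
# classic 4 GB TIFF; bigtiff=True forces it regardless of size.
tifffile.imwrite("position_000.tif", stack, bigtiff=True)
```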

Regarding (A): I 100% agree that loading large ome.tiff data is slow... even the (very good!) WaveorderReader took about 90 seconds to construct a reader for the dataset above, since it had to traverse the whole dataset (constructing the zarr reader took only a few seconds).

But I might suggest that both of these formats have tradeoffs compared to saving each position as a BigTiff. ome.tif files are subject to the 4 GB constraint, so a position might be split across multiple files that you can't easily drag-and-drop (except into micromanager, AFAIK). zarr is best in class for remote access to large datasets, but we're finding that those use cases aren't that frequent and that our users' pipelines rarely support it. I think the BigTiff format hits a sweet spot: (1) it supports easy browsing with drag-and-drop into napari, ImageJ, etc.; (2) it has a good loading-speed tradeoff (a few seconds to read a single position); and (3) it is very well supported by our users' tools.

I have converted the dataset above into BigTiffs at this location if you would like to test the result: /hpc/projects/compmicro/sandbox/Talon/testio/all_21_3

mattersoflight commented 1 year ago

After the discussion with the group, we decided to write the data as tiff stacks, each with XYZ dimensions, split across channels, time points, and positions. We should assess a) how to load this data lazily and efficiently via dask for visualization in napari (see the sketch below) and b) how to load this data in Fiji as a hyperstack.
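For (a), a minimal sketch of the lazy-loading pattern, assuming hypothetical file names with one XYZ stack per (time, channel) under a per-position directory:

```python
from glob import glob

import dask.array as da
import napari
import tifffile
from dask import delayed

# Hypothetical layout: one XYZ BigTiff per (time, channel) for a position.
files = sorted(glob("position_000/t*_c*.tif"))
sample = tifffile.imread(files[0])  # read one stack to learn shape/dtype

# Wrap each on-disk stack in a delayed read so nothing loads until viewed.
lazy = [
    da.from_delayed(delayed(tifffile.imread)(f), shape=sample.shape, dtype=sample.dtype)
    for f in files
]
volume = da.stack(lazy)  # (file, Z, Y, X); reshape to (T, C, Z, Y, X) as needed

napari.view_image(volume)  # napari pulls chunks lazily as you scroll
napari.run()
```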

We need to incorporate the OME-TIFF tags necessary for Fiji to parse this data; a sketch of one way to write them follows.
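A minimal sketch using tifffile (the shape and file name are placeholders): writing to a `.ome.tif` path makes tifffile emit OME-XML metadata, which Fiji's Bio-Formats importer uses to assemble a hyperstack with the right axis order.

```python
import numpy as np
import tifffile

stack = np.zeros((16, 512, 512), dtype=np.uint16)  # hypothetical XYZ stack

# The .ome.tif suffix triggers OME-XML in the first page's ImageDescription
# tag; the axes hint tells readers how to interpret the dimensions.
tifffile.imwrite(
    "position_000.ome.tif",
    stack,
    metadata={"axes": "ZYX"},
)
```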

mattersoflight commented 1 year ago

This issue has been copied to https://github.com/czbiohub/iohub/blob/main/docs/transferred_issues using the gh2md package. We will continue the discussion there. Closing here.