efficient and intuitive data reader & writer module

mattersoflight commented 3 years ago

Current decisions:

New pipeline (recOrder/waveOrder) won't support MM 1.4.22.
We will save data in performant multi-page/ND image formats: ome.tif for acquisitions with mm2-gamma, .zarr for analysis.
We will add readers to allow re-analysis of existing data (single-page tiffs) with new algorithms. If we have other infrequent or edge cases, we can write simple converters to ome.tif data.
(subject to further discussion) Plots meant for exploration need to be easy to browse. We have settled on .png for now.
All of our current data is either in ome.tif or single page tiffs. We will focus on them for now, and not consider other multi-page tiff formats.

Current to do list:

[ ] write a reader for multi-page ome.tif acquisitions.
[ ] compute with calls to waveorder.
[ ] write output to .zarr
[ ] write a reader for single-page ome.tif.

Description at the start:

AICSimage reader combines features of intensity data structure, metadata module, and allows intuitive slicing. Is it the best metadata parser for OME-TIFF? It is slow at slicing the data.

The reader should be compatible with MM2-gamma acquisitions we have from QLIPP and Falcon in this version.

Storing all the data (image volumes, Stokes volume, physical volume) in ND-arrays makes the most sense.

@ieivanov can review and test with Falcon data.

mattersoflight commented 3 years ago

We are leaning towards zarr or OME-TIFF to store all image data other than input. Zarr's compression is reported to range 1.5x - 16x. @bryantChhun can you investigate?

ieivanov commented 3 years ago

We need a reader that can efficiently extract specific slices from a large dataset (e.g. channel 1, z slices 3-13, all (5) time points) and return data with predictable dimentions. AICSimage uses STCZYX order, which I like. For convenience, we may choose to drop S (aka position) and only use TCZYX. Empty dimensions will have size one.

aicsimageio does all that. I had a problem once where I asked for a subset of a large dataset, but aicsimageio first loaded the entire dataset and then returned only the slices I asked for. That shouldn't not be the case, we should see if it's still a problem.

The reader should also access the image metadata. OME-TIFF files store metadata internally (there is no separate metadata file). aicsimageio can read the OME-TIFF metadata. I extended the reader in aicsimageio to also return micromanager metadata: https://github.com/czbiohub/aicsimageio/tree/feature/micromanager_reader. The metadata reader may not be fully debugged.

aicsimageio does not currently provide a writer class. I've been writing ome.tif stack using tifffile. It works well for the most part. One problem is that it is not easy to append data to an existing ome.tif file. New data is written to a separate series, rather than extending the current series. Appending data to the current series required modification of the ome-tiff metadata that's already written. It can be done, but it's not straightforward.

With limited experience I've found that it's easy to append data to a zarr file. Do zarr files support storing external metadata?

mattersoflight commented 3 years ago

@bryantChhun, @ieivanov zarr metadata is a JSON object. https://zarr.readthedocs.io/en/stable/spec/v1.html#metadata. The format requires some fields, and allows extra fields. We can append all of OME-TIFF metadata (also JSON). The N5-viewer parses JSON (https://github.com/saalfeldlab/n5-viewer).

bryantChhun commented 3 years ago

We are leaning towards zarr or OME-TIFF to store all image data other than input. Zarr's compression is reported to range 1.5x - 16x. @bryantChhun can you investigate?

Exactly how Loic gets 16x compression is not clear. He'll have to chime in. It sounds like there are some caveats to using aicsimageio vs straight zarr+custom JSON metadata. What kind of tests need to be performed or questions answered to zero in on one?

Another approach is to abandon fancy structures like aicsimageio and simply build a dictionary that maps coordinates to filenames, then introduce performant data structures later.

mattersoflight commented 3 years ago

@bryantChhun good test cases for reader/writer module are:

read a XYZCT (C=polarization channels) from @ieivanov's experiments, as well as the instrument tensor (XYC), apply the instrument tensor to compute Stokes volumes (XYZCT), and save them as zarr files.
read XYZCT data from your experiment on microglia, apply the instrument matrix, and save Stokes volumes as zarr.

Are both datasets compliant with OME-TIFF? I think it is optimal that the reader maps files to dimensions of acquisition using metadata, and then allows slicing into ND-data, which is what aicsimageio does.

I think the best thing to do is to track down the issue that makes aicsimageio inefficient. If it can be fixed, we should fix it here: https://github.com/czbiohub/aicsimageio/tree/feature/micromanager_reader, and generate pull request.

We can carry metadata with zarr and add processing parameters to it, which can be useful provenance.

bryantChhun commented 3 years ago

read a XYZCT (C=polarization channels) from @ieivanov's experiments, as well as the instrument tensor (XYC), apply the instrument tensor to compute Stokes volumes (XYZCT), and save them as zarr files.

read XYZCT data from your experiment on microglia, apply the instrument matrix, and save Stokes volumes as zarr.

I like these two use cases for initial testing. But I also see the tensor/matrix implementation and stokes compute as out of scope for the reader/writer module. Those functions should belong in compute?

Are both datasets compliant with OME-TIFF? I think it is optimal that the reader maps files to dimensions of acquisition using metadata, and then allows slicing into ND-data, which is what aicsimageio does.

Sounds good. the microglia dataset was pure individual-tiffs. The portion of the reader that loads this data type could convert them into aicsimageio data format or OME-TIFF object? I'll look into it.

I think the best thing to do is to track down the issue that makes aicsimageio inefficient. If it can be fixed, we should fix it here: https://github.com/czbiohub/aicsimageio/tree/feature/micromanager_reader, and generate pull request

I'll explore aicsimageio a bit more to see what i can do.

We can carry metadata with zarr and add processing parameters to it, which can be useful provenance.

Ah, this is another approach that is parallel to aicsimageio. I'll explore this.

look into modules to convert micro-manager individual tiffs into an OME-TIFF object, a zarr object, or aicsimageio
Explore the key problems with aicsimageio: large data loading, lack of file writer.
Compare aicsimageio with zarr+metadata approach

mattersoflight commented 3 years ago

I like these two use cases for initial testing. But I also see the tensor/matrix implementation and stokes compute as out of scope for the reader/writer module. Those functions should belong in compute?

I agree, these functions belong to compute. I like the idea of writing functions in each module, unless they need to be a class.

Sounds good. the microglia dataset was pure individual-tiffs. The portion of the reader that loads this data type could convert them into aicsimageio data format or OME-TIFF object? I'll look into it.

Did you acquire microglia data with mm2-gamma? We can find another experiment if that was with mm2-beta and the metadata parsers need to be different. It will be awesome if aicsimageio can return an object that can represent acquisition saved as single page TIFFs.

I think the best thing to do is to track down the issue that makes aicsimageio inefficient. If it can be fixed, we should fix it here: https://github.com/czbiohub/aicsimageio/tree/feature/micromanager_reader, and generate pull request

I'll explore aicsimageio a bit more to see what i can do.

We can carry metadata with zarr and add processing parameters to it, which can be useful provenance.

Ah, this is another approach that is parallel to aicsimageio. I'll explore this.
1. look into modules to convert micro-manager individual tiffs into an OME-TIFF object, a zarr object, or aicsimageio
👍

Explore the key problems with aicsimageio: large data loading, lack of file writer. 👍

Compare aicsimageio with zarr+metadata approach

On point 3 - the data that comes out of QLIPP and Falcon will be OME-TIFF and not zarr. I am suggesting zarr for all intermediate and output files. Plots meant for exploration should be png.

bryantChhun commented 3 years ago

Did you acquire microglia data with mm2-gamma? We can find another experiment if that was with mm2-beta and the metadata parsers need to be different. It will be awesome if aicsimageio can return an object that can represent acquisition saved as single page TIFFs.

I have one dataset on gamma (october this year), but all other datasets were on 1.4.22. Agreed about the single page TIFFS to AICSImageIO. I'll see if there's an easy converter or importer. The metadata is all there.

On point 3 - the data that comes out of QLIPP and Falcon will be OME-TIFF and not zarr. I am suggesting zarr for all intermediate and output files. Plots meant for exploration should be png.

Ah, ok so AICSImageIO might be just an internal data structure that isn't responsible for writing, only reading. This makes things easier I think...

ieivanov commented 3 years ago

On point 3 - the data that comes out of QLIPP and Falcon will be OME-TIFF and not zarr. I am suggesting zarr for all intermediate and output files. Plots meant for exploration should be png.

Ah, ok so AICSImageIO might be just an internal data structure that isn't responsible for writing, only reading. This makes things easier I think...

I second that. Falcon can currently only save data as OME-TIFF. We'll need a reader that can parse through that. Writing of results can happen in a different format, zarr looks promising so far.

A few more points on this topic:

pycromanager can be paired with a custom data writer. Probably not worth the effect in this point, but it's good to keep in mind that it is an option.
Zarr datasets take time to compress and decompress. Heavy compression make take even longer. Results saved in zarr format should still be easy to load up and view frequently. We could explore fast intermediate compression levels for data that are still in flux and slower heavier compression for archival purposes.
I think it's very important that processed data can be viewed in Fiji. When I tried to load a zarr file with the N5 viewer I shared, the viewer had a problem decompressing the file and the file did not open.

@bryantChhun, here are two datasets you can work with:

comp_micro/rawdata/falcon/Ivan/20201116 LF + Actin Flor Aniso/LF+Actin_FOV1_1 shape: (21, 8, 33, 2056, 2464) dims: 'TCZYX' size: ~53 GB
comp_micro/rawdata/falcon/Ivan/20201026 Two Channel Fluor Aniso/Sir-Actin FM 4-64 Ex-630-470 Em-676-29 Collagen_1 shape: (8, 45, 2056, 2464) dims: 'CZYX' size: ~3.4 GB

Using the bioFormats reader in MATLAB (https://docs.openmicroscopy.org/bio-formats/6.1.0/users/matlab/index.html), I can load the ND slice [slice(3), 1, slice(4,14), :, :] (i.e. 3 timepoints, second channel, 10 z slices, total of 30 images) from the LF+Actin_FOV1_1 dataset in ~ 15 seconds over 10 Gbps ethernet connection on Falcon. The bioFormats reader for MATLAB is relatively slow, we're looking for something that's on par or faster.

ieivanov commented 3 years ago

@camFoltz has also looked into these issues, he may be able to provide further insight

camFoltz commented 3 years ago

I think it's very important that processed data can be viewed in Fiji. When I tried to load a zarr file with the N5 viewer I shared, the viewer had a problem decompressing the file and the file did not open.

I have experienced this same issue. It seems that the N5 viewer is unable to load or find datasets within my .zarr container. I am however able to open them with no problems in Napari/python. It may be worth opening up an issue on the N5 viewer github n5 viewier github link

ieivanov commented 3 years ago

Looks like Jackson Maxfield from the Allen Institute is actively working on the ome tif reader: https://github.com/AllenCellModeling/aicsimageio/tree/feature/ome-tiff-reader.

It seems like for immediate reading, the dataset is first loaded fully into memory: https://github.com/AllenCellModeling/aicsimageio/blob/fe421c311c303c9f042aa49b8156be20b10288b5/aicsimageio/readers/ome_tiff_reader.py#L339 and then the requested dimensions are returned: https://github.com/AllenCellModeling/aicsimageio/blob/fe421c311c303c9f042aa49b8156be20b10288b5/aicsimageio/readers/ome_tiff_reader.py#L363-L371.

Delayed (dask) reading may allow loading only the requested data: https://github.com/AllenCellModeling/aicsimageio/blob/fe421c311c303c9f042aa49b8156be20b10288b5/aicsimageio/readers/tiff_reader.py#L162, we should test that

ieivanov commented 3 years ago

P.S. Looks like there are writers for OME-TIF files too

ieivanov commented 3 years ago

P.P.S. In this convention data are returned in TCZYX dimension order (down from STCZYX). Different position are written as different Scenes in the OME-TIFF file. In this version, S=0 is selected unless otherwise specified. I think this makes a lot of sense

ieivanov commented 3 years ago

https://github.com/AllenCellModeling/aicsimageio/tree/feature/ome-tiff-reader#quickstart-notes explains imaging loading. dask reading should load only the data you ask for. Hopefully it doesn't introduce large overhead. It sounds like in the future they may implement immediate reading of specific slices of the data too.

This ome-tif reader also has better metadata parsing. It seems like it returns all metadata in the ome-tif file, which will include MM metadata

bryantChhun commented 3 years ago

Ok here's my plan, please chime in if you have thoughts:

Write functions to read the various expected data formats -- use demo data starting from mm2-gamma, OME-tiff, individual tiffs, (both large multidims, t-p-z-c = 50-4-5-4 @ 2k x 2k)
Consider writing pytests for 1 that pull from google drive (as we did for reconorder 1.0)
Check functions from 1 on real data (from Ivan, Bryant): test for reading, metadata, dimension order (anything else?)
Write functions to write the data: needs exploration/evaluation look into .zarr for:
- ability to associate metadata
- ability to view in Fiji
- compression

bryantChhun commented 3 years ago

Delayed (dask) reading may allow loading only the requested data: https://github.com/AllenCellModeling/aicsimageio/blob/fe421c311c303c9f042aa49b8156be20b10288b5/aicsimageio/readers/tiff_reader.py#L162, we should test that

I investigated this quite a bit using two datasets generated from mm-demo-config:

P/T/C/Z = 4/50/4/5 2k x 2k images. (LARGE) --8x 4gb files, two files per position
P/T/C/Z = 4/10/4/5 2k x 2k images. (SMALL) --4x 1.7gb files, one file per position

Conclusion: Dask lazy loading time scales with the data file size, not with the size of the slice. For large datasets, the overhead can be significant (13-18 seconds to extract [4, 2048, 2048] array).

Here's a doc that summarizes conditions and results (run on laptop): gdrive image reader performance

And the notebooks for the tests are here: sandbox - decOrder

I haven't explored GPU acceleration using Dask ( think CUDA- CuPy is the only backend implementation), which could significantly boost its slice time.

For now i'm going to write the functions using dask delayed reading. This is easy enough for OME-tiff but harder for multiple single-page-tiffs.

mattersoflight commented 3 years ago

@bryantChhun very interesting benchmarks. From your notebook, I understand that

the tifffile is the library used by both AICSimageio and the dask loader you wrote.
Reading subset of data is more efficient with AICSimageio instead of doing lazy loading with dask.

Correct?

mattersoflight commented 3 years ago

Write functions to write the data: needs exploration/evaluation look into .zarr for:
* ability to associate metadata

* ability to view in Fiji

* compression

There is an example in pycro-manager repository: https://pycro-manager.readthedocs.io/en/latest/application_notebooks/convert_MM_MDA_data_into_zarr.html. But it does not touch on metadata.

bryantChhun commented 3 years ago

the tifffile is the library used by both AICSimageio and the dask loader you wrote.

Yes that's correct

Reading subset of data is more efficient with AICSimageio instead of doing lazy loading with dask.

For small datasets, I think AICSImageio is faster by maybe 20%. For Large datasets, it might be slower by 40-50%. I'm not sure the 20% difference is significant as it seems to be running faster today than yesterday (more free memory on my local laptop?) -- will have to rerun all the tests to confirm. The difference for large datasets might make sense -- Dask slicing has no knowledge of the meaning of the dimensions but AICSImageIO has some interpretation and reordering overhead.

bryantChhun commented 3 years ago

There is an example in pycro-manager repository: https://pycro-manager.readthedocs.io/en/latest/application_notebooks/convert_MM_MDA_data_into_zarr.html. But it does not touch on metadata.

In this example, the approach is to open micro-manager, load the tiffs there, then use pycromanger to pull that data into python then save. Micro-manager will load the entire stack if even one individual tiff (not ome-tiff) is dragged into the GUI -- this is nice but not obvious how to integrate into our work.

One idea is to package the necessary micro-manager .jars with decOrder, then call those by subprocess.call with CLI flags. This is not obvious considering micro-manager does not have headless mode. ImageJ/Fiji does have headless mode, but so far bioformats importer does not support mm2-gamma.

Here is the file loading logic implemented in micro-manager.

Another idea is simply to re-implement that SinglePlaneTiffSeries data store in python. In the above link, the datastructure builds a hashmap of coordinates-to-filenames and pulls from this whenever a user/viewer requests a certain coordinate. We can do the same, effectively creating our own "lazy-loader" but for a series of single-page-tiffs.

mattersoflight commented 3 years ago

Hi @bryantChhun, It would be excessive overhead to include micro-manager jars in our package or to re-implement the data store from micro-manager. Quoting @ieivanov's observation "This (AICSimageio) ome-tif reader also has better metadata parsing. It seems like it returns all metadata in the ome-tif file, which will include MM metadata" .

My intent with sharing pycro-manager example was to share an example of how a reader that returns both data and metadata from OME-TIFF be used to write ND-zarrs. Sorry if my comment sent you on a tangent.

mattersoflight commented 3 years ago

For small datasets, I think AICSImageio is faster by maybe 20%. For Large datasets, it might be slower by 40-50%. I'm not sure the 20% difference is significant as it seems to be running faster today than yesterday (more free memory on my local laptop?) -- will have to rerun all the tests to confirm. The difference for large datasets might make sense -- Dask slicing has no knowledge of the meaning of the dimensions but AICSImageIO has some interpretation and reordering overhead.

@bryantChhun I see that AICSimageio+ dask object is fast for small datasets. But, I don't see a comparision of dask+AICSimageio vs direct dask loading in the google sheet.

small dataset:

large dataset: runtime with AICSimageio+ dask is not in the google sheet.

ieivanov commented 3 years ago

@bryantChhun, thanks for running these test.

Can you confirm which version/branch of aicsimageio you used? Based on the calls to get_image_data("CYX", Z=0, S=0, T=0) it looks like you've used the master branch. The feature/ome-tif-reader branch (https://github.com/AllenCellModeling/aicsimageio/tree/feature/ome-tiff-reader) has an improved API and may provide some enhancement to file reader. This brach is staged for merging into v4.0 of aicsimageio, it will be good to test how image reading performs there.

You're right that the AICSImage object uses OmeTiffReader behind the scenes. AICSImage assigns a reader based on the file format, so it makes sense that calls to AICSImage and OmeTiffReader give very similar performance for ome-tiff files. I think in our code, we should use the parent AICSImage object, in theory this will allow us to switch to a different raw data file format such as TIF or CZI, for example.

For files that do fit in memory, it'll be good to know how dask reading compares to direct file reading. In your comparison with the small dataset get_image_data("CYX", Z=0, S=0, T=0) averaged 1.62 s ± 4.85 s per loop. Do you know where the large standard deviation comes from? get_image_dask_data("CYX", Z=0, S=0, T=0).compute() was much quicker at 699 ms ± 71.4 ms per loop, indicating that data is cashed. Can you run a comparison where you create the AICSImage object at the beginning and then use it to loop over the Z dimension and load data using get_image_data or get_image_dask_data? That should be a very common use case for us.

Calls to get_image_data currently load the entire dataset and then return only the slice you asked for. This, of course, is very inefficient, and is likely why that test crashed (32 GB may not fit in your laptop memory). I don't think get_image_data is a good option for us, hopefully future releases will implement smarter immediate data loading. I think for now we'll have to rely on dask arrays

bryantChhun commented 3 years ago

Can you confirm which version/branch of aicsimageio you used? Based on the calls to get_image_data("CYX", Z=0, S=0, T=0) it looks like you've used the master branch. The feature/ome-tif-reader branch (https://github.com/AllenCellModeling/aicsimageio/tree/feature/ome-tiff-reader) has an improved API and may provide some enhancement to file reader. This brach is staged for merging into v4.0 of aicsimageio, it will be good to test how image reading performs there.

Good catch! I was using the pypi version, which is dated to dec 13th (and should be off master branch). I'll try the ome-tif-reader branch next.

You're right that the AICSImage object uses OmeTiffReader behind the scenes. AICSImage assigns a reader based on the file format, so it makes sense that calls to AICSImage and OmeTiffReader give very similar performance for ome-tiff files. I think in our code, we should use the parent AICSImage object, in theory this will allow us to switch to a different raw data file format such as TIF or CZI, for example.

Sounds good to me. Also, I stumbled upon this closed thread in the issues. It also links to the "Mission and Values, Roadmap, and Governance" in the docs directory. They seem committed to making this extensible with a stable API (allowing community contribution of readers). I was in the process of implementing a dask reader for normal single page tiffs (not ome-tiff) output from micro-manager gamma, mostly to test whether the time issues were with the file format or with dask reading (maybe the above branch solves this). If addition of readers is simple enough, we could contribute a micro-manager reader.

For files that do fit in memory, it'll be good to know how dask reading compares to direct file reading. In your comparison with the small dataset get_image_data("CYX", Z=0, S=0, T=0) averaged 1.62 s ± 4.85 s per loop. Do you know where the large standard deviation comes from? get_image_dask_data("CYX", Z=0, S=0, T=0).compute() was much quicker at 699 ms ± 71.4 ms per loop, indicating that data is cashed. Can you run a comparison where you create the AICSImage object at the beginning and then use it to loop over the Z dimension and load data using get_image_data or get_image_dask_data? That should be a very common use case for us.

I think large deviations suggest the result is cached -- so the first of the %timeit runs may be long but all subsequent will be fast. It shouldn't cache between calls to %timeit, but this is easy to test ... just run the same line twice. I'll redo a handful of these with the new branch and see how it goes.

Calls to get_image_data currently load the entire dataset and then return only the slice you asked for. This, of course, is very inefficient, and is likely why that test crashed (32 GB may not fit in your laptop memory). I don't think get_image_data is a good option for us, hopefully future releases will implement smarter immediate data loading. I think for now we'll have to rely on dask arrays

agreed

ieivanov commented 3 years ago

If addition of readers is simple enough, we could contribute a micro-manager reader.

Adding readers is fairly simple, the API is very straightforward. I added a micro-manager reader (https://github.com/czbiohub/aicsimageio/tree/feature/micromanager_reader) based on the OmeTiffReader class, which also pulled the micro-manager metadata. I think we don't need this reader anymore, as the new version of the OmeTiffReader pulls all metadata, including micro-manager metadata (tho I have to double check that). Adding a TiffSequenceReader (or something similar) based on the TiffReader class shouldn't be hard. It will be a nice feature if this reader can guess the dimensions based on the file names with clues like z%d or t%d, this will make it more robust to changes in the MM naming conventions. Using dask, we may be able to employ parallel reading of individual files over multiple workers, which will speed us file loading (which is slower compared to loading a ome.tif). Let me know if you'd like to work on this together.

bryantChhun commented 3 years ago

Adding readers is fairly simple, the API is very straightforward. I added a micro-manager reader (https://github.com/czbiohub/aicsimageio/tree/feature/micromanager_reader) based on the OmeTiffReader class, which also pulled the micro-manager metadata. I think we don't need this reader anymore, as the new version of the OmeTiffReader pulls all metadata, including micro-manager metadata (tho I have to double check that). Adding a TiffSequenceReader (or something similar) based on the TiffReader class shouldn't be hard. It will be a nice feature if this reader can guess the dimensions based on the file names with clues like z%d or t%d, this will make it more robust to changes in the MM naming conventions.

I just pushed some functions to the branch data_reader_writer. These functions enable reading of the folder structure for "save individual file" format from micro-manager 2.0. It's based on metadata.txt parsing rather than filename parsing. I've only tested this in jupyter notebooks.

The problem i've hit is that dask loading of a large number of .tiff files (33 GB), following the pattern used by aicsimageio tiffreader , doesn't delay the reading -- it reads them immediately, causing memory crash issues just like get_image_data.

The original purpose of developing this sequence reader was twofold:

Investigate whether the ome-tiff file format inherently caused slow lookups/slicing compared to individual files loaded to dask. This is plausible considering the linked-list data structure of ome-tiff, preventing indexing or hashing on specific image coordinates. But also, the changes in the branch you mention above may fix this.
Enable reading of the other micro-manager file writing mode.

If we decide that number 2 is not important, that we should enforce all users save as/convert to ome-tiff, and if the new ome-tiff-reader branch is performant, we can abandon writing this tiff sequence reader _for _now__. Instead we can focus 100% on just .zarr image and metadata writing (the reader will simply be AICSImageIO).

Using dask, we may be able to employ parallel reading of individual files over multiple workers, which will speed us file loading (which is slower compared to loading a ome.tif). Let me know if you'd like to work on this together.

I think parallel reading of files is very important and can help enable multiprocessing too. File reading is very much an OS level operation and can be done with simple asyncio routines. Maybe you can take a stab at writing a worker class to interface with async file reading and writing? I'll look at the ome-tiff branch and dive into .zarr file and metadata today or tomorrow.

ieivanov commented 3 years ago

If we decide that number 2 is not important, that we should enforce all users save as/convert to ome-tiff, and if the new ome-tiff-reader branch is performant, we can abandon writing this tiff sequence reader for now. Instead we can focus 100% on just .zarr image and metadata writing (the reader will simply be AICSImageIO).

Remember that with the IPS I can only write to ome.tif files, writing individual images to separate files is just too slow. I think our workflow should be to always save data as ome.tif and have a single-page tif reader for backward compatibility. In that case, I think we shouldn't invest too much effort in optimizing the single-page tif reader, and instead can focus on making the ome-tif reader more performant, if needed. What do you think?

camFoltz commented 3 years ago

Adding to Ivan's point, any custom acquisition notebook that uses pycromanager will also be saved as ome.tiff stacks. The only reason that I collect data as individual ome.tiff files is because that is the only way I can use reconstruct-order. As we moved to this new compute infrastructure, it will not be necessary to save as single-page tifs.

However, if we want backwards compatibility with say MM 1.4.22, it would be necessary to have a single-page tiff reader. I do not see our group ever regressing to this version of MM, but I am not sure about other labs/groups.

mattersoflight commented 3 years ago

During the last discussion, we decided that:

We won't support MM 1.4.22 acquisitions with new pipeline (recOrder/waveOrder).
We will save data in performant multi-page formats: ome.tif at acquisition and .zarr during analysis.
We will add readers to allow re-analysis of existing data with new algorithms OR write converters that transform single-page tiffs into ome.tif.

It looks like @bryantChhun is working towards a function that will map single-page tiffs into AICSimageio object. That is useful down the line. https://github.com/mehta-lab/decOrder/blob/ce3833972e32b3981e4bf9f17b6f094108405003/deconorder/io/image_readers.py#L59

But, more immediate use case is to read .ome.tif and slice it as needed for processing the data.

ieivanov commented 3 years ago

I should mention that pycro-manager has its own data reader (https://pycro-manager.readthedocs.io/en/latest/read_data.html) as data saved with pycro-manager doesn't need to comply to the OME-TIFF standard (e.g. in pycro-manager the first time point can have 10 z slices and the next time point can have 20). pycro-manager saves data in a regular multipage tif file and then uses the metadata to assemble an array.

I don't know if the pycromanager data reader will read regular ome.tif files saved by micro-manager or if data slicing is as efficient as when using aicsimageio. We should look into that too, it could be that the pycromanager data reader is what we're looking for.

mehta-lab / recOrder