ome / ome-zarr-py

Implementation of next-generation file format (NGFF) specifications for storing bioimaging data in the cloud.
https://pypi.org/project/ome-zarr

Best practices for generating multiscale zarr data? #215

Open GenevieveBuckley opened 2 years ago

GenevieveBuckley commented 2 years ago

What is the current best practice for generating & saving a multiscale zarr array, given a single resolution of that data?

I gather things have changed a lot recently with the improvements to OME NGFF, so I feel like I need to ask the question. I've talked to a few people who say they use a Python script they or someone else in their lab wrote, but then add that it might be a bit hacky and they're not completely sure it's compliant with the latest NGFF spec.

I've looked at the docs, but they haven't completely clarified things for me. The write_multiscale function seems like the best option, but it requires users to have already generated the resolution levels externally (so the question remains: what is the best-practice recommendation for that?). Worse, write_multiscale appears to only accept a list of numpy arrays, which is a little odd. If I could reliably fit my high-resolution data in memory as a numpy array, I wouldn't need to use zarr at all.

The regular function for writing a zarr array seems to have a keyword argument for a downsampling function, but there's not much information on what that function should look like, or how to use the feature. (Unless I've just missed it; please point me to the right section of the docs if there's more info somewhere!)

constantinpape commented 2 years ago

Hi @GenevieveBuckley, there are convenience functions in ome_zarr that also create the multiscale levels for you. Here's an example workflow script I wrote to demonstrate the usage: https://github.com/ome/ome-ngff-prototypes/blob/main/workflows/spatial-transcriptomics-example/convert_transcriptomics_data_to_ngff.py#L39-L64 (Though I fully agree that overall this needs to be better documented ...)

Also note that using the local_mean option is currently not working, see #217, but you can e.g. use nearest instead.

(Sorry, closed by accident)

joshmoore commented 2 years ago

@toloudis / @will-moore: thoughts on rolling out (and/or testing) https://github.com/ome/ome-zarr-py/pull/192 here?

toloudis commented 2 years ago

Makes sense to me. At best it will lead to improvements, and it may also confirm some of the performance issues I was seeing with large data and dask resizing.

It might also be instructive to look at this pull request in aicsimageio, which builds on top of #192: https://github.com/AllenCellModeling/aicsimageio/pull/381. It includes an ipynb file demonstrating loading a single-resolution image and saving a multiresolution zarr. Inside the OmeZarrWriter is the code that forwards the arrays to ome-zarr-py.

joshmoore commented 2 years ago

:+1: @GenevieveBuckley, just one more minor change on that PR and then I'll get it released. Happy to have some testing either before or after.

toloudis commented 1 year ago

I'm also interested in optimal implementations for generating downsampled data for large datasets. There are many alternatives to the Scaler -- one intriguing one is https://github.com/spatial-image/multiscale-spatial-image, which seems nice and general, and dask-ready, but I have not attempted to use it yet.
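For illustration, the core idea behind most of these alternative scalers can be sketched in a few lines: build the pyramid by repeated 2x nearest-neighbour downsampling of the trailing (y, x) axes using plain striding. The function name `pyramid` and the stopping condition are my own choices, not any library's API; the point is that this slicing works identically on numpy and dask arrays, which is what makes the approach attractive for larger-than-memory data:

```python
# Library-agnostic sketch: nearest-neighbour pyramid via striding.
# The same slicing is lazy and chunk-friendly on dask arrays.
import numpy as np

def pyramid(arr, max_layer=4):
    """Return [full_res, half_res, ...], downsampling the last two axes 2x per level."""
    levels = [arr]
    for _ in range(max_layer):
        prev = levels[-1]
        if min(prev.shape[-2:]) < 2:
            break  # stop before a level degenerates to zero size
        levels.append(prev[..., ::2, ::2])
    return levels

levels = pyramid(np.zeros((2, 256, 256)), max_layer=3)
print([lvl.shape for lvl in levels])
# [(2, 256, 256), (2, 128, 128), (2, 64, 64), (2, 32, 32)]
```

A list like this is exactly the shape of input that write_multiscale expects, so a downsampler of this form slots naturally in front of it.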