ome / ome2024-ngff-challenge

Project planning and material repository for the 2024 challenge to generate 1 PB of OME-Zarr data
https://pypi.org/project/ome2024-ngff-challenge/
BSD 3-Clause "New" or "Revised" License

Document what `ome2024-ngff-challenge` does under the hood #24

Closed · dstansby closed this issue 3 weeks ago

dstansby commented 1 month ago

The README says to run `ome2024-ngff-challenge input.zarr output.zarr`, but I am reluctant to run this on my multi-TB datasets in case it reads the whole lot into memory 😆 . It would be nice to add a bit of clarification as to what `ome2024-ngff-challenge` does under the hood. Does it create a copy of the data? Is it parallelised somehow? Does it modify data or metadata in place?

will-moore commented 1 month ago

I actually have similar questions myself! Since we are now using TensorStore (instead of Dask) to read and write the data, I'm not so familiar with the data-loading behaviour.
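
For what it's worth, here is a minimal sketch of what a TensorStore-based copy can look like, assuming local file-backed stores and illustrative paths (this is not the tool's actual code). The relevant point is that TensorStore resolves I/O asynchronously and streams the copy chunk by chunk, so a copy like this shouldn't materialise the whole array in memory:

```python
import tensorstore as ts

# Hypothetical paths and drivers for illustration; the real tool builds
# its own specs from the CLI arguments and the source metadata.
src = ts.open({
    "driver": "zarr",  # Zarr v2 source array
    "kvstore": {"driver": "file", "path": "input.zarr/0"},
}).result()

dst = ts.open(
    {
        "driver": "zarr3",  # Zarr v3 destination array
        "kvstore": {"driver": "file", "path": "output.zarr/0"},
    },
    create=True,
    dtype=src.dtype,
    shape=src.shape,
).result()

# write() accepts another TensorStore as its source and streams the copy
# chunk by chunk, so the full array is never held in memory at once.
dst.write(src).result()
```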

I am starting to test on larger images, e.g. https://github.com/ome/ome2024-ngff-challenge/pull/23, to see how well they are handled by the conversion. Currently the default sharding behaviour (creating a single shard that contains the whole array) isn't ideal for bigger images. There is provision for supplying shard shapes in a user-edited `parameters.json` file, but it would be nicer to define some logic for automatically picking a shard shape based on the chunk shape etc. Any suggestions, feedback, help etc. appreciated!
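
For discussion, one possible heuristic, purely a sketch (the function name, default target size and growth strategy are my own assumptions, nothing in this repo): treat the shard shape as a whole multiple of the chunk shape, and keep doubling the number of chunks per shard, fastest-varying axes first, until the shard reaches a target byte size or already spans the array:

```python
import math

def guess_shard_shape(array_shape, chunk_shape, itemsize,
                      target_bytes=512 * 2**20):
    """Pick a shard shape that is a whole multiple of the chunk shape.

    Doubles the number of chunks per shard, fastest-varying axis first,
    until the shard would exceed ~target_bytes or already spans the array.
    (A hypothetical heuristic for discussion, not the tool's behaviour.)
    """
    multiples = [1] * len(chunk_shape)
    while True:
        grew = False
        for dim in reversed(range(len(chunk_shape))):
            # Don't grow past the array extent on this axis.
            if multiples[dim] * chunk_shape[dim] >= array_shape[dim]:
                continue
            trial = list(multiples)
            trial[dim] *= 2
            size = math.prod(m * c for m, c in zip(trial, chunk_shape)) * itemsize
            if size <= target_bytes:
                multiples = trial
                grew = True
        if not grew:
            break
    return tuple(m * c for m, c in zip(multiples, chunk_shape))

# e.g. a uint16 volume chunked at (1, 64, 64, 64):
print(guess_shard_shape((2, 512, 2048, 2048), (1, 64, 64, 64), itemsize=2))
# -> (2, 512, 512, 512), i.e. 512 MiB shards of 8x8x8 chunks
```

Keeping the shard an exact multiple of the chunk shape matters because the Zarr v3 sharding codec requires the inner chunk shape to evenly divide the shard shape.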