hyperspy / rosettasciio

Python library for reading and writing scientific data formats
https://hyperspy.org/rosettasciio
GNU General Public License v3.0

Test Sharding Implementation #260

Open CSSFrancis opened 4 months ago

CSSFrancis commented 4 months ago

Describe the functionality you would like to see.

For 4-D STEM, some operations would work significantly better with the dataset chunked equally in all dimensions, for example making a virtual image or applying a Gaussian filter in real space. This has traditionally been a pain because zarr likes large chunks, which translate to fast parallel operations. Similarly, hyperspy likes no chunking in the signal dimensions for the map function and for plotting.

With the V3 spec for zarr and the sharding implementation we might be able to rethink how we handle things. For example we could have the data in a format like:

[image: proposed sharded data layout]

Here the dataset essentially acts like the current ideal data structure, but within each shard there are small chunks that are fast to access along certain dimensions. This allows us to create virtual images without loading the entire dataset into memory, and it reduces the memory footprint of operations like rechunking.
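To make the trade-off concrete, here is a rough back-of-the-envelope sketch in plain Python (no zarr calls). It counts how many chunks, and how many uncompressed bytes, have to be read for a given region under two chunk layouts. The dataset shape, dtype size, ROI, and chunk sizes are all made-up values for illustration, and the helpers `chunks_touched` and `bytes_read` are hypothetical, not part of zarr or rosettasciio.

```python
from math import prod


def chunks_touched(region, chunks):
    """Count the chunks that intersect a hyper-rectangular region.

    region: one (start, stop) pair per axis, stop exclusive.
    chunks: chunk length per axis.
    """
    return prod(
        (stop - 1) // c - start // c + 1
        for (start, stop), c in zip(region, chunks)
    )


def bytes_read(region, chunks, itemsize):
    """Uncompressed bytes read if every touched chunk is loaded whole."""
    return chunks_touched(region, chunks) * prod(chunks) * itemsize


# Hypothetical 4-D STEM dataset: (scan_y, scan_x, det_y, det_x), 2 bytes/pixel.
shape = (256, 256, 128, 128)
itemsize = 2
total_bytes = prod(shape) * itemsize

# Layout A: no chunking in the signal (detector) dimensions.
layout_a = (16, 16, 128, 128)
# Layout B: equal chunks in all dimensions (assumed to sit inside larger shards).
layout_b = (16, 16, 16, 16)

# Virtual image: every scan position, 8x8 detector ROI.
virtual_image = ((0, 256), (0, 256), (60, 68), (60, 68))
# Single diffraction pattern: one scan position, full detector.
one_pattern = ((10, 11), (20, 21), (0, 128), (0, 128))

for name, layout in (("A", layout_a), ("B", layout_b)):
    vi = bytes_read(virtual_image, layout, itemsize)
    dp = bytes_read(one_pattern, layout, itemsize)
    print(f"layout {name}: virtual image reads {vi / total_bytes:.1%} of the data, "
          f"one pattern touches {chunks_touched(one_pattern, layout)} chunk(s) "
          f"({dp / 2**20:.0f} MiB)")
```

With these numbers, layout A has to read the whole dataset to build the virtual image, while layout B reads only the chunks that overlap the detector ROI. Reading a single pattern costs the same number of bytes in both layouts, but layout B touches many more chunks. Grouping those small chunks into one shard is what sharding is meant to make cheap.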

This might not be ready quite yet, as there are still some issues to solve regarding the speed of the sharding implementation: https://github.com/zarr-developers/zarr-python/discussions/1338

It is worth a discussion about whether this is something worth pursuing.