pangeo-data / rechunker

Disk-to-disk chunk transformation for chunked arrays.
https://rechunker.readthedocs.io/
MIT License

Allow rechunk to accept a Dask array #21

Closed. tomwhite closed this issue 4 years ago.

tomwhite commented 4 years ago

Thank you for all your work on this library! In testing it I have found it useful to create random persistent Zarr arrays with different chunk sizes. It struck me that this is just an application of rechunk, where the input is a Dask array rather than a Zarr array.

For example, if I want to create an array of shape (4000, 4000) with 10 chunks of size (400, 4000), using 5 tasks where each task creates 2 chunks (by making the source chunks (800, 4000)), it would be nice to do this with something like:

import dask.array as da
from rechunker import rechunk

source = da.random.random(size=(4000, 4000), chunks=(800, 4000))
max_mem = ...       # memory budget for each task
target_store = ...  # target Zarr store
plan = rechunk(source, (400, 4000), max_mem, target_store)
plan.execute()

The read and write chunk sizes would be the same in this case, so it's a simple pass through with no intermediate store.
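For concreteness, a filled-in version of the sketch above might look like the following once rechunk accepts a dask input; the max_mem and target_store values here are hypothetical placeholders, not anything the library prescribes:

import dask.array as da
from rechunker import rechunk

# 5 source chunks of (800, 4000), each covering 2 target chunks of (400, 4000)
source = da.random.random(size=(4000, 4000), chunks=(800, 4000))

# hypothetical values: a per-task memory budget and a local Zarr store path
max_mem = "100MB"
target_store = "target.zarr"

plan = rechunk(source, (400, 4000), max_mem, target_store)
plan.execute()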

Would it be possible to make rechunk cater for this by accepting a Dask array?

rabernat commented 4 years ago

Yes, in fact, this was discussed in #4! However, for the first release, we opted for only supporting zarr arrays / groups.

The big question for dask arrays is whether to try to perform "read consolidation" on the input by calling .rechunk on the dask array. @TomAugspurger suggested that this might not work optimally in all cases. There is a consolidate_reads=False option in rechunking_plan; we might want to enable this for dask inputs.

We would love a PR if you wanted to implement this.
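Roughly, read consolidation on a dask input would amount to something like the sketch below. This is illustrative only: it just uses dask's own .rechunk, the chunk shapes are hypothetical, and it is not rechunker's internal implementation.

import dask.array as da

# a dask source with small chunks along the first axis
source = da.random.random(size=(4000, 4000), chunks=(400, 4000))

# "read consolidation" would mean asking dask for larger read chunks
# before the copy, e.g. two source chunks per read task
consolidated = source.rechunk((800, 4000))

# whether dask produces an efficient graph for this is the open question
# (dask/dask#6272); skipping this step corresponds to consolidate_reads=False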

tomwhite commented 4 years ago

Thanks @rabernat. I will see if I can write a PR for this.

TomAugspurger commented 4 years ago

https://github.com/dask/dask/issues/6272 is the (potential) issue with rechunking the arrays. I haven't verified that it's actually an issue in practice, though.

tomwhite commented 4 years ago

Fixed in #27