pangeo-data / rechunker

Disk-to-disk chunk transformation for chunked arrays.
https://rechunker.readthedocs.io/
MIT License
163 stars, 25 forks

Feature request: 'auto' in `target_chunks` #71

Open shz9 opened 3 years ago

shz9 commented 3 years ago

Hi,

Thanks for the great package! I'm currently using it in one of my projects to rechunk large symmetric matrices along a given axis. However, I'm missing a feature that I liked in Dask: automatically determining the chunk size for a given dimension. For example, say that I have the following use case:

import zarr
import dask.array as da

d = da.ones((10000, 10000))
d = d.rechunk({0: 'auto', 1: None})
d.to_zarr('my_store.zarr')

Is it possible to add a feature to accomplish the same thing in rechunker? I'm currently doing it the hacky way:

import psutil
import zarr
import dask.array as da
from rechunker import rechunk

d = da.ones((10000, 10000))
d.to_zarr('my_store.zarr')
z = zarr.open('my_store.zarr')

...

rechunked = rechunk(z,
                    target_chunks=d.rechunk({0: 'auto', 1: None}).chunksize,
                    target_store=target_store,
                    temp_store=intermediate_store,
                    max_mem=psutil.virtual_memory().available / psutil.cpu_count())

rechunked.execute()

Hope this makes sense. Thanks!

rabernat commented 3 years ago

This is a great idea, and I'd love to support it.

One question: how does dask determine the chunk size in the 'auto' dimensions? Do we feel that the same logic is appropriate in rechunker?

If so, we can probably just reuse dask's normalize_chunks function to implement this.
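As a rough sketch of what reusing `normalize_chunks` might look like: the helper below, `resolve_target_chunks`, is a hypothetical name and not part of rechunker. It assumes `dask.array.core.normalize_chunks` accepts `shape`, `dtype`, and `limit`, and that rechunker wants one chunk length per dimension rather than Dask's per-dimension tuples of block sizes.

```python
import numpy as np
from dask.array.core import normalize_chunks


def resolve_target_chunks(chunks, shape, dtype, limit=None):
    """Hypothetical helper: expand 'auto', -1 and None into concrete chunk lengths.

    `limit` is a byte budget per chunk; when None, Dask falls back to its
    configured "array.chunk-size" value.
    """
    normalized = normalize_chunks(
        chunks, shape=shape, dtype=np.dtype(dtype), limit=limit
    )
    # normalize_chunks returns, for each dimension, a tuple of block sizes.
    # The largest block in each dimension is the nominal chunk length.
    return tuple(max(blocks) for blocks in normalized)


# 'auto' along axis 0, a single chunk spanning all of axis 1.
target_chunks = resolve_target_chunks(("auto", -1), shape=(10000, 10000), dtype="f8")
print(target_chunks)
```

The resulting tuple could then be passed as `target_chunks` to `rechunk`, which avoids building a throwaway Dask array just to read its `chunksize`.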

shz9 commented 3 years ago

I think the relevant function from Dask is here: https://github.com/dask/dask/blob/a988716cfeb3a9b1015d14a334368e70ae382553/dask/array/core.py#L2709

I believe it depends on a configurable limit on chunk size, `config.get("array.chunk-size")`, which could easily be incorporated into the `rechunk` function. Reusing `normalize_chunks` would also work fine, as it handles many other cases (e.g. -1 or None for some dimensions).
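To make the role of the size limit concrete, here is a simplified, dependency-free sketch of the arithmetic. It is not Dask's actual algorithm (which, among other things, prefers chunk lengths that divide the dimension evenly); `auto_chunk_shape` and its parameters are hypothetical names, and the 128 MiB default is assumed to match Dask's default `array.chunk-size`.

```python
import math


def auto_chunk_shape(shape, chunks, itemsize, limit_bytes=128 * 2**20):
    """Hypothetical, simplified stand-in for 'auto' chunk sizing.

    chunks: one entry per dimension, each an int, None/-1 (full dimension),
    or the string "auto".
    """
    # None and -1 both mean "one chunk spanning the whole dimension".
    fixed = [s if c in (None, -1) else c for s, c in zip(shape, chunks)]
    autos = [i for i, c in enumerate(fixed) if c == "auto"]
    if not autos:
        return tuple(fixed)

    # Elements per chunk already committed by the non-auto dimensions.
    fixed_elems = math.prod(c for c in fixed if c != "auto")
    # Remaining element budget, split evenly across the auto dimensions.
    budget = limit_bytes / itemsize / fixed_elems
    side = max(1, int(budget ** (1 / len(autos))))

    return tuple(
        min(side, shape[i]) if i in autos else c for i, c in enumerate(fixed)
    )


# float64 (8 bytes), 'auto' along axis 0, full rows along axis 1.
print(auto_chunk_shape((10000, 10000), ("auto", None), itemsize=8))
```

With a 128 MiB budget and full 10000-element rows of float64, the budget works out to roughly 1677 rows per chunk; Dask itself would round this to a nearby divisor of the dimension.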

rabernat commented 3 years ago

We would welcome a pull request if you feel comfortable trying to implement this yourself. 😊