pangeo-data / rechunker

Disk-to-disk chunk transformation for chunked arrays.
https://rechunker.readthedocs.io/
MIT License

Inconsistent `target_chunks` api behavior between zarr group and xarray dataset #76

Open rabernat opened 3 years ago

rabernat commented 3 years ago

The docs say the following about the target_chunks argument when rechunking a group:

For a group of arrays, a dict is required. The keys correspond to array names. The values are target_chunks arguments for the array. For example, {'foo': (20, 10), 'bar': {'x': 3, 'y': 5}, 'baz': None}. All arrays you want to rechunk must be explicitly named. Arrays that are not present in the target_chunks dict will be ignored.

Xarray datasets are very similar to Zarr groups, but the behavior is a bit different with Xarray datasets. This difference is documented in the tests but not in the docs. Here is the target_chunks parameter for test_rechunk_dataset:

https://github.com/pangeo-data/rechunker/blob/f55a475776f13180a9c41d0c3e652ac9ed17c4a9/tests/test_rechunk.py#L49-L50

Note that the variable c is not present. However, it is present in the output dataset:

https://github.com/pangeo-data/rechunker/blob/f55a475776f13180a9c41d0c3e652ac9ed17c4a9/tests/test_rechunk.py#L123-L125

Its original chunks have been preserved, which is a reasonable default.

We should strive to reconcile, or at least document, this difference. My personal preference would be to change the API so that a flat zarr group behaves the same as the xarray dataset: variables that are not mentioned in target_chunks simply get passed through with identical chunks.
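To make the two behaviors concrete, here is a rough sketch in plain Python. It does not call rechunker at all: the helper names `resolve_group_style` and `resolve_dataset_style`, the array names, and the chunk shapes are all made up for illustration, and the real code paths may well differ.

```python
from typing import Dict, Tuple

Chunks = Tuple[int, ...]


def resolve_group_style(
    source_chunks: Dict[str, Chunks], target_chunks: Dict[str, Chunks]
) -> Dict[str, Chunks]:
    """Hypothetical helper: behavior the docs describe for a Zarr group.

    Arrays that are not named in target_chunks are ignored (dropped).
    """
    return {
        name: target_chunks[name]
        for name in source_chunks
        if name in target_chunks
    }


def resolve_dataset_style(
    source_chunks: Dict[str, Chunks], target_chunks: Dict[str, Chunks]
) -> Dict[str, Chunks]:
    """Hypothetical helper: behavior the dataset test exhibits.

    Variables that are not named in target_chunks pass through
    with their original chunks.
    """
    return {
        name: target_chunks.get(name, chunks)
        for name, chunks in source_chunks.items()
    }


# Illustrative inputs only; "c" is deliberately left out of target_chunks.
source = {"a": (10, 10), "b": (5,), "c": (2, 2)}
target = {"a": (20, 5), "b": (10,)}

print(resolve_group_style(source, target))    # "c" is absent from the result
print(resolve_dataset_style(source, target))  # "c" keeps its chunks (2, 2)
```

Under the proposal, a flat zarr group would follow the second function rather than the first.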

cc @eric-czech, who wrote test_rechunk_dataset and so probably understands this part of the code best.