pangeo-data / rechunker

Disk-to-disk chunk transformation for chunked arrays.
https://rechunker.readthedocs.io/
MIT License

Rechunk group to group #79

Open valpesendorfer opened 3 years ago

valpesendorfer commented 3 years ago

Hello,

I'm experimenting with multi-group Zarr stores, where each group represents a separate tile. The tiles themselves are structured the same way (same dimensions, variables, etc.) and are created with xarray.

Here's a small example of what it can look like:


/
 ├── h19v07
 │   ├── band (1200, 1200, 5) int16
 │   ├── time (5,) int64
 │   ├── x (1200,) float64
 │   └── y (1200,) float64
 ├── h19v08
 │   ├── band (1200, 1200, 5) int16
 │   ├── time (5,) int64
 │   ├── x (1200,) float64
 │   └── y (1200,) float64
 ├── h20v07
 │   ├── band (1200, 1200, 5) int16
 │   ├── time (5,) int64
 │   ├── x (1200,) float64
 │   └── y (1200,) float64
 └── h20v08
     ├── band (1200, 1200, 5) int16
     ├── time (5,) int64
     ├── x (1200,) float64
     └── y (1200,) float64

I was wondering if there's a way to rechunk a nested group like this in a single rechunk call, i.e. group to group.

The only way I've been successful in rechunking the groups was by iterating over each one and running rechunk individually, with the group path (i.e. target.zarr/group) set as target_store.
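
For reference, a minimal sketch of that per-group workaround (store paths, max_mem, and the exact loop are illustrative, not verbatim from my setup):

import zarr
from rechunker import rechunk

source = zarr.open_group("raw.zarr", mode="r")

for name, group in source.groups():
    # Rechunk each tile group on its own, writing into a matching
    # subpath of the target store (paths and chunk sizes are illustrative).
    plan = rechunk(
        group,
        target_chunks={"band": (256, 256, 5)},
        max_mem="1GB",
        target_store=f"target.zarr/{name}",
        temp_store=f"temp.zarr/{name}",
    )
    plan.execute()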

Thanks

rabernat commented 3 years ago

Hi @valpesendorfer! 👋 Thanks for this interesting issue.

You're correct: right now, we only support Zarr arrays or flat groups (no groups within groups). The challenge here is that Zarr supports arbitrarily deep nesting of groups. It's hard for me to think of the right behavior.

So I have a question for you. What API syntax (call to rechunk) would you like to see here? Specifically, how would you specify target_chunks for these nested groups?

valpesendorfer commented 3 years ago

👋 and may I add, thanks for all your work.

First, it's good to know I'm not missing something essential; I've only started working with Zarr recently.

To answer your Q, I don't know exactly ...

But ideally, the manual work of iterating over the groups would be taken care of by rechunker. Since this is all expected to run on a Dask cluster (in my case), that could also be more efficient, because all the tasks could be submitted at once. I haven't yet tried generating all the plans first and then executing them on the cluster; I was too worried about leftover temporary storage killing the processes.

In this specific case I don't want to do anything fancy: I want to keep the same structure, just with the band array rechunked.

So to specify target_chunks, this is what I currently do for each group:

target_chunks = {
    'band': (256, 256, 5),
}

This could be extended to any group that has a band array.

Or, more verbosely, using a nested dictionary with a group : array : chunk syntax, something like


target_chunks = {
    "/full/path/to/group1": {
        "band": (256, 256, 5),
    },
    "/full/path/to/group2": {
        "band": (256, 256, 5),
    },
}

This mapping could be generated dynamically:

import zarr

zarr_raw = zarr.open("raw.zarr", mode="r")
target_chunks = {group: {"band": (256, 256, 5)} for group in zarr_raw.group_keys()}

Edit

Ok, I see now the code above doesn't work for nested groups ... that would require something smarter. I've only worked with groups as exemplified above, and I don't think there's any reason for me to go deeper.
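
For the nested case, a hedged sketch of that "something smarter" could walk the group tree recursively to build the nested mapping. The collect_band_chunks helper name and the chunk sizes are illustrative, and none of this is rechunker API:

import zarr

def collect_band_chunks(group, chunks=(256, 256, 5)):
    # Recursively collect a {group_path: {"band": chunks}} mapping for
    # every subgroup that contains a "band" array (illustrative helper).
    mapping = {}
    for _, subgroup in group.groups():
        if "band" in subgroup.array_keys():
            mapping[subgroup.path] = {"band": chunks}
        mapping.update(collect_band_chunks(subgroup, chunks))
    return mapping

zarr_raw = zarr.open("raw.zarr", mode="r")
target_chunks = collect_band_chunks(zarr_raw)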