
Per-node DataTree chunking #9634

Open sjperkins opened 6 days ago

sjperkins commented 6 days ago

Is your feature request related to a problem?

In xarray-ms, a domain-specific package for radio astronomy, we construct a DataTree representing partitions of a legacy data format, where each partition contains regular data cubes. As currently implemented, the custom backend supports a partition_chunks kwarg in the BackendEntrypoint.open_datatree method, making it possible to specify a different chunking schema per partition.
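For illustration, the call looks roughly like this (the store path, engine name, partition keys and chunk sizes below are placeholders, not the actual xarray-ms specification):

import xarray

xdt = xarray.open_datatree(
    "/path/to/legacy.ms",   # placeholder store
    engine="xarray-ms",     # placeholder engine name
    partition_chunks={
        # one chunking schema per partition of the legacy format
        "partition_0": {"time": 20, "frequency": 16},
        "partition_1": {"time": 10, "frequency": 4},
    },
)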

The chunking specification above is specific to a radio astronomy legacy format, but the ability to specify per-node chunking for a DataTree may be more generally useful.

Describe the solution you'd like

Currently, BackendEntrypoint.open_datatree passes its chunks kwarg to the construction of each Dataset in the DataTree. This is quite coarse-grained, as it applies the same chunking schema to every Dataset in the DataTree.

I propose that the chunks kwarg in BackendEntrypoint.open_datatree support a chunking dictionary per path (i.e. per DataTree node). For example:

import xarray

xdt = xarray.open_datatree(..., chunks={
  "/path/to/node1": {"time": 20, "frequency": 16},
  "/path/to/a/node2": {"time": 10, "frequency": 4},
})

Then, when constructing the Datasets in the DataTree, the chunking schema appropriate to each node can be applied.
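With today's public API, a rough way to approximate this (a sketch, not part of the proposal) is to open the tree unchunked and rebuild it with per-node chunking:

import xarray

per_node_chunks = {
    "/path/to/node1": {"time": 20, "frequency": 16},
    "/path/to/a/node2": {"time": 10, "frequency": 4},
}

xdt = xarray.open_datatree(...)  # open without chunking first
xdt = xarray.DataTree.from_dict({
    # chunk each node with its own schema, leaving unlisted nodes unchunked
    node.path: (
        node.to_dataset().chunk(per_node_chunks[node.path])
        if node.path in per_node_chunks
        else node.to_dataset()
    )
    for node in xdt.subtree
})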

An entry in the above dictionary does not necessarily need to apply to only a single node: it could also apply the chunking schema to the entire subtree below the node. But it may be better to make this explicit:

xdt = xarray.open_datatree(..., chunks={
  # Apply to node1 and any node below it
  "/path/to/node1/...": {"time": 20, "frequency": 16},
})

Describe alternatives you've considered

We've implemented a custom partition_chunks kwarg in the BackendEntrypoint.open_datatree method for our legacy data format.
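Schematically, that backend hook looks something like the following (names and structure are illustrative, not the actual xarray-ms code):

from xarray.backends import BackendEntrypoint

class LegacyFormatBackendEntrypoint(BackendEntrypoint):
    # Illustrative sketch only.
    def open_datatree(self, filename_or_obj, *, partition_chunks=None, **kwargs):
        # partition_chunks maps a partition key to a per-dimension chunk dict;
        # each partition becomes one DataTree node and is chunked with its own
        # schema while the datasets are being constructed.
        ...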

Additional context

No response

TomNicholas commented 6 days ago

Really cool to see you using xarray for radio astronomy data! I didn't know we had users in that field.

I propose that the chunks kwarg in BackendEntrypoint.open_datatree support a chunking dictionary per path (i.e. DataTree Node)

Good idea! We would be happy to take a PR if you want to generalize this.

An entry in the above dictionary does not necessarily need to only apply to a single node. It could also apply the chunking schema to each subtree below the node. But it may be better to make this more explicit

I think we should avoid the temptation to make this overly clever, at least initially, because the chunks kwarg type is already heavily overloaded. Per-node and per-variable chunking would be sufficiently expressive for all use cases. The only other subtlety that the chunk dict validation code would need to watch out for is duplicated coordinates.
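One reading of that subtlety, as an example: two nodes that share the same coordinate (e.g. inherited from a common parent) could be given conflicting chunk sizes, which the validation would need to detect or define behaviour for (paths are hypothetical):

chunks = {
    # if "time" is a coordinate shared by both nodes, these entries request
    # conflicting chunk sizes for the same variable
    "/parent/node_a": {"time": 20},
    "/parent/node_b": {"time": 10},
}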

shoyer commented 1 day ago

Yes, this makes a lot of sense to me. Quite often dimension sizes will differ per node, so it does not make sense to use a single shared set of chunks.

sjperkins commented 20 hours ago

Yes, in principle I'd like to submit a PR. Apologies for not replying sooner; I need to devote more time to thinking about the change:

In particular, open_datatree (and open_group_as_dict) defer to the backend's open_datatree implementation:

https://github.com/pydata/xarray/blob/863184dfaa9c89b3f272da3042e5e9beaac200f3/xarray/backends/api.py#L859-L864

https://github.com/pydata/xarray/blob/863184dfaa9c89b3f272da3042e5e9beaac200f3/xarray/backends/api.py#L896-L901

which seems to imply that it's the backend's responsibility to interpret the chunks dictionary and pass it through to the backend's or xarray's open_dataset method. There doesn't immediately seem to be a good way to do this by intercepting chunks before the API calls and dispatching the appropriate chunking strategy/schema to each dataset.

Perhaps the full chunking schema/strategy could be passed to the open_dataset method, along with the tree node path, so that open_dataset can make the decision? But that seems ugly.
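Concretely, that second option would look something like this (the node_path parameter is hypothetical and does not exist in the current backend API):

# hypothetical signature, for illustration of the idea only
ds = backend.open_dataset(
    filename_or_obj,
    chunks=chunks,               # the full per-node mapping
    node_path="/path/to/node1",  # so open_dataset can select its own entry
)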

Neither of the above seems appealing -- I'll try to find some more time to think about this.

keewis commented 17 hours ago

I may have missed something, but I don't think open_datatree supports dask / chunking at all right now: the backend code does not handle or receive chunks, which I believe is by design. open_dataset calls _dataset_from_backend_dataset after the call to backend.open_dataset to do the chunking, so I think open_datatree should do something similar.

The missing _datatree_from_backend_datatree would then also be the natural place for handling the per-group chunk arguments.
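A very rough sketch of such a helper (the name comes from the comment above; everything else, including the handling of per-node chunk mappings, is an assumption modelled loosely on _dataset_from_backend_dataset):

import xarray

def _datatree_from_backend_datatree(backend_tree, chunks):
    # Sketch only: ``chunks`` is either a tree-wide value (int, "auto" or a
    # dim -> size mapping) or the per-node mapping proposed in this issue.
    def chunks_for(path):
        if isinstance(chunks, dict) and any(
            isinstance(value, dict) for value in chunks.values()
        ):
            # per-node form: {"/path/to/node": {"time": 20}}
            return chunks.get(path)
        return chunks  # tree-wide form, applied to every node

    return xarray.DataTree.from_dict({
        node.path: (
            node.to_dataset()
            if chunks_for(node.path) is None
            else node.to_dataset().chunk(chunks_for(node.path))
        )
        for node in backend_tree.subtree
    })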