sjperkins opened 6 days ago
Really cool to see you using xarray for radio astronomy data! I didn't know we had users in that field.
> I propose that the `chunks` kwarg in `BackendEntrypoint.open_datatree` support a chunking dictionary per path (i.e. `DataTree` node).
Good idea! We would be happy to take a PR if you want to generalize this.
> An entry in the above dictionary does not necessarily need to apply only to a single node: it could also apply the chunking schema to each subtree below the node. But it may be better to make this more explicit.
I think we should avoid the temptation to make this overly clever, at least initially, because the `chunks` kwarg type is already heavily overloaded. Per-node and per-variable chunking would be sufficiently expressive for all use cases. The only other subtlety the `chunks` dict validation code would need to watch out for is duplicated coordinates.
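A minimal sketch of what that validation could look like, assuming a per-node dict keyed by DataTree paths (the function name and key layout are hypothetical, not xarray API):

```python
def find_conflicting_chunks(per_node_chunks):
    """Flag dimensions that receive conflicting chunk sizes across nodes,
    e.g. a coordinate shared by several nodes chunked two different ways."""
    first_seen = {}  # dim -> (path, size) of the first request for that dim
    conflicts = {}
    for path, node_chunks in per_node_chunks.items():
        for dim, size in node_chunks.items():
            if dim in first_seen and first_seen[dim][1] != size:
                # Record every request for the disputed dimension
                conflicts.setdefault(dim, [first_seen[dim]]).append((path, size))
            else:
                first_seen.setdefault(dim, (path, size))
    return conflicts
```

Consistent requests for a shared dimension pass cleanly; only genuinely conflicting sizes are reported.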
Yes, this makes a lot of sense to me. Quite often dimension sizes will differ per node, so it does not make sense to use a single shared set of chunks.
Yes, in principle I'd like to submit a PR. Apologies for not replying sooner; I need to devote more time to thinking about the change:
In particular, `open_datatree` (and `open_group_as_dict`) defers to the backend's `open_datatree` implementation, which seems to imply that it's the backend's responsibility to interpret the `chunks` dictionary and pass it through to the backend's or xarray's `open_dataset` method. I don't immediately see a good way to do this by intercepting `chunks` before the API calls and dispatching the appropriate chunking strategy/schema to each dataset.
Perhaps the full chunking schema/strategy could be passed to the `open_dataset` method, along with the tree node path, so that `open_dataset` can make the decision? But that seems ugly.
Neither of the above seems appealing -- I'll try to find some more time to think about this.
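For what it's worth, the per-node dispatch could be as small as a helper that selects the right entry before each `open_dataset` call. A rough sketch, where treating string keys starting with "/" as node paths is purely an assumption for illustration:

```python
def chunks_for_node(chunks, node_path):
    """Return the chunks entry that applies to a given DataTree node.

    If `chunks` looks like a per-path mapping (all keys are strings
    starting with "/"), select the entry for `node_path`; otherwise
    assume it is a plain chunking schema shared by every node.
    """
    if isinstance(chunks, dict) and chunks and all(
        isinstance(k, str) and k.startswith("/") for k in chunks
    ):
        return chunks.get(node_path)
    return chunks
```

A real implementation would need stricter validation (dimension names could in principle start with "/"), but it shows the dispatch can live outside the backends.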
I may have missed something, but I don't think `open_datatree` supports dask / chunking at all right now: the code of the backends does not handle / receive `chunks`, which I believe is by design. `open_dataset` calls `_dataset_from_backend_dataset` after the call to `backend.open_dataset` to do that, so I think `open_datatree` should do something similar.
The missing `_datatree_from_backend_datatree` would then also be the natural place for handling the per-group chunk arguments.
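To make that concrete, such a function could walk the backend's groups and chunk each dataset after the backend returns. A toy sketch with a stand-in dataset class -- apart from the function name mentioned above, none of these names are real xarray internals:

```python
class FakeDataset:
    """Stand-in for xarray.Dataset, recording the chunks applied to it."""

    def __init__(self):
        self.applied_chunks = None

    def chunk(self, chunks):
        self.applied_chunks = chunks
        return self


def chunk_backend_groups(groups, chunks):
    """Apply per-group chunking after the backend has opened the tree.

    `groups` maps node path -> dataset; `chunks` is either one schema
    shared by all groups, or a dict keyed by node path.
    """
    per_path = isinstance(chunks, dict) and bool(chunks) and all(
        isinstance(k, str) and k.startswith("/") for k in chunks
    )
    out = {}
    for path, ds in groups.items():
        node_chunks = chunks.get(path) if per_path else chunks
        out[path] = ds.chunk(node_chunks) if node_chunks is not None else ds
    return out
```

Groups without an entry in the per-path dict are simply left unchunked, mirroring how `open_dataset` skips chunking when `chunks` is `None`.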
Is your feature request related to a problem?
In the radio astronomy domain-specific xarray-ms package, we construct a `DataTree` representing partitions of a legacy data format, where each partition contains regular data cubes. As currently implemented, the custom backend supports a `partition_chunks` kwarg in the `BackendEntrypoint.open_datatree` method so that it is possible to specify different chunking schemas per partition.
The chunking specification above is specific to a radio astronomy legacy format, but it may be more generally useful to be able to specify per-DataTree-node chunking.
Describe the solution you'd like
Currently, `BackendEntrypoint.open_datatree` passes its `chunks` kwarg to each `Dataset` constructor in the `DataTree`. This is quite coarse-grained, as it applies the same chunking schema to all Datasets in the DataTree.
I propose that the `chunks` kwarg in `BackendEntrypoint.open_datatree` support a chunking dictionary per path (i.e. `DataTree` node). Then, when constructing Datasets in the DataTree, the chunking schema appropriate to the node can be applied.
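A hypothetical example of such a per-path dictionary (the paths and dimension names are illustrative, not a settled API):

```python
# Keys are DataTree node paths; values are the chunking schema to apply
# when constructing that node's Dataset. Dimension sizes often differ
# per node, so each node gets its own schema.
chunks = {
    "/partition_0": {"time": 100, "baseline": 32},
    "/partition_1": {"time": 200, "baseline": 32},
}
```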
An entry in the above dictionary does not necessarily need to only apply to a single node. It could also apply the chunking schema to each subtree below the node. But it may be better to make this more explicit
Describe alternatives you've considered
We've implemented a custom `partition_chunks` kwarg in the `BackendEntrypoint.open_datatree` method for our legacy data format.
Additional context
No response