pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Automatically map top-level functions over DataTree objects? #9106

Open TomNicholas opened 3 weeks ago

Is your feature request related to a problem?

Sometimes you might want to map one of the xarray top-level functions (especially xr.concat or xr.merge) over DataTree objects.

Whilst this could potentially be done manually, we could also imagine generalizing top-level functions to handle this out of the box.

Describe the solution you'd like

For this to work:

xr.concat([dt1, dt2], dim='time')

returning a single DataTree, with xr.concat applied to the sets of datasets in corresponding nodes.

Describe alternatives you've considered

We could instead leave xarray's top-level functions unchanged but still ensure that it's relatively easy to achieve this using map_over_subtree, i.e.

concat_datatrees = datatree.map_over_subtree(xr.concat)
dt_concatenated = concat_datatrees([dt1, dt2], dim='time')

This would still require generalizing map_over_subtree to understand iterables of DataTree objects though (see https://github.com/zarr-developers/VirtualiZarr/issues/84#issuecomment-2163789123).
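
To make the idea concrete, here is a rough sketch of what "understanding iterables of trees" could mean. It deliberately avoids the real DataTree class: each tree is modelled as a plain dict mapping node path to node contents, and `map_over_corresponding_nodes` and `concat_lists` are hypothetical names invented for this illustration, not existing xarray or datatree API.

```python
from functools import wraps
from typing import Any, Callable, Mapping, Sequence


def map_over_corresponding_nodes(func: Callable[..., Any]) -> Callable[..., dict]:
    """Hypothetical helper: lift `func` to act node-wise on a sequence of trees.

    Each tree is modelled as a flat mapping {node_path: node_data}. This is a
    stand-in for DataTree, not the real class.
    """

    @wraps(func)
    def wrapper(trees: Sequence[Mapping[str, Any]], *args: Any, **kwargs: Any) -> dict:
        if not trees:
            raise ValueError("need at least one tree")
        paths = list(trees[0])
        for tree in trees[1:]:
            # Node-wise mapping only makes sense if all trees share the same node paths.
            if set(tree) != set(paths):
                raise ValueError("trees do not have the same node structure")
        # Call func once per node path, passing the list of corresponding nodes.
        return {
            path: func([tree[path] for tree in trees], *args, **kwargs)
            for path in paths
        }

    return wrapper


def concat_lists(objs, dim=None):
    """Toy stand-in for xr.concat: join lists end to end (dim is ignored)."""
    return [item for obj in objs for item in obj]


concat_trees = map_over_corresponding_nodes(concat_lists)

tree1 = {"/": [1], "/a": [10, 11]}
tree2 = {"/": [2], "/a": [12]}
result = concat_trees([tree1, tree2], dim="time")
```

The main design question this surfaces is what to do when the trees do not share the same node structure; the sketch simply raises, but a real implementation might want something more forgiving.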

Finally we could just not support this at all, in which case the only way for users to concatenate contents of datatrees node-wise is via something like

ds_concatenated = xr.concat([dt[node].ds for dt in (dt1, dt2)], dim="time")

but called for every node in the tree.

Additional context

See https://github.com/zarr-developers/VirtualiZarr/issues/84#issuecomment-2065410549 for an example of wanting to do this in VirtualiZarr (cc @jonas-spaeth).

This was actually already something we partly discussed in the datatree design meeting (https://github.com/pydata/xarray/issues/8747), but I forgot what the conclusion was (do you remember @keewis @flamingbear @owenlittlejohns?).