pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

open_datatree(group='some_subgroup') returning parent nodes #9665

Open eni-awowale opened 3 hours ago

eni-awowale commented 3 hours ago

What is your issue?

@aladinor noticed this during a demo a few meetings back, but I don't think we followed up on it.

Suppose you have a DataTree of this shape:

<xarray.DataTree>
Group: /
│   Dimensions:        (lat: 1, lon: 2)
│   Dimensions without coordinates: lat, lon
│   Data variables:
│       root_variable  (lat, lon) float64 16B ...
└── Group: /Group1
    │   Dimensions:      (lat: 1, lon: 2)
    │   Dimensions without coordinates: lat, lon
    │   Data variables:
    │       group_1_var  (lat, lon) float64 16B ...
    └── Group: /Group1/subgroup1
            Dimensions:        (lat: 1, lon: 2)
            Dimensions without coordinates: lat, lon
            Data variables:
                subgroup1_var  (lat, lon) float64 16B ...

If you then specify a path with group=, you still get a nested tree, but with empty groups for the parent groups that were not specified.

In  [1]: open_datatree('filename.nc', engine='netcdf4', group='/Group1/subgroup1')
Out [1]: 
<xarray.DataTree>
Group: /
└── Group: /Group1
    └── Group: /Group1/subgroup1
            Dimensions:        (lat: 1, lon: 2)
            Dimensions without coordinates: lat, lon
            Data variables:
                subgroup1_var  (lat, lon) float64 16B ...

I thought the expected result would be to return only the specified group with all of its child nodes (if it has any), something like:

<xarray.DataTree>
Group: /Group1/subgroup1
            Dimensions:        (lat: 1, lon: 2)
            Dimensions without coordinates: lat, lon
            Data variables:
                subgroup1_var  (lat, lon) float64 16B ...

CCing the usual squad @shoyer, @keewis, @TomNicholas, @owenlittlejohns, and @flamingbear

TomNicholas commented 3 hours ago

Yes, good catch, we should fix that. I think the returned result has to be

<xarray.DataTree>
Group: /subgroup1
    Dimensions:        (lat: 1, lon: 2)
    Dimensions without coordinates: lat, lon
    Data variables:
        subgroup1_var  (lat, lon) float64 16B ...

because you can't have a group name containing slashes.

The simplest way to fix this would be to prune the groups_dict returned by open_groups_as_dict before giving it to DataTree.from_dict (or before returning it, as open_groups does, because this issue likely applies there too).

The proper way to fix it would be to fix the behaviour of open_groups_as_dict.
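
As a rough illustration of the pruning option, here is a minimal sketch in plain Python. It assumes groups_dict maps absolute group paths (such as "/Group1/subgroup1") to one object per group; the helper name prune_to_group is hypothetical and is not part of xarray. It keeps only the requested group and its descendants, and re-roots their paths so that the requested group becomes "/".

```python
from pathlib import PurePosixPath


def prune_to_group(groups_dict, group):
    """Keep only `group` and its descendants, re-rooted so `group` is "/".

    Hypothetical helper: `groups_dict` is assumed to map absolute group
    paths to arbitrary per-group objects (datasets in the real case).
    """
    root = PurePosixPath("/") / group.strip("/")
    pruned = {}
    for path, node in groups_dict.items():
        p = PurePosixPath("/") / path.strip("/")
        # Keep the requested group itself and anything nested below it.
        if p == root or root in p.parents:
            rel = p.relative_to(root)
            # The requested group maps to "/", children to "/<relative path>".
            key = "/" if rel == PurePosixPath(".") else f"/{rel}"
            pruned[key] = node
    return pruned


# Stand-in strings instead of real datasets, mirroring the file in this issue.
groups = {"/": "root", "/Group1": "group_1", "/Group1/subgroup1": "subgroup1"}
print(prune_to_group(groups, "/Group1/subgroup1"))
```

With the three-group layout from this issue, pruning to "/Group1/subgroup1" leaves a single entry keyed "/", so the parent groups never reach DataTree.from_dict. Whether the requested group should be keyed "/" or keep its own name is a design choice this sketch does not settle.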

FYI the conclusion of the discussion today was that any coordinates defined above subgroup1 should be ignored by default.

Are you up for taking this one on @eni-awowale ?

aladinor commented 3 hours ago

Thanks, @eni-awowale, for bringing this up. I am working on it, and I think it will be resolved soon.

keewis commented 3 hours ago

I think what we talked about yesterday was to make subgroup1 the root of the returned DataTree object, and then we can attach a source_group encoding (or something similar) in case we want to look up where the tree came from.

TomNicholas commented 3 hours ago

> make subgroup1 the root of the returned DataTree object

Yep that's what I was trying to say above.

> attach a source_group

Oh yes, good point.

> I am working on it

Do you want to post up your PR @aladinor (even if it doesn't work yet)? Then we can help get it in asap.

aladinor commented 3 hours ago

Sure! I will do it ASAP.