pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0
3.55k stars 1.07k forks source link

DataTree.update can cause multiple root groups. #9285

Closed flamingbear closed 1 week ago

flamingbear commented 1 month ago

What happened?

Reviewing documentation for hierarchical-data.rst I saw the abe/herbert example didn't look right, updated the abe.assign() -> abe = abe.assign() and it still looked wrong

>>> abe = xr.DataTree(name="abe")
>>> herbert = xr.DataTree(name="Herb")

>>> abe.update({"herbert": herbert})

>>> print(abe)
<xarray.DataTree 'abe'>
Group: /
└── Group: /
>>> print(abe.groups)
('/', '/')

What did you expect to happen?

I expected not to see two root groups.

Minimal Complete Verifiable Example

abe = xr.DataTree(name="abe")
herbert = xr.DataTree(name="Herb")

abe.update({"herbert": herbert})
print(abe)

MVCE confirmation

Relevant log output

No response

Anything else we need to know?

it is failing the update or the assign (but because assign uses update)

Environment

/Users/savoie/.pyenv/versions/miniconda3-4.7.12/envs/xarray-tests/lib/python3.11/site-packages/_distutils_hack/__init__.py:26: UserWarning: Setuptools is replacing distutils. warnings.warn("Setuptools is replacing distutils.") INSTALLED VERSIONS ------------------ commit: 9e4b737ee0907e16703dab79e118e51b1fc36d46 python: 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:45:13) [Clang 16.0.6 ] python-bits: 64 OS: Darwin OS-release: 23.5.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: en_US.UTF-8 LANG: en_US.UTF-8 LOCALE: ('en_US', 'UTF-8') libhdf5: 1.14.3 libnetcdf: 4.9.2 xarray: 2024.6.1.dev88+g068bab28b.d20240724 pandas: 2.2.2 numpy: 1.26.4 scipy: 1.13.0 netCDF4: 1.6.5 pydap: installed h5netcdf: 1.3.0 h5py: 3.11.0 zarr: 2.17.2 cftime: 1.6.3 nc_time_axis: 1.4.1 iris: 3.9.0 bottleneck: 1.3.8 dask: 2024.4.2 distributed: 2024.4.2 matplotlib: 3.8.4 cartopy: 0.23.0 seaborn: 0.13.2 numbagg: 0.8.1 fsspec: 2024.3.1 cupy: None pint: 0.23 sparse: 0.15.1 flox: 0.9.6 numpy_groupies: 0.10.2 setuptools: 69.5.1 pip: 24.0 conda: None pytest: 8.1.1 mypy: 1.8.0 IPython: 8.25.0 sphinx: None
flamingbear commented 1 month ago

This might need to be a separate issue, but assign used to update in place rather than create a full new DataTree and this might be why? This is what I'm seeing now.

bart = xr.DataTree(name="Bart")
lisa = xr.DataTree(name="Lisa")
homer = xr.DataTree(name="Homer", children={"Lisa": lisa})

lisa.parent = homer
print(homer)

try:
    homer.parent = lisa
except Exception:
    print("that excepts as expected")

homer = homer.assign({"bart": bart})
print(homer)

# This shouldn't work
homer.parent = lisa
# This looks bad
print(homer)

And the output

<xarray.DataTree 'Homer'>
Group: /
└── Group: /Lisa
that excepts as expected
<xarray.DataTree 'Homer'>
Group: /
├── Group: /Lisa
└── Group: /
<xarray.DataTree 'Homer'>
Group: /Lisa/Homer
├── Group: /Lisa/Homer/Lisa
└── Group: /
TomNicholas commented 1 month ago

Thanks again @flamingbear ! I think I've tracked these down, and they are both the same issue. Basically there is a bug in the new implementation of ._replace_node (introduced in #9063).

What happens is that inside ._replace_node we assign to self._children instead of self.children, which therefore skips a bunch of _pre_attach and other stuff in treenode.py that ensures that all the paths are consistent. Without this then if you print the new merged_children inside .update you see this

In [3]: homer.assign(Bart=bart)
{'Lisa': <xarray.DataTree 'Lisa'>
Group: /Lisa, 'Bart': <xarray.DataTree 'Bart'>
Group: /}

Notice that Bart has a path of /, not /Bart as it should to be consistent with /Lisa. This means the .path attributes of the new_children have not yet been altered correctly to fit into the tree, and it's not safe to assign to self._children yet.

I think we should just assign to self.children instead.

TomNicholas commented 1 month ago

Another way to say this is: Assigning directly to self._children is a bug because it's not enough to just say "these are the new children", you also have to tell those children who their new parent is, and that step is being skipped.

FYI @shoyer

flamingbear commented 1 month ago

simply changing the _replace_node datatree.py L773 with this diff, does fix the "root names"

-        self._children = children
+        self.children = children

I'm still seeing the assign act strange (allowing cycles)

lisa = xr.DataTree(name="Lisa")
homer = xr.DataTree(name="Homer", children={"Lisa": lisa})
homer = homer.assign({})
homer.parent = lisa
print(homer)
<xarray.DataTree 'Homer'>
Group: /Lisa/Homer
└── Group: /Lisa/Homer/Lisa
flamingbear commented 1 month ago

I'm pretty sure you told me exactly how to fix this yesterday, but it didn't sink in until last night, I'll see if I can do this or at least throw something up to discuss.

flamingbear commented 1 month ago

So This was my confusion from the old code to the new w/r/t assign. It's not creating cycles. I just missed that the (current) assign behavior is creating a new homer so it's fine to make lisa his parent.

Anyway this will change again with #9297, but the original multiple root groups can be fixed with that one liner and I'll put that up now.

And actually you can close this with #9297 I'll just suggest the extra assert

flamingbear commented 1 week ago

closed in #9243