Coordinates not deep copy

mgaspary commented 1 year ago

What happened?

The coordinates are not copied if you perform deepcopy for xarray. This issue was fixed before. I don't see it, for example, in xarray 2022.12.0, but in the latest version the issue is there again.

import xarray as xr

xarr1 = xr.DataArray(
    np.zeros([2]), 
    coords=dict(x=[0.0, 1.0]), # important to use 'float' here! with 'int' it is working fine
    dims=("x")
)
print(xarr1.x.data[0]) # 0.0

xarr2 = xarr1.copy(deep=True)
xarr2.x.data[0] = 45
print(xarr1.x.data[0]) # gives 45

Interesting, that if your coordinates are int, then the issue is gone

What did you expect to happen?

import xarray as xr

xarr1 = xr.DataArray(
    np.zeros([2]), 
    coords=dict(x=[0.0, 1.0]), # important to use 'float' here! with 'int' it is working fine
    dims=("x")
)
print(xarr1.x.data[0]) # 0.0

xarr2 = xarr1.copy(deep=True)
xarr2.x.data[0] = 45
print(xarr1.x.data[0]) # I expect it to be 0.0

Minimal Complete Verifiable Example

import xarray as xr

xarr1 = xr.DataArray(
    np.zeros([2]), 
    coords=dict(x=[0.0, 1.0]), # important to use 'float' here! with 'int' it is working fine
    dims=("x")
)
print(xarr1.x.data[0]) # 0.0

xarr2 = xarr1.copy(deep=True)
xarr2.x.data[0] = 45
print(xarr1.x.data[0]) # gives 45

MVCE confirmation

[X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
[X] Complete example — the example is self-contained, including all data and the text of any traceback.
[X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
[X] New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

No response

Anything else we need to know?

No response

Environment

xarray 2023.1.0 python 3.8.10

headtr1ck commented 1 year ago

It seems that in copy IndexVariables are treated special and are explicitly excluded from copying. @benbovy what was the reasoning behind this choice?

benbovy commented 1 year ago

I think that the reverting change in IndexVariable came after refactoring copy in Xarray introduced some performance regression (https://github.com/pydata/xarray/pull/7209#issuecomment-1305593478).

I didn't see #1463 (https://github.com/pydata/xarray/issues/1463#issuecomment-340454702), though. It feels weird to me that we can mutate an IndexVariable via its data property, considering that the underlying index is immutable. IIUC xarr2.x.data[0] = 45 replaces the full index with a new one? I'm not sure if it is a good idea to allow this. For a pandas index that's probably OK (it is reasonably cheap to rebuild a new index) but for a custom index that is expensive to build (e.g., kd-tree) I don't think this behavior is desirable.

headtr1ck commented 1 year ago

I didn't see #1463 (#1463 (comment)), though. It feels weird to me that we can mutate an IndexVariable via its data property, considering that the underlying index is immutable. IIUC xarr2.x.data[0] = 45 replaces the full index with a new one? I'm not sure if it is a good idea to allow this. For a pandas index that's probably OK (it is reasonably cheap to rebuild a new index) but for a custom index that is expensive to build (e.g., kd-tree) I don't think this behavior is desirable.

should we then raise an error if someone tries to replace values in an index?

benbovy commented 1 year ago

Yes I think we should, but I might have missed the rationale behind allowing it if this is intentional.

EDIT: perhaps better to issue a warning first to avoid some breaking change. We could also try to fix it (make a deep copy) at the same time as deprecating it, but that might be tricky without again introducing performance regressions.

mgaspary commented 1 year ago

Yes I think we should, but I might have missed the rationale behind allowing it if this is intentional.

EDIT: perhaps better to issue a warning first to avoid some breaking change. We could also try to fix it (make a deep copy) at the same time as deprecating it, but that might be tricky without again introducing performance regressions.

I would assume that deepcopy are completely going to copy the object, so if I change internal parts (like coordinates here), 1st object shall not change. It definitely affects the performance, but otherwise deepcopy is not deep anymore

benbovy commented 1 year ago

There are two issues:

whether we should continue allowing IndexVariable data be updated in place via .data property. IMO we should really deprecate it, especially that now it is possible to have custom, possibly expensive index structures built from one or more coordinates.
whether deep=True should deep copy the Xarray index objects. I don't have strong opinion on this. There is a similar discussion on the pandas side: https://github.com/pandas-dev/pandas/issues/19862. I wonder if we reverted the change here because some high-level operations in Xarray were by default deep copying the indexes? I don't think we would want such behavior unless the user explicitly sets deep=True somewhere?

dcherian commented 1 year ago

whether we should continue allowing IndexVariable data be updated in place via .data property. IMO we should really deprecate it

I agree. We don't allow it on .values anyway for this reason.

Now I see we already did this in https://github.com/pydata/xarray/pull/3862/files So I guess that got lost in the indexes refactor.

whether deep=True should deep copy the Xarray index objects

I would expect deep=True to deep-copy absolutely everything. That said if the Index objects are immutable then it doesn't matter?

On that pandas thread I see:

discovered that even if the id of the index was different on the copy, modifying the cp.index.to_numpy() values was corrupting the original.

Seems like we should all be setting the numpy data to be read-only with array.flags.WRITEABLE = False

lee1043 commented 2 months ago

Hi, I am hitting the similar issue, and curious if there any plan for this issue. I think when deep=True, reasonable expectation is everything is going to be deep copied, regardless of coordinate or data variable. This seems like a long standing issue (e.g., https://groups.google.com/g/xarray/c/ksMI8TmRPaA?pli=1).

max-sixty commented 2 months ago

If there are difficulties or ambiguities here, would it be reasonable to switch to deep-copying everything for the moment?

If we can figure out optimizations on things that are definitely immutable, then great, but in the meantime any objections to defaulting to the slower-but-always-correct thing?

juntyr commented 1 month ago

I also just stumbled across this bug. A rogue plotting script is changing the longitude of a map internally, and even when passing the script a da.copy(), the da's own longitude is modified after plotting. Deep copies should be truly deep. If the performance loss is too large to pass on to everyone, the copy function should get a new optional parameter to force the really-deep copy mode.

max-sixty commented 1 month ago

I don't hear anyone clamoring to retain skipping indexes from deep copy. Would anyone like to do a PR to include them in deep copy, and we'll plan to merge that unless anyone comes up with a thought-through objection?

pydata / xarray