pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0
3.57k stars 1.07k forks source link

Coordinates not deep copy #7463

Open mgaspary opened 1 year ago

mgaspary commented 1 year ago

What happened?

The coordinates are not copied if you perform deepcopy for xarray. This issue was fixed before. I don't see it, for example, in xarray 2022.12.0, but in the latest version the issue is there again.

import xarray as xr

xarr1 = xr.DataArray(
    np.zeros([2]), 
    coords=dict(x=[0.0, 1.0]), # important to use 'float' here! with 'int' it is working fine
    dims=("x")
)
print(xarr1.x.data[0]) # 0.0

xarr2 = xarr1.copy(deep=True)
xarr2.x.data[0] = 45
print(xarr1.x.data[0]) # gives 45

Interesting, that if your coordinates are int, then the issue is gone

What did you expect to happen?

import xarray as xr

xarr1 = xr.DataArray(
    np.zeros([2]), 
    coords=dict(x=[0.0, 1.0]), # important to use 'float' here! with 'int' it is working fine
    dims=("x")
)
print(xarr1.x.data[0]) # 0.0

xarr2 = xarr1.copy(deep=True)
xarr2.x.data[0] = 45
print(xarr1.x.data[0]) # I expect it to be 0.0

Minimal Complete Verifiable Example

import xarray as xr

xarr1 = xr.DataArray(
    np.zeros([2]), 
    coords=dict(x=[0.0, 1.0]), # important to use 'float' here! with 'int' it is working fine
    dims=("x")
)
print(xarr1.x.data[0]) # 0.0

xarr2 = xarr1.copy(deep=True)
xarr2.x.data[0] = 45
print(xarr1.x.data[0]) # gives 45

MVCE confirmation

Relevant log output

No response

Anything else we need to know?

No response

Environment

xarray 2023.1.0 python 3.8.10
headtr1ck commented 1 year ago

It seems that in copy IndexVariables are treated special and are explicitly excluded from copying. @benbovy what was the reasoning behind this choice?

benbovy commented 1 year ago

I think that the reverting change in IndexVariable came after refactoring copy in Xarray introduced some performance regression (https://github.com/pydata/xarray/pull/7209#issuecomment-1305593478).

I didn't see #1463 (https://github.com/pydata/xarray/issues/1463#issuecomment-340454702), though. It feels weird to me that we can mutate an IndexVariable via its data property, considering that the underlying index is immutable. IIUC xarr2.x.data[0] = 45 replaces the full index with a new one? I'm not sure if it is a good idea to allow this. For a pandas index that's probably OK (it is reasonably cheap to rebuild a new index) but for a custom index that is expensive to build (e.g., kd-tree) I don't think this behavior is desirable.

headtr1ck commented 1 year ago

I didn't see #1463 (#1463 (comment)), though. It feels weird to me that we can mutate an IndexVariable via its data property, considering that the underlying index is immutable. IIUC xarr2.x.data[0] = 45 replaces the full index with a new one? I'm not sure if it is a good idea to allow this. For a pandas index that's probably OK (it is reasonably cheap to rebuild a new index) but for a custom index that is expensive to build (e.g., kd-tree) I don't think this behavior is desirable.

should we then raise an error if someone tries to replace values in an index?

benbovy commented 1 year ago

Yes I think we should, but I might have missed the rationale behind allowing it if this is intentional.

EDIT: perhaps better to issue a warning first to avoid some breaking change. We could also try to fix it (make a deep copy) at the same time as deprecating it, but that might be tricky without again introducing performance regressions.

mgaspary commented 1 year ago

Yes I think we should, but I might have missed the rationale behind allowing it if this is intentional.

EDIT: perhaps better to issue a warning first to avoid some breaking change. We could also try to fix it (make a deep copy) at the same time as deprecating it, but that might be tricky without again introducing performance regressions.

I would assume that deepcopy are completely going to copy the object, so if I change internal parts (like coordinates here), 1st object shall not change. It definitely affects the performance, but otherwise deepcopy is not deep anymore

benbovy commented 1 year ago

There are two issues:

dcherian commented 1 year ago

whether we should continue allowing IndexVariable data be updated in place via .data property. IMO we should really deprecate it

I agree. We don't allow it on .values anyway for this reason.

Now I see we already did this in https://github.com/pydata/xarray/pull/3862/files So I guess that got lost in the indexes refactor.

whether deep=True should deep copy the Xarray index objects

I would expect deep=True to deep-copy absolutely everything. That said if the Index objects are immutable then it doesn't matter?

On that pandas thread I see:

discovered that even if the id of the index was different on the copy, modifying the cp.index.to_numpy() values was corrupting the original.

Seems like we should all be setting the numpy data to be read-only with array.flags.WRITEABLE = False

lee1043 commented 5 days ago

Hi, I am hitting the similar issue, and curious if there any plan for this issue. I think when deep=True, reasonable expectation is everything is going to be deep copied, regardless of coordinate or data variable. This seems like a long standing issue (e.g., https://groups.google.com/g/xarray/c/ksMI8TmRPaA?pli=1).

max-sixty commented 5 days ago

If there are difficulties or ambiguities here, would it be reasonable to switch to deep-copying everything for the moment?

If we can figure out optimizations on things that are definitely immutable, then great, but in the meantime any objections to defaulting to the slower-but-always-correct thing?