pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0
3.64k stars 1.09k forks source link

Coordinates not deep copy #7463

Open mgaspary opened 1 year ago

mgaspary commented 1 year ago

What happened?

The coordinates are not copied if you perform deepcopy for xarray. This issue was fixed before. I don't see it, for example, in xarray 2022.12.0, but in the latest version the issue is there again.

import xarray as xr

xarr1 = xr.DataArray(
    np.zeros([2]), 
    coords=dict(x=[0.0, 1.0]), # important to use 'float' here! with 'int' it is working fine
    dims=("x")
)
print(xarr1.x.data[0]) # 0.0

xarr2 = xarr1.copy(deep=True)
xarr2.x.data[0] = 45
print(xarr1.x.data[0]) # gives 45

Interesting, that if your coordinates are int, then the issue is gone

What did you expect to happen?

import xarray as xr

xarr1 = xr.DataArray(
    np.zeros([2]), 
    coords=dict(x=[0.0, 1.0]), # important to use 'float' here! with 'int' it is working fine
    dims=("x")
)
print(xarr1.x.data[0]) # 0.0

xarr2 = xarr1.copy(deep=True)
xarr2.x.data[0] = 45
print(xarr1.x.data[0]) # I expect it to be 0.0

Minimal Complete Verifiable Example

import xarray as xr

xarr1 = xr.DataArray(
    np.zeros([2]), 
    coords=dict(x=[0.0, 1.0]), # important to use 'float' here! with 'int' it is working fine
    dims=("x")
)
print(xarr1.x.data[0]) # 0.0

xarr2 = xarr1.copy(deep=True)
xarr2.x.data[0] = 45
print(xarr1.x.data[0]) # gives 45

MVCE confirmation

Relevant log output

No response

Anything else we need to know?

No response

Environment

xarray 2023.1.0 python 3.8.10
headtr1ck commented 1 year ago

It seems that in copy IndexVariables are treated special and are explicitly excluded from copying. @benbovy what was the reasoning behind this choice?

benbovy commented 1 year ago

I think that the reverting change in IndexVariable came after refactoring copy in Xarray introduced some performance regression (https://github.com/pydata/xarray/pull/7209#issuecomment-1305593478).

I didn't see #1463 (https://github.com/pydata/xarray/issues/1463#issuecomment-340454702), though. It feels weird to me that we can mutate an IndexVariable via its data property, considering that the underlying index is immutable. IIUC xarr2.x.data[0] = 45 replaces the full index with a new one? I'm not sure if it is a good idea to allow this. For a pandas index that's probably OK (it is reasonably cheap to rebuild a new index) but for a custom index that is expensive to build (e.g., kd-tree) I don't think this behavior is desirable.

headtr1ck commented 1 year ago

I didn't see #1463 (#1463 (comment)), though. It feels weird to me that we can mutate an IndexVariable via its data property, considering that the underlying index is immutable. IIUC xarr2.x.data[0] = 45 replaces the full index with a new one? I'm not sure if it is a good idea to allow this. For a pandas index that's probably OK (it is reasonably cheap to rebuild a new index) but for a custom index that is expensive to build (e.g., kd-tree) I don't think this behavior is desirable.

should we then raise an error if someone tries to replace values in an index?

benbovy commented 1 year ago

Yes I think we should, but I might have missed the rationale behind allowing it if this is intentional.

EDIT: perhaps better to issue a warning first to avoid some breaking change. We could also try to fix it (make a deep copy) at the same time as deprecating it, but that might be tricky without again introducing performance regressions.

mgaspary commented 1 year ago

Yes I think we should, but I might have missed the rationale behind allowing it if this is intentional.

EDIT: perhaps better to issue a warning first to avoid some breaking change. We could also try to fix it (make a deep copy) at the same time as deprecating it, but that might be tricky without again introducing performance regressions.

I would assume that deepcopy are completely going to copy the object, so if I change internal parts (like coordinates here), 1st object shall not change. It definitely affects the performance, but otherwise deepcopy is not deep anymore

benbovy commented 1 year ago

There are two issues:

dcherian commented 1 year ago

whether we should continue allowing IndexVariable data be updated in place via .data property. IMO we should really deprecate it

I agree. We don't allow it on .values anyway for this reason.

Now I see we already did this in https://github.com/pydata/xarray/pull/3862/files So I guess that got lost in the indexes refactor.

whether deep=True should deep copy the Xarray index objects

I would expect deep=True to deep-copy absolutely everything. That said if the Index objects are immutable then it doesn't matter?

On that pandas thread I see:

discovered that even if the id of the index was different on the copy, modifying the cp.index.to_numpy() values was corrupting the original.

Seems like we should all be setting the numpy data to be read-only with array.flags.WRITEABLE = False

lee1043 commented 2 months ago

Hi, I am hitting the similar issue, and curious if there any plan for this issue. I think when deep=True, reasonable expectation is everything is going to be deep copied, regardless of coordinate or data variable. This seems like a long standing issue (e.g., https://groups.google.com/g/xarray/c/ksMI8TmRPaA?pli=1).

max-sixty commented 2 months ago

If there are difficulties or ambiguities here, would it be reasonable to switch to deep-copying everything for the moment?

If we can figure out optimizations on things that are definitely immutable, then great, but in the meantime any objections to defaulting to the slower-but-always-correct thing?

juntyr commented 1 month ago

I also just stumbled across this bug. A rogue plotting script is changing the longitude of a map internally, and even when passing the script a da.copy(), the da's own longitude is modified after plotting. Deep copies should be truly deep. If the performance loss is too large to pass on to everyone, the copy function should get a new optional parameter to force the really-deep copy mode.

max-sixty commented 1 month ago

I don't hear anyone clamoring to retain skipping indexes from deep copy. Would anyone like to do a PR to include them in deep copy, and we'll plan to merge that unless anyone comes up with a thought-through objection?