mgaspary opened 1 year ago
It seems that in `copy`, `IndexVariable` objects are treated specially and are explicitly excluded from copying. @benbovy, what was the reasoning behind this choice?
I think that the reverting change in `IndexVariable` came after the `copy` refactor in Xarray introduced a performance regression (https://github.com/pydata/xarray/pull/7209#issuecomment-1305593478).
I didn't see #1463 (https://github.com/pydata/xarray/issues/1463#issuecomment-340454702), though. It feels weird to me that we can mutate an IndexVariable via its `data` property, considering that the underlying index is immutable. IIUC `xarr2.x.data[0] = 45` replaces the full index with a new one? I'm not sure it is a good idea to allow this. For a pandas index that's probably OK (it is reasonably cheap to rebuild a new index), but for a custom index that is expensive to build (e.g., a kd-tree) I don't think this behavior is desirable.
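As a concrete illustration of that immutability (a minimal pandas-only sketch, not xarray internals): a pandas `Index` rejects in-place assignment outright, so anything that looks like an in-place update actually has to rebuild the whole index.

```python
import pandas as pd

idx = pd.Index([10, 20, 30])

# Direct item assignment on a pandas Index raises TypeError:
# the index object itself is immutable.
try:
    idx[0] = 45
except TypeError as err:
    print("immutable:", err)

# An "in-place" update therefore means rebuilding the index wholesale,
# which is cheap for a plain pandas Index but potentially expensive
# for a custom structure such as a kd-tree.
values = idx.to_numpy().copy()  # copy so we don't touch the original buffer
values[0] = 45
rebuilt = pd.Index(values)
print(list(rebuilt))  # [45, 20, 30]
```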
Should we then raise an error if someone tries to replace values in an index?
Yes, I think we should, but I might have missed the rationale behind allowing it, if this is intentional.
EDIT: perhaps better to issue a warning first to avoid a breaking change. We could also try to fix it (make a deep copy) at the same time as deprecating it, but that might be tricky without again introducing performance regressions.
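A minimal sketch of that warn-first approach (the class and property names here are hypothetical stand-ins, not xarray's actual API): emit a `FutureWarning` from the accessor now, and only later replace it with a hard error or a defensive copy.

```python
import warnings
import numpy as np

class FakeIndexVariable:
    """Hypothetical stand-in for IndexVariable, to sketch the deprecation path."""

    def __init__(self, values):
        self._values = np.asarray(values)

    @property
    def data(self):
        # Warn that mutating the returned array is deprecated; a later
        # release could return a read-only view or a copy instead.
        warnings.warn(
            "mutating index data via .data is deprecated and will raise "
            "an error in a future version",
            FutureWarning,
            stacklevel=2,
        )
        return self._values

var = FakeIndexVariable([1, 2, 3])
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    _ = var.data
print(caught[0].category.__name__)  # FutureWarning
```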
I would assume that a deepcopy completely copies the object, so if I change internal parts (like the coordinates here), the first object should not change. It definitely affects performance, but otherwise `deepcopy` is not deep anymore.
There are two issues:
1. whether we should continue allowing `IndexVariable` data to be updated in place via the `.data` property. IMO we should really deprecate it, especially now that it is possible to have custom, possibly expensive index structures built from one or more coordinates.
2. whether `deep=True` should deep-copy the Xarray index objects. I don't have a strong opinion on this. There is a similar discussion on the pandas side: https://github.com/pandas-dev/pandas/issues/19862. I wonder if we reverted the change here because some high-level operations in Xarray were deep-copying the indexes by default? I don't think we would want such behavior unless the user explicitly sets `deep=True` somewhere.
whether we should continue allowing `IndexVariable` data to be updated in place via the `.data` property. IMO we should really deprecate it

I agree. We don't allow it on `.values` anyway, for this reason.
Now I see we already did this in https://github.com/pydata/xarray/pull/3862/files. So I guess that got lost in the indexes refactor.
whether `deep=True` should deep copy the Xarray index objects

I would expect `deep=True` to deep-copy absolutely everything. That said, if the Index objects are immutable, then it doesn't matter?
On that pandas thread I see:

discovered that even if the id of the index was different on the copy, modifying the `cp.index.to_numpy()` values was corrupting the original.
Seems like we should all be setting the numpy data to be read-only with `array.flags.writeable = False`.
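For reference, marking a numpy buffer read-only is done through the lowercase `writeable` flag (dict-style `flags['WRITEABLE']` also works), after which any write attempt raises `ValueError`:

```python
import numpy as np

a = np.arange(3)
a.flags.writeable = False  # mark the underlying buffer read-only

try:
    a[0] = 99
except ValueError as err:
    print("blocked:", err)  # writes now raise ValueError

# Views inherit the restriction, so downstream code can't sneak a write
# in through a slice either.
v = a[1:]
print(v.flags.writeable)  # False
```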
Hi, I am hitting a similar issue, and am curious whether there is any plan for this. I think when `deep=True`, the reasonable expectation is that everything is going to be deep-copied, regardless of whether it is a coordinate or a data variable. This seems like a long-standing issue (e.g., https://groups.google.com/g/xarray/c/ksMI8TmRPaA?pli=1).
If there are difficulties or ambiguities here, would it be reasonable to switch to deep-copying everything for the moment?
If we can figure out optimizations for things that are definitely immutable, then great, but in the meantime are there any objections to defaulting to the slower-but-always-correct thing?
I also just stumbled across this bug. A rogue plotting script is changing the longitude of a map internally, and even when the script is passed a `da.copy()`, the `da`'s own longitude is modified after plotting. Deep copies should be truly deep. If the performance loss is too large to impose on everyone, the copy function should get a new optional parameter to force a really-deep copy mode.
I don't hear anyone clamoring to retain skipping indexes from deep copy. Would anyone like to do a PR to include them in deep copy, and we'll plan to merge that unless anyone comes up with a thought-through objection?
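To see why excluding a field from deep copy leaks mutations back into the original, here is a self-contained toy sketch (a deliberately buggy `__deepcopy__` on a made-up container, not xarray's actual implementation):

```python
import copy
import numpy as np

class ToyDataArray:
    """Toy container whose deep copy skips its coordinate array on purpose."""

    def __init__(self, data, coord):
        self.data = np.asarray(data, dtype=float)
        self.coord = np.asarray(coord, dtype=float)

    def __deepcopy__(self, memo):
        # BUG on purpose: self.coord is passed through uncopied,
        # mirroring indexes being excluded from deep copy.
        return ToyDataArray(copy.deepcopy(self.data, memo), self.coord)

da = ToyDataArray([1.0, 2.0], coord=[100.0, 200.0])
cp = copy.deepcopy(da)

cp.coord[0] = -1.0   # mutate only the "copy"...
print(da.coord[0])   # -1.0: ...and the original is corrupted too

# Deep-copying self.coord as well in __deepcopy__ restores the
# isolation that callers of copy.deepcopy expect.
```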
What happened?
The coordinates are not copied if you perform a deepcopy of an xarray object. This issue was fixed before; I don't see it, for example, in xarray 2022.12.0, but in the latest version the issue is there again. Interestingly, if your coordinates are `int`, then the issue is gone.
What did you expect to happen?
Minimal Complete Verifiable Example
MVCE confirmation
Relevant log output
No response
Anything else we need to know?
No response
Environment