pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0
3.56k stars 1.07k forks source link

DataArray.rename does not copy data #9432

Open zerothi opened 1 week ago

zerothi commented 1 week ago

What is your issue?

From the documentation: https://docs.xarray.dev/en/stable/generated/xarray.DataArray.rename.html#xarray.DataArray.rename it isn't clear if the contained data is copied (new memory), or whether it is retained.

It says:

Returns a new DataArray with renamed coordinates, dimensions or a new name.

which may indicate that it copies, however it doesn't:

import numpy as np
import xarray as xr

N = 100
O = xr.DataArray(np.random.rand(N, N), dims=["state [r]", "state [c]"])
s = xr.DataArray(np.random.rand(N, 2), dims=["state [c]", "spin"])

import time
t0 = time.time()
_ = O.rename({"state [r]": "state [c]", "state [c]": "state [r]"})
print("rename: ", time.time() - t0)
print(np.shares_memory(O.values, _.values))

t0 = time.time()
_ = O.values.copy()
print("copy: ", time.time() - t0)
print(np.shares_memory(O.values, _))

Now:

  1. I think the current behaviour is correct. It is nice that we can rename stuff, without having to copy the entire array.
  2. I only think a clarification is necessary for the documentation.
  3. If there are plans for this to copy it, then I think there should atleast be some inplace arguments allowing this.
max-sixty commented 1 week ago

What would you suggest as a change?

zerothi commented 1 week ago

Good question, it probably involves much more elaboration.

Returns a new DataArray with renamed coordinates, dimensions or a new name. The new DataArray will share memory location with the old DataArray: np.shares_memory(new.values, old.values) is True.

I don't know how to do this succinctly without ambiguity. ;) It was just a puzzle to me since I didn't know what to expect from doing the operation. And since there is no declaration of the intended use, I am not sure I can trust xarray to do the same in newer releases. Since I would depend on it not making a copy, I just wanted to know what xarray intends to do, and if it wants to keep the current behavior, I would like it emphasized.

I guess the same question/clarification could be said about https://docs.xarray.dev/en/latest/generated/xarray.Dataset.rename.html and friends.

I can make a PR, in case there is a common ground as to what it should say... :)

max-sixty commented 1 week ago

I think the distinction here is between a new DataArray and a copy of the underlying data.

To me that's less confusing but possibly that's because I'm too deep in it.

One issue is that a substantial number of methods do this, and we don't want to have the same description on every method...

zerothi commented 1 week ago

I see, if I read new DataArray I would implicitly assume there is no connection between the new and the old object. Where does the boundary of new lie? Dimension names? Coords and values, data values, attributes, etc? It isn't clear (at least to me) what new encompasses.

I agree about the number of methods that does this would mean a lot of changes, and digging in the code. I could, of course, also be wrong here. But I would think that most users don't know about this detail.

zerothi commented 1 week ago

I.e. this might suprise some users:

import xarray as xr
import numpy as np

data = np.random.rand(3, 3)
print(data)
a = xr.DataArray(
    data,
    dims=list("xy"),
)
print(a)

b = a.rename({"x": "z"})
print(b)

b[0, 1] = 2
print(a)
max-sixty commented 1 week ago

One change could just be:

- Returns a new DataArray with renamed coordinates, dimensions or a new name.
+ Returns a DataArray with renamed coordinates, dimensions or a new name.

...which is possibly less confusing while being simpler. Then if folks want to know about when xarray copies the underlying data, they can look at .copy and learn more there.

Would accept this as a PR unless there are any objections

zerothi commented 1 week ago

I don't see the connection to copy, if anything it should be:

- Returns a new DataArray with renamed coordinates, dimensions or a new name.
+ Returns a  shallow copied(?) DataArray with renamed coordinates, dimensions or a new name.

Is copied necessary here?
for absolute clarity. With your proposal you still don't know if it is a shallow or deep copy. And reading copy does not give any more information.

max-sixty commented 6 days ago

OK, I had thought that this might shed some light on the difference between shallow & deep copies: https://github.com/pydata/xarray/blob/a74a60508edd844ef425637f27598f1d8c5385a2/xarray/core/dataarray.py#L1230-L1237. On reflection I agree it's a bit tangential.

Personally I think that "Returns a shallow copied DataArray" is too many words + not a great description of what's going on — it's a new object rather than a shallow copy. But if others believe it to be better, very happy to take the majority view.

I would support removing the new given it reduces the word count + I think removes some minor confusion...

And if we don't already have a section on how most method work — i.e. they return a new object with the original data as a property — then that would be a great contribution.

zerothi commented 6 days ago

I agree it isn't easy. Hard to clarify without lots of words. Perhaps for the sake of changing something we could stick with removing new, then if this pops up later, we at least have a reference here.

I just think it would be simpler if the documentation clarifies the intention and behaviour, since currently I do tests to see whats happening, but I don't know if these things changes down the line.

The same thing happened to scipy objects (sparse matrices) where __i<op>__ methods didn't actually behave like one would think. I know it is hard, but clarity is nice! ;)

max-sixty commented 6 days ago

Yes totally!

For xarray, the mental model is that all methods:

I think that mental model gets us to a simple correct answer 98% of the time (the only remaining ambiguities are a couple of small cases on whether a copy is required)