pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0
3.57k stars 1.07k forks source link

Setting datarrays with non-dimension coordinates errors #8028

Open Illviljan opened 1 year ago

Illviljan commented 1 year ago

What happened?

I'm not sure if this is a bug or a feature but I was expecting this example to work since the new coord is just a slight rewrite of the original dimension coordinate:

import xarray as xr

ds = xr.tutorial.open_dataset("air_temperature")

# Change the first time value:
ds["air_new"] = ds.air.copy()
air_new_changed = ds.air_new[{"time": 0}] * 3
ds.air_new.loc[air_new_changed.coords] = air_new_changed  # Works! :)

# Add a another coord along time axis and change 
# the first time value:
ds["air_new"] = ds.air.copy().assign_coords(
    {"time_float": ds.time.astype(float)}
)
air_new_changed = ds.air_new[{"time": 0}] * 4
ds.air_new.loc[air_new_changed.coords] = air_new_changed  # Error! :(

Traceback (most recent call last):

  Cell In[25], line 5
    ds.air_new.loc[air_new_changed.coords] = air_new_changed

  File ~\AppData\Local\mambaforge\envs\jw\lib\site-packages\xarray\core\dataarray.py:222 in __setitem__
    dim_indexers = map_index_queries(self.data_array, key).dim_indexers

  File ~\AppData\Local\mambaforge\envs\jw\lib\site-packages\xarray\core\indexing.py:182 in map_index_queries
    grouped_indexers = group_indexers_by_index(obj, indexers, options)

  File ~\AppData\Local\mambaforge\envs\jw\lib\site-packages\xarray\core\indexing.py:144 in group_indexers_by_index
    raise KeyError(f"no index found for coordinate {key!r}")

KeyError: "no index found for coordinate 'time_float'"
benbovy commented 1 year ago

I'd say it is rather a feature. Despite being very similar to the original one, the new coord is not a dimension coordinate, which is thus not indexed by default so loc or other label-based indexing won't work with it.

benbovy commented 1 year ago

That said, we could probably make the error message a bit more user-friendly and verbose than just saying that no index is found, e.g.,

"No index found for coordinate 'x', which therefore cannot be used with .loc or .sel (label-based indexing). Set an index first for that coordinate using either set_index or set_xindex. Note that you can still use .iloc and .isel (integer-based indexing) when no index is set."

Illviljan commented 1 year ago

Would you mind showing the intended way for my example? Because I don't find this intuitive at all.

ds["air_new"] = ds.air.copy().assign_coords(
    {"time_float": ds.time.astype(float)}
)
# ds["air_new"] = ds["air_new"].set_xindex("time_float") # Doesn't work
a = ds["air_new"].set_xindex("time_float")
air_new_changed = a[{"time": 0}] * 4
a.loc[air_new_changed.coords] = air_new_changed

ValueError: Xarray does not support label-based selection with more than one index over the following dimension(s):
'time': 2 indexes involved
Suggestion: use a multi-index for each of those dimension(s).

Other tests:

ds.air_new.iloc[air_new_changed.coords] = air_new_changed  # Error! :(
AttributeError: 'DataArray' object has no attribute 'iloc'

ds.air_new.sel[air_new_changed.coords] = air_new_changed  # Error! :(
TypeError: 'method' object does not support item assignment
benbovy commented 1 year ago

Haha yes agreed that is confusing. So improving the error messages will require more coordination across the codebase and thinking more about the possible cases. That is not an easy task.

Regarding your example, just setting a single index for "time_float" does not work: currently you cannot call loc or sel and pass labels at once for two or more independently indexed coordinates that share the same dimension (the error message mostly makes sense except the "dimension" term that is not accurate here). This because it is easier to raise now when such case occurs and maybe figure out later how we can merge results from different index queries along the same dimension.

(My bad, Xarray has no iloc indeed... I'm mostly using isel/sel).

I'm not sure what exactly you're trying to achieve in your example, actually.

Illviljan commented 1 year ago

I simply want to modify a few values in an array. :) A similar example in numpy-land:

a = np.array([1, 1, 1, 1])
b = a[0] * 3
a[0] = b

Now in xarray land this ds["air_new"] happens to have a few helper coordinates. This is what I came up with and it took longer than I'd like to admit:

# Remove unneccessary coordinates without indexes:
ds["air_new"] = ds.air.copy().assign_coords({"time_float": ds.time.astype(float)})
air_new_changed = ds.air_new[{"time": 0}] * 3
ds.air_new.loc[
    {
        k: air_new_changed.coords[k].values
        for k in air_new_changed.coords.keys() & air_new_changed.indexes.keys()
    }
] = air_new_changed

Is this the most elegant way? How would you have done it?

Using masks doesn't have this issue for some reason:

# Use a mask instead...
ds["air_new"] = ds.air.copy().assign_coords({"time_float": ds.time.astype(float)})
mask = xr.ones_like(ds["air_new"], dtype=bool)
mask.loc[{"time": ds.time.isel(time=0)}] = False
ds["air_new"] = ds["air_new"].where(mask, 3 * ds["air_new"])

Why would I want to pass around helper coordinates?

benbovy commented 1 year ago

Yes helper coordinates are useful. However I don't really understand why do you want to pass labels to .loc for all of those coordinates (or all the indexed ones) at once?

Since your helper coordinates are all derived from the same original coordinate, couldn’t you just select data using the one that is the most convenient to you ? E.g.,

ds.air_new.loc[{"time": air_new_changed.time.values}] = air_new_changed

This should yield the same results (and it is much more elegant) than

ds.air_new.loc[
    {
        k: air_new_changed.coords[k].values
        for k in air_new_changed.xindexes.keys()
    }
] = air_new_changed

Both the "time" and "time_float" coordinates share the same dimension and should be subset / propagated the same way.

I doubt that you want to update values on a selection using arbitrary (possibly non-overlapping) labels for each of those coordinates, do you? E.g.,

ds.air_new.loc[{"time"="2013-01-01", "time_float"=1.42e18}] = …

We don’t support that (yet) because it is more complicated. How should we join the integer array indices selected on the "time" dimension from the given "time" and "time_float" labels? By union, intersection or difference? Or do we want an exact match and raise otherwise? It might get even more complicated when multiple indexes, coordinates and/or dimensions are involved. In theory, we could support it but it would represent quite a lot of work and the implementation would quickly get much more complicated.

I can imagine how just passing all coordinates like

ds.air_new.loc[air_new_changed.coords] = air_new_changed 

would be convenient as (with a carefully chosen default behavior) it removes some manual steps or thinking, similarly to Xarray auto-alignment. However, after we start considering more use cases we will quickly see that there is no easy generic solution, like for alignment (#7045). Note that we also don't support alignment when multiple (independent) indexes are found along the same dimension (example here) for the same reasons.