pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Automatically create `xindex`? #9703

Open max-sixty opened 2 weeks ago

max-sixty commented 2 weeks ago

Is your feature request related to a problem?

I'm trying to use xindex more. Currently, selecting values using a coordinate that hasn't been explicitly indexed via set_xindex() raises a KeyError:

import xarray as xr

ds = xr.tutorial.open_dataset("air_temperature").assign_coords(lat2=lambda x: x.lat)

ds
# Output:
<xarray.Dataset> Size: 31MB
Dimensions:  (lat: 25, time: 2920, lon: 53)
Coordinates:
  * lat      (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
  * lon      (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
  * time     (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
    lat2     (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
Data variables:
    air      (time, lat, lon) float64 31MB ...
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...

# Attempting to select using the unindexed coordinate raises an error:
ds.sel(lat2=75)
# Output:
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[20], line 1
----> 1 ds.sel(lat2=75)

File ~/workspace/xarray/xarray/core/dataset.py:3223, in Dataset.sel(self, indexers, method, tolerance, drop, **indexers_kwargs)
   3155 """Returns a new dataset with each array indexed by tick labels
   3156 along the specified dimension(s).
   3157
   (...)
   3220
   3221 """
   3222 indexers = either_dict_or_kwargs(indexers, indexers_kwargs, "sel")
-> 3223 query_results = map_index_queries(
   3224     self, indexers=indexers, method=method, tolerance=tolerance
   3225 )
   3227 if drop:
   3228     no_scalar_variables = {}

File ~/workspace/xarray/xarray/core/indexing.py:186, in map_index_queries(obj, indexers, method, tolerance, **indexers_kwargs)
    183     options = {"method": method, "tolerance": tolerance}
    185 indexers = either_dict_or_kwargs(indexers, indexers_kwargs, "map_index_queries")
--> 186 grouped_indexers = group_indexers_by_index(obj, indexers, options)
    188 results = []
    189 for index, labels in grouped_indexers:

File ~/workspace/xarray/xarray/core/indexing.py:145, in group_indexers_by_index(obj, indexers, options)
    143     grouped_indexers[index_id][key] = label
    144 elif key in obj.coords:
--> 145     raise KeyError(f"no index found for coordinate {key!r}")
    146 elif key not in obj.dims:
    147     raise KeyError(
    148         f"{key!r} is not a valid dimension or coordinate for "
    149         f"{obj.__class__.__name__} with dimensions {obj.dims!r}"
    150     )

KeyError: "no index found for coordinate 'lat2'"

After explicitly setting the index, it works as expected:

ds.set_xindex('lat2').sel(lat2=75)
# Output:
<xarray.Dataset> Size: 1MB
Dimensions:  (time: 2920, lon: 53)
Coordinates:
    lat      float32 4B 75.0
  * lon      (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
  * time     (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
    lat2     float32 4B 75.0
Data variables:
    air      (time, lon) float64 1MB ...
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...

It's a bit annoying. I frequently attempt to select something, realize it doesn't have an index, add the .set_xindex call, and then try to remember to add each one at object creation. It feels like xarray isn't being as helpful as it could be.

Describe the solution you'd like

Could we instead set the xindex automatically when calling .sel?

We might want to force the user to create the index once, rather than paying the cost of building a new index on every call. On the other hand, index creation seems relatively cheap:

%timeit ds.assign_coords(lat2=ds.lat + 2).set_xindex('lat2')

349 µs ± 6.97 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

(I guess it could be possible to update a cache in place, so that creating a new index from the cache would be very cheap. But that could also cause quite confusing behavior if our implementation is wrong in any way, or if people share objects across threads, etc. In other words, the principle of "don't update in place" is useful.)
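For illustration, here is a minimal sketch of the requested behavior as a user-level wrapper. `sel_auto_index` is a hypothetical helper name, not an xarray API, and the small dataset is made up so the example runs offline. It builds any missing index on a temporary object and leaves the input untouched.

```python
import numpy as np
import xarray as xr

# Made-up dataset: "lat2" is a coordinate along "lat" with no index.
ds = xr.Dataset(
    {"air": (("time", "lat"), np.arange(6.0).reshape(2, 3))},
    coords={
        "time": [0, 1],
        "lat": [75.0, 72.5, 70.0],
        "lat2": ("lat", [75.0, 72.5, 70.0]),
    },
)


def sel_auto_index(obj, **indexers):
    """Hypothetical helper: like .sel, but index unindexed coordinates first."""
    missing = [
        name
        for name in indexers
        if name in obj.coords and name not in obj.xindexes
    ]
    # set_xindex returns a new object, so the caller's object is not modified.
    # One call per coordinate: each gets its own default (pandas) index.
    for name in missing:
        obj = obj.set_xindex(name)
    return obj.sel(**indexers)


result = sel_auto_index(ds, lat2=75.0)
```

Since the index is rebuilt on every call, this trades a little time per selection for convenience, which is the trade-off the timing above is meant to gauge.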

Describe alternatives you've considered

A set_xindex(...) param (i.e. literally an ellipsis, ...) that creates every index it can, which folks could call once after creating an object?
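A rough sketch of what such a call might do, written as a standalone function. `set_all_xindexes` is a hypothetical name, and "every index it can" is assumed here to mean every 1-D coordinate that has no index yet; the real rule would need discussion.

```python
import numpy as np
import xarray as xr

# Made-up dataset with one unindexed 1-D coordinate ("lat2").
ds = xr.Dataset(
    {"air": (("time", "lat"), np.zeros((2, 3)))},
    coords={
        "time": [0, 1],
        "lat": [75.0, 72.5, 70.0],
        "lat2": ("lat", [77.0, 74.5, 72.0]),
    },
)


def set_all_xindexes(obj):
    """Hypothetical helper: index every unindexed 1-D coordinate."""
    # Snapshot the names first, because obj is rebound inside the loop.
    candidates = [
        name
        for name, coord in obj.coords.items()
        if coord.ndim == 1 and name not in obj.xindexes
    ]
    for name in candidates:
        obj = obj.set_xindex(name)  # returns a new object each time
    return obj


indexed = set_all_xindexes(ds)
selected = indexed.sel(lat2=74.5)
```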

Additional context

No response

headtr1ck commented 2 weeks ago

I vaguely remember this coming up about a year ago, but I cannot seem to find the issue...

I think that this would be a great addition.

shoyer commented 2 weeks ago

👍 for automatically creating indexes when needed.

I would not modify the xarray object in place. Users can do this if they need the performance gains.
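A small sketch of that opt-in pattern, using a made-up dataset: `set_xindex` returns a new object, so a user who wants to pay the indexing cost only once can keep the indexed result and reuse it.

```python
import numpy as np
import xarray as xr

# Made-up dataset with an unindexed coordinate "lat2".
ds = xr.Dataset(
    {"air": (("lat",), np.array([1.0, 2.0, 3.0]))},
    coords={
        "lat": [75.0, 72.5, 70.0],
        "lat2": ("lat", [75.0, 72.5, 70.0]),
    },
)

# Build the index once and keep the result; ds itself is left unchanged.
indexed = ds.set_xindex("lat2")

# Later selections reuse the stored index instead of rebuilding it.
first = indexed.sel(lat2=75.0)
second = indexed.sel(lat2=70.0)
```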

max-sixty commented 2 weeks ago

One quick thought: should we add them when creating the object?

headtr1ck commented 2 weeks ago

Might be related: https://github.com/pydata/xarray/issues/8028