Using a tuple as a sequence in DataArray.sel no longer supported?

momchil-flex commented 2 years ago

What happened?

Version 2022.6.0 raises an error when I call something like data_array.sel(coordinate=(val1, val2)). The selection now only works if the values are passed as a list instead.

What did you expect to happen?

In previous versions, tuples could also be supplied. However, I've been digging into this a bit, and I understand that there are generally some limitations on using tuples (or rather, they are sometimes overloaded). For example, in any version I can't use a tuple as an input coordinate when initializing a DataArray: I get the error "Could not convert tuple of form (dims, data[, attrs, encoding])" (this is known). I still wanted to report this as a bug, since the behavior in 2022.6.0 differs from previous versions, and to clarify whether tuples are intentionally unsupported as sel indexers. Neither the error message nor the docs make this clear. The example below works on versions < 2022.6.0 but raises an error on 2022.6.0.

Minimal Complete Verifiable Example

import xarray as xr
import numpy as np
arr = xr.DataArray(data=np.random.rand(10), coords={"c1": np.arange(10, dtype=np.float64)})
item = arr.sel(c1=(1, 2))
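
For comparison, here is a minimal sketch of the list-based call that, per the description above, still works on 2022.6.0. Passing dims explicitly is an addition for clarity and is not part of the original example.

```python
import numpy as np
import xarray as xr

arr = xr.DataArray(
    data=np.random.rand(10),
    dims="c1",  # explicit dimension name (added here for clarity)
    coords={"c1": np.arange(10, dtype=np.float64)},
)

# A list of labels is treated as a sequence of values to select.
item = arr.sel(c1=[1, 2])
print(item.shape)
```

The only difference from the failing call is the container type: [1, 2] instead of (1, 2).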

MVCE confirmation

[X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
[X] Complete example — the example is self-contained, including all data and the text of any traceback.
[X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
[X] New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.9.12 (main, Jun 1 2022, 11:38:51) [GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 5.13.0-52-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: None
xarray: 2022.6.0
pandas: 1.4.3
numpy: 1.23.0
scipy: 1.8.1
netCDF4: None
pydap: None
h5netcdf: None
h5py: 3.7.0
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2022.6.1
distributed: None
matplotlib: 3.5.2
cartopy: None
seaborn: None
numbagg: None
fsspec: 2022.5.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 61.2.0
pip: 21.2.4
conda: None
pytest: 7.1.2
IPython: 8.4.0
sphinx: None

benbovy commented 2 years ago

Thanks for the report @momchil-flex. That's definitely a regression.

However, I wonder what we should do: deprecate interpreting tuples as sequences and always treat them as "scalar" values, or continue interpreting them differently depending on the case?

For example, tuple indexer values were (and still are) assumed to be single-element values when selecting on a dimension coordinate with a multi-index (although the multi-index dimension coordinate might eventually be deprecated in xarray):

da = xr.DataArray(
    data=range(3),
    dims="x",
    coords={"a": ("x", ["a", "a", "c"]), "b": ("x", [0, 1, 2])},
).set_index(x=["a", "b"])

da
# <xarray.DataArray (x: 3)>
# array([0, 1, 2])
# Coordinates:
#   * x        (x) object MultiIndex
#   * a        (x) <U1 'a' 'a' 'c'
#   * b        (x) int64 0 1 2

da.sel(x=("a", 1))
# <xarray.DataArray ()>
# array(1)
# Coordinates:
#     x        object ('a', 1)
#     a        <U1 'a'
#     b        int64 1
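
As a rough illustration of why a tuple reads naturally as a single label here: on the pandas side, each entry of a MultiIndex is itself a tuple, so looking up one entry takes one tuple. The snippet below uses pandas directly and is only a sketch; it does not show how Xarray dispatches .sel internally.

```python
import pandas as pd

# Same two levels as the "a" and "b" coordinates in the example above.
midx = pd.MultiIndex.from_arrays([["a", "a", "c"], [0, 1, 2]], names=["a", "b"])

# Iterating over the index yields one tuple per entry.
print(list(midx))

# A full tuple key identifies a single entry and returns its integer position.
pos = midx.get_loc(("a", 1))
print(pos)
```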

Pros of always treating a tuple as a 1-element indexer value:

Clearer
Fewer special cases to maintain internally in Xarray

Cons:

With flexible indexes, Xarray currently just passes the indexers to the corresponding (custom) indexes and leaves it to those indexes to process them as they see fit. Although we have some control over the behavior of the PandasIndex and PandasMultiIndex built into Xarray, we have no control over 3rd-party indexes, unless we somehow formalize the semantics of the indexer values passed to .sel(). That could be challenging, as there are many kinds of indexers (scalar types, tuples, lists, slices, numpy arrays, xarray Variable or DataArray objects, etc.).

dcherian commented 2 years ago

I like the idea of just passing tuples through and letting the index deal with it. Just like a MultiIndex, there may be other cases where this makes sense.

For the current PandasIndex maybe we can raise a nicer error in .sel?
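
One way such a check could look, as a rough sketch only: check_label_for_flat_index is a hypothetical helper, not an Xarray function, and the error wording is made up for illustration.

```python
def check_label_for_flat_index(coord_name, label):
    """Hypothetical pre-check for a single (non-multi) index; not part of Xarray."""
    if isinstance(label, tuple):
        # Point the user at the list form, which is read as a sequence of labels.
        raise TypeError(
            f"cannot use a tuple as a label for coordinate {coord_name!r}; "
            f"pass a list instead, e.g. {list(label)!r}, to select several values"
        )
    return label


try:
    check_label_for_flat_index("c1", (1, 2))
except TypeError as err:
    message = str(err)
    print(message)
```

Lists and scalars pass through unchanged, so only the ambiguous tuple case gets the extra message.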
