pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Using a tuple as a sequence in DataArray.sel no longer supported? #6835

Open momchil-flex opened 2 years ago

momchil-flex commented 2 years ago

What happened?

Version 2022.6.0 raises an error when I call something like data_array.sel(coordinate=(val1, val2)). The call now works only if the sequence of values is passed as a list.

What did you expect to happen?

In previous versions, tuples could also be supplied. However, I've been digging into this a bit, and I understand that there are some general limitations on using tuples (or rather, they are sometimes overloaded). For example, in any version I can't use a tuple as an input coordinate to initialize a DataArray; I get the error "Could not convert tuple of form (dims, data[, attrs, encoding])" (this is known). I still wanted to report this as a bug, since the behavior in 2022.6.0 differs from previous versions, and to clarify whether tuples are expected to be unsupported as sel indexer values. Neither the error message nor the docs make this clear. The example below works on versions < 2022.6.0 but raises an error on 2022.6.0.

Minimal Complete Verifiable Example

import xarray as xr
import numpy as np
arr = xr.DataArray(data=np.random.rand(10), coords={"c1": np.arange(10, dtype=np.float64)})
item = arr.sel(c1=(1, 2))
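
Until the behavior is settled, a caller-side workaround is to convert tuples to lists before calling sel. The helper below is only a sketch: `tuples_to_lists` is a hypothetical name, not an xarray function, and it should not be applied to multi-index coordinates, where a tuple denotes a single element.

```python
def tuples_to_lists(indexers):
    """Return a copy of `indexers` with every tuple value converted to a list.

    Hypothetical helper, not part of xarray. Do not use it for multi-index
    coordinates, where a tuple denotes a single element.
    """
    return {
        name: list(value) if isinstance(value, tuple) else value
        for name, value in indexers.items()
    }


# Tuples become lists; other indexer values (here a slice) pass through.
converted = tuples_to_lists({"c1": (1, 2), "c2": slice(0, 3)})
```

With the array from the example above, this would be called as arr.sel(**tuples_to_lists({"c1": (1, 2)})).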

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.9.12 (main, Jun 1 2022, 11:38:51) [GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 5.13.0-52-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: None

xarray: 2022.6.0
pandas: 1.4.3
numpy: 1.23.0
scipy: 1.8.1
netCDF4: None
pydap: None
h5netcdf: None
h5py: 3.7.0
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2022.6.1
distributed: None
matplotlib: 3.5.2
cartopy: None
seaborn: None
numbagg: None
fsspec: 2022.5.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 61.2.0
pip: 21.2.4
conda: None
pytest: 7.1.2
IPython: 8.4.0
sphinx: None

benbovy commented 2 years ago

Thanks for the report @momchil-flex. That's definitely a regression.

However, I wonder what we should do: deprecate interpreting tuples as sequences and always treat them as "scalar" values, or continue interpreting them differently depending on the case?

For example, tuple indexer values were (and still are) assumed to be single-element values when selecting on a dimension coordinate with a multi-index (although the multi-index dimension coordinate might eventually be deprecated in xarray):

da = xr.DataArray(
    data=range(3),
    dims="x",
    coords={"a": ("x", ["a", "a", "c"]), "b": ("x", [0, 1, 2])},
).set_index(x=["a", "b"])

da
# <xarray.DataArray (x: 3)>
# array([0, 1, 2])
# Coordinates:
#   * x        (x) object MultiIndex
#   * a        (x) <U1 'a' 'a' 'c'
#   * b        (x) int64 0 1 2

da.sel(x=("a", 1))
# <xarray.DataArray ()>
# array(1)
# Coordinates:
#     x        object ('a', 1)
#     a        <U1 'a'
#     b        int64 1
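
The same distinction shows up at the pandas level, which xarray's default indexes wrap. The following is a minimal sketch, assuming pandas is installed; the variable names are illustrative only.

```python
import pandas as pd

# A plain index: a list of labels selects several positions.
flat = pd.Index([0.0, 1.0, 2.0, 3.0])
positions = flat.get_indexer([1.0, 2.0])  # array([1, 2])

# A multi-index: a tuple is one complete key, so it selects a single position.
multi = pd.MultiIndex.from_arrays(
    [["a", "a", "c"], [0, 1, 2]], names=["a", "b"]
)
position = multi.get_loc(("a", 1))  # 1
```

For a multi-index the tuple is the natural spelling of a single label, which is why treating every tuple as a sequence cannot work in all cases.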

Pros of always treating a tuple as a 1-element indexer value:

Cons:

dcherian commented 2 years ago

I like the idea of just passing tuples through and letting the index deal with it. Just like a MultiIndex, there may be other cases where this makes sense.

For the current PandasIndex maybe we can raise a nicer error in .sel?
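
One way such a check could look, as a rough sketch only: `check_label` and its parameters are hypothetical names, not part of xarray, and an actual fix may be implemented quite differently.

```python
def check_label(label, coord_name, is_multi_index=False):
    """Reject tuple labels for indexes that cannot interpret them.

    Hypothetical helper for illustration; not xarray's implementation.
    """
    if isinstance(label, tuple) and not is_multi_index:
        raise ValueError(
            f"Tuple {label!r} passed as a label for coordinate {coord_name!r}, "
            "which has no multi-index. Pass a list (or array) to select "
            "several values."
        )
    # Anything else is passed through for the index to handle.
    return label
```

Under this sketch, check_label((1, 2), "c1") raises a ValueError that names the coordinate and suggests a list, while check_label(("a", 1), "x", is_multi_index=True) returns the tuple unchanged.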