pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0
3.64k stars 1.09k forks source link

Repeated coordinates leads to unintuitive (broken?) indexing behaviour #3731

Open ivirshup opened 4 years ago

ivirshup commented 4 years ago

MCVE Code Sample

import xarray as xr
import numpy as np

sample_idx = xr.IndexVariable("sample_id", ["a", "b", "c"])
da = xr.DataArray(np.eye(3), coords=(sample_idx, sample_idx))

da.shape
# (3, 3)
da[1, :].shape
# (3, 3)
da.loc["a", :].shape
# (3, 3)
da.loc[:, "a"].shape
# ()
da[:, 0].shape 
# ()
da[:, 1]
# <xarray.DataArray ()>
# array(1.)
# Coordinates:
#    sample_id  <U1 'b'

Expected Output

I had expected:

da.shape
# (3, 3)
da[1, :].shape
# (3)
da.loc["a", :].shape
# (3)
da.loc[:, "a"].shape
# (3)
da[:, 1]
# <xarray.DataArray (sample_id: 3)>
# array([0., 1., 0.])
# Coordinates:
#    sample_id  <U1 'a' 'b' 'c'

Problem Description

When coordinates are shared between dimensions (as would happen if a pairwise measurement is taken) indexing behaves strangely. It looks like indexing into the initial indices doesn't do anything, while indexing into the last index applies the selection across all dimensions.

da3d = xr.DataArray(
    np.arange(27).reshape((3,3,3)),
    coords=(sample_idx, sample_idx, sample_idx)
)

print(da3d.loc["a"].shape)
print(da3d.loc["a", "a"].shape)
print(da3d.loc[:, :, "a"].shape)
# (3, 3, 3)
# (3, 3, 3)
# ()

Output of xr.show_versions()

# Paste the output here xr.show_versions() here ``` INSTALLED VERSIONS ------------------ commit: None python: 3.7.6 (default, Jan 4 2020, 12:18:30) [Clang 11.0.0 (clang-1100.0.33.16)] python-bits: 64 OS: Darwin OS-release: 19.2.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 libhdf5: 1.10.2 libnetcdf: 4.6.3 xarray: 0.14.1+5.gb0064b25 pandas: 0.25.3 numpy: 1.18.1 scipy: 1.4.1 netCDF4: 1.5.2 pydap: None h5netcdf: 0.7.4 h5py: 2.9.0 Nio: None zarr: 2.4.0 cftime: 1.0.3.4 nc_time_axis: None PseudoNetCDF: None rasterio: None cfgrib: None iris: None bottleneck: None dask: 2.9.2 distributed: 2.9.3 matplotlib: 3.0.3 cartopy: None seaborn: 0.10.0 numbagg: None setuptools: 44.0.0 pip: 20.0.2 conda: None pytest: 5.3.4 IPython: 7.11.1 sphinx: 2.3.1 ```

Update, adding link to related issue:

TomNicholas commented 4 years ago

Thanks for this @ivirshup , I'm surprised at this too.

The problem seems to be that the DataArray you've managed to create breaks xarray's own data model! There should be one dim for each axis of the wrapped array, but

import xarray as xr
import numpy as np

sample_idx = xr.IndexVariable("sample_id", ["a", "b", "c"])
da = xr.DataArray(np.eye(3), coords=(sample_idx, sample_idx)
print(da)

gives a dataarray object which somehow has only one dim while wrapping a 2D array!

<xarray.DataArray (sample_id: 3)>
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])
Coordinates:
  * sample_id  (sample_id) <U1 'a' 'b' 'c'

Obviously xarray should have thrown you an error before allowing you to create this. It's no wonder the indexing is weird after this point.

I would have expected to get an array with two dims, which you can do by being more explicit:

da2d = xr.DataArray(np.eye(3), dims=['dim0', 'dim1'], coords=(sample_idx, sample_idx)) 
print(da2d)
<xarray.DataArray (dim0: 3, dim1: 3)>
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])
Coordinates:
  * dim0     (dim0) <U1 'a' 'b' 'c'
  * dim1     (dim1) <U1 'a' 'b' 'c'

(the coordinates aren't named how you want yet which is also a problem but at least this has a number of dimensions consistent with the data its wrapping.)

Indexing that object behaves more like you (and I) would expect:

da.shape
# (3, 3)

da[1, :].shape
# (3,)

da.loc["a", :].shape
# (3,)

da.loc[:, "a"].shape
# (3,)

da[:, 1]
<xarray.DataArray (dim0: 3)>
array([0., 1., 0.])
Coordinates:
  * dim0     (dim0) <U1 'a' 'b' 'c'
    dim1     <U1 'b'

It also doesn't fit xarray's data model to have two coordinates along different dimensions with the same name as one another. I suggest that you create two separate coords (i.e. sample_idx0 and sample_idx1), and assign them to each dim. Then you should be able to do what you want without weird behaviour.

(We should also fix DataArray.__init__() so that you can't construct this)

dcherian commented 4 years ago

Dupe of #2226 and #1499 . I agree with failing loudly in the constructor

ivirshup commented 4 years ago

Why not allow multiple dimensions with the same name? They can be disambiguated with positional indexing for when it matters. I think support for this would be useful for pairwise measures.

Here's a fun example/ current buggy behaviour:

import numpy as np
import xarray as xr
from string import ascii_letters

idx1 = xr.IndexVariable("dim1", [f"dim1-{i}" for i in ascii_letters[:10]])
idx2 = xr.IndexVariable("dim2", [f"dim2-{i}" for i in ascii_letters[:5]])

da1 = xr.DataArray(np.random.random_sample((10, 5)), coords=(idx1, idx2))
da2 = xr.DataArray(np.random.random_sample((5, 10)), coords=(idx2, idx1))

da1 @ da2
# <xarray.DataArray ()>
# array(13.06261098)
TomNicholas commented 4 years ago

Why not allow multiple dimensions with the same name? They can be disambiguated with positional indexing for when it matters.

I'm not sure it's that simple... What would you suggest the behaviour for da.isel(dim='ambiguous_dim') or da.mean(dim='ambiguous_dim') be?

ivirshup commented 4 years ago

This has also come up over in DimensionalData.jl, which I think is going for behavior I like. What I think would happen:

da.isel(dim='ambiguous_dim')

The selection is over all dimensions of that name.

da.mean(dim='ambiguous_dim')

The command is to reduce over dimensions of that name, the reduction should be performed over all dimensions with that name.

delgadom commented 2 years ago

ooh this is a fun one! came across this issue when we stumbled across a pendantic case writing tests (H/T @brews). I expected this to "fail loudly in the constructor" but it doesn't. note that currently AFAICT you cannot use positional slicing to achieve an intuitive result - the behavior seems more undefined/unpredictable

# setup
import xarray as xr, pandas as pd, numpy as np
da = xr.DataArray(np.arange(8).reshape(2, 2, 2), coords=[[0, 1], [0, 1], ['a', 'b']], dims=["ni", "ni", "shh"])

xarray seems to not know it has a problem:

In [4]: da
Out[4]:
<xarray.DataArray (ni: 2, shh: 2)>
array([[[0, 1],
        [2, 3]],

       [[4, 5],
        [6, 7]]])
Coordinates:
  * ni       (ni) int64 0 1
  * shh      (shh) <U1 'a' 'b'

slicing (somewhat intuitively? slices along both dims):

In [5]: da.sel(ni=0)
Out[5]:
<xarray.DataArray (shh: 2)>
array([0, 1])
Coordinates:
    ni       int64 0
  * shh      (shh) <U1 'a' 'b'

however, positional slicing (and any attempts I've made to handle the repeated dims differently) seems to have undefined behavior:

In [6]: da[0, :, :]  # positional slicing along first dim works as expected(?)
Out[6]:
<xarray.DataArray (ni: 2, shh: 2)>
array([[[0, 1],
        [2, 3]],

       [[4, 5],
        [6, 7]]])
Coordinates:
  * ni       (ni) int64 0 1
  * shh      (shh) <U1 'a' 'b'

In [7]: da[:, 0, :]  # positional slicing along second dim slices both dims
Out[7]:
<xarray.DataArray (shh: 2)>
array([0, 1])
Coordinates:
    ni       int64 0
  * shh      (shh) <U1 'a' 'b'
delgadom commented 2 years ago

FWIW I could definitely see use cases for allowing something like this... I have used cumbersome/ugly workarounds to work with variance-covariance matrices etc. So I'm not weighing in on the "this should raise an error" debate. I got briefly excited when I saw it didn't raise an error, until everything started unraveling 🙃

ilan-gold commented 1 year ago

Are there any thoughts as to how to move forward with this? It seems like a first step might be fixing __getitem__ behavior before moving on. Another option is that we can also place errors everywhere that they should be raised and then move forward, but simply changing the behavior (instead of placing errors) does not really seem like a violation of backwards compat because the current behavior is broken.

In the vein of __getitem__, there's something of an interesting conundrum; from what I can tell from these examples and my own messing around, isel/sel work as intended. The tough part is that indexing via da[sel, sel...] does not work, specifically because it calls isel and this operates on a Mapping type (I am referring broady to all the isel functions). So it seems to me we could proceed by either

  1. Allowing isel to accept a type like Iterable[tuple[Any, Any]] which would allow repeat names. In other words, we wold have something like
    def __getitem__(self, key: Any) -> Self:
        if isinstance(key, str):
            return self._getitem_coord(key)
        if self._has_repeated_dim_names:
            return self.isel(indexers=zip(key[:len(self.dims)], self.dims))
        return self.isel(indexers=self._item_key_to_dict(key))
  2. We write an entirely new function to be used in __getitem__ for this.

At the end of the day, both options will probably involve completely re-writing a function because isel and the stack it relies on internally operate on dicts of indexers but in a way that actually means isel works as intended (see e.g., here for an example of this fortunate coincidence). The difference is just the external API. I am in favor of not changing isel's signature.

From the getter methods, we could then move on to "external" functions like concat. What do we think?

P.S I have seen the discussion in https://github.com/pydata/xarray/issues/1378 about shared indices/splitting and I don't see this as mutually exclusive. I think the two could be quite compatible

TomNicholas commented 1 year ago

Are there any thoughts as to how to move forward with this? It seems like a first step might be fixing __getitem__ behavior before moving on.

The very first step should be raising a loud error in the constructor, to avoid unintended and undefined behaviour afterwards. We should have done that a while ago!

Allowing isel to accept a type like Iterable[tuple[Any, Any]] which would allow repeat names

If you narrow this to Iterable[tuple[Hashable, Hashable]] then it becomes a subtype of Iterable[Hashable], which in theory we have already been attempting to support. However our support for arbitrary Hashable dimension names is IIUC not complete / buggy / raises issues.

Beyond that I think this needs a lot more serious thought as to how hard it would be to support. I wouldn't want to start allowing it on a function-by-function basis until we later realise it doesn't generalize. (Despite your observations about isel being very interesting @ilan-gold !)

I would have lots of questions, for example:

If someone wants to experiment with the above questions, it would make sense to start by doing it in the new NamedArray class, rather than expecting things to work on the composite types DataArray or Dataset.

(Another random thought - if we had a hypothesis-based test suite, we could test this idea via a one-line change to the strategy that generates dimension_names to allow them to repeat...)

TomNicholas commented 1 year ago

I've added a check to raise on repeated dimension names in https://github.com/pydata/xarray/pull/8491. Now if you try @ivirshup 's original example you get a nice clear error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[1], line 5
      2 import numpy as np
      4 sample_idx = xr.IndexVariable("sample_id", ["a", "b", "c"])
----> 5 da = xr.DataArray(np.eye(3), coords=(sample_idx, sample_idx))

File ~/Documents/Work/Code/xarray/xarray/core/dataarray.py:446, in DataArray.__init__(self, data, coords, dims, name, attrs, indexes, fastpath)
    444 data = as_compatible_data(data)
    445 coords, dims = _infer_coords_and_dims(data.shape, coords, dims)
--> 446 variable = Variable(dims, data, attrs, fastpath=True)
    448 if not isinstance(coords, Coordinates):
    449     coords = create_coords_with_default_indexes(coords)

File ~/Documents/Work/Code/xarray/xarray/core/variable.py:365, in Variable.__init__(self, dims, data, attrs, encoding, fastpath)
    338 def __init__(
    339     self,
    340     dims,
   (...)
    344     fastpath=False,
    345 ):
    346     """
    347     Parameters
    348     ----------
   (...)
    363         unrecognized encoding items.
    364     """
--> 365     super().__init__(
    366         dims=dims, data=as_compatible_data(data, fastpath=fastpath), attrs=attrs
    367     )
    369     self._encoding = None
    370     if encoding is not None:

File ~/Documents/Work/Code/xarray/xarray/namedarray/core.py:252, in NamedArray.__init__(self, dims, data, attrs)
    245 def __init__(
    246     self,
    247     dims: _DimsLike,
    248     data: duckarray[Any, _DType_co],
    249     attrs: _AttrsLike = None,
    250 ):
    251     self._data = data
--> 252     self._dims = self._parse_dimensions(dims)
    253     self._attrs = dict(attrs) if attrs else None

File ~/Documents/Work/Code/xarray/xarray/namedarray/core.py:486, in NamedArray._parse_dimensions(self, dims)
    484 if len(set(dims)) < len(dims):
    485     repeated_dims = set([d for d in dims if dims.count(d) > 1])
--> 486     raise ValueError(
    487         f"Repeated dimension names are forbidden, but dimensions {repeated_dims} appear more than once in dims={dims}"
    488     )
    489 return dims

ValueError: Repeated dimension names are forbidden, but dimensions {'sample_id'} appear more than once in dims=('sample_id', 'sample_id')

I suggest we leave this issue open as the place to discuss supporting repeated dimension names.

TomNicholas commented 1 year ago

Final thought: we might also consider simply making it clearer which parts of xarray should and shouldn't work with repeated dimension names.

Looking in the code at the moment repeated dimension names are not forbidden in the constructors, but are explicitly forbidden during broadcasting. This makes a fair amount of sense, but unfortunately this clarity of what is and isn't allowed doesn't currently extend into other parts of the library. Again NamedArray provides a good opportunity to pull some of these questions out.

max-sixty commented 1 year ago

Unless the dimensions are interchangable (e.g. a symmetric matrix), presumably it would be really difficult to support repeated dimension names? We would then need to guarantee order in operations, which is something we explicitly don't do.

I might think we mark that out of scope for the medium term....


FWIW when I have symmetric matrices, I use foo & foo_, which works well!

ilan-gold commented 1 year ago

Does netCDF support repeated dimension names? Does Zarr?

I am personally not sure about netCDF but it does appear that many people are saying there is no restriction that says you can't. As for zarr, it is agnostic to the names of things in general (so I'm not sure what is being referred to here specifically @TomNicholas), but with this extension for v3, this issue was explicitly discussed and the repeated dimension names are allowed. So I would say, broadly speaking, the answer is "yes" to both.

I will also say that this sort of rationale was the reason I thought a slow build-in would actually not be such a bad idea (instead of failing loudly). We aren't "breaking" anything and some stuff actually does work at the moment. Furthermore, xarray can currently read in the netCDF files (as was pointed out in the linked comment) even if the result is not very helpful (see here for another example of where this sort of "broken" behavior occurs).

Can we define broadcasting via dimension name consistently to work with repeated names? How would concatenation work?

I am not sure. I think we definitely could get these to work as expected, but I also think this gets back to the matter of internal API of DataArray vs. external functions I was mentioning earlier. I think from the ground-up, we'd want getting i.e., ds[sel, sel, ...] to work how we want it to before functions like concat or broadcast. If getting doesn't work, there's probably not much hope for other stuff.

I will definitely think on these more, and fiddle around with NamedArray.

Doesn't this suggestion from @ivirshup

[the repeated dimensions] can be disambiguated with positional indexing

break xarray's property of not caring about transpositions?

Could you link to this? I am not sure what you mean.

Are we sure that simply applying reductions over all repeated dimensions works in all cases?

Not sure! I will experiment with NamedArray or perhaps with a modified version of DataArray to find out what works if we were to have a DataArray whose behavior works as expected "internally" for repeated dimensions.

Does this break invariants in apply_ufunc? e.g. what if I have repeated core_dims, or two identically-named dimensions where one is chunked and the other isn't?

Good question! I will say that I think the point of repeated dimensions is that they are "the same" so I can't say exactly how one could be chunked and the other not (but I am still a newbie with xarray).

Is this something that can be explored via generalizing type hints before actually changing code?

I would be very much so in favor of this. Thanks for linking to the relevant work on this.