Open ivirshup opened 4 years ago
Thanks for this @ivirshup , I'm surprised at this too.
The problem seems to be that the DataArray you've managed to create breaks xarray's own data model! There should be one dim for each axis of the wrapped array, but
import xarray as xr
import numpy as np
sample_idx = xr.IndexVariable("sample_id", ["a", "b", "c"])
da = xr.DataArray(np.eye(3), coords=(sample_idx, sample_idx))
print(da)
gives a dataarray object which somehow has only one dim while wrapping a 2D array!
<xarray.DataArray (sample_id: 3)>
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])
Coordinates:
  * sample_id  (sample_id) <U1 'a' 'b' 'c'
Obviously xarray should have thrown you an error before allowing you to create this. It's no wonder the indexing is weird after this point.
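For context, the invariant being violated here is simply that a DataArray must carry one dimension name per axis of the wrapped array, i.e. len(dims) == data.ndim. A minimal standalone sketch of such a check (check_dims_match is a hypothetical helper, not xarray API):

```python
import numpy as np

def check_dims_match(data, dims):
    # xarray's data model: exactly one dimension name per array axis
    if len(dims) != data.ndim:
        raise ValueError(
            f"got {len(dims)} dims for {data.ndim}-dimensional data"
        )

check_dims_match(np.eye(3), ("dim0", "dim1"))  # consistent: passes
```

With this check in place, passing a single repeated name for a 2-D array would fail immediately instead of producing a one-dim DataArray.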
I would have expected to get an array with two dims, which you can do by being more explicit:
da2d = xr.DataArray(np.eye(3), dims=['dim0', 'dim1'], coords=(sample_idx, sample_idx))
print(da2d)
<xarray.DataArray (dim0: 3, dim1: 3)>
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])
Coordinates:
  * dim0     (dim0) <U1 'a' 'b' 'c'
  * dim1     (dim1) <U1 'a' 'b' 'c'
(The coordinates aren't named how you want yet, which is also a problem, but at least this has a number of dimensions consistent with the data it's wrapping.)
Indexing that object behaves more like you (and I) would expect:
da2d.shape
# (3, 3)
da2d[1, :].shape
# (3,)
da2d.loc["a", :].shape
# (3,)
da2d.loc[:, "a"].shape
# (3,)
da2d[:, 1]
<xarray.DataArray (dim0: 3)>
array([0., 1., 0.])
Coordinates:
  * dim0     (dim0) <U1 'a' 'b' 'c'
    dim1     <U1 'b'
It also doesn't fit xarray's data model to have two coordinates along different dimensions with the same name as one another. I suggest that you create two separate coords (i.e. sample_idx0 and sample_idx1), and assign one to each dim. Then you should be able to do what you want without weird behaviour.
(We should also fix DataArray.__init__() so that you can't construct this.)
Dupe of #2226 and #1499. I agree with failing loudly in the constructor.
Why not allow multiple dimensions with the same name? They can be disambiguated with positional indexing for when it matters. I think support for this would be useful for pairwise measures.
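For context on the pairwise-measures motivation: a pairwise computation naturally yields an array whose two axes index the same set of samples, which is exactly where a repeated dim name would arise. A plain-NumPy sketch (the "sample" naming is hypothetical):

```python
import numpy as np

# three 1-D sample values
x = np.array([0.0, 1.0, 3.0])

# pairwise absolute distances: both axes index the same samples,
# so both dims would naturally be called "sample"
dist = np.abs(x[:, None] - x[None, :])
dims = ("sample", "sample")  # what a repeated-dim DataArray would carry
```

The result is a symmetric (3, 3) matrix where both axes mean the same thing, which is the case the current one-name-per-dim model cannot express.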
Here's a fun example of the current buggy behaviour:
import numpy as np
import xarray as xr
from string import ascii_letters
idx1 = xr.IndexVariable("dim1", [f"dim1-{i}" for i in ascii_letters[:10]])
idx2 = xr.IndexVariable("dim2", [f"dim2-{i}" for i in ascii_letters[:5]])
da1 = xr.DataArray(np.random.random_sample((10, 5)), coords=(idx1, idx2))
da2 = xr.DataArray(np.random.random_sample((5, 10)), coords=(idx2, idx1))
da1 @ da2
# <xarray.DataArray ()>
# array(13.06261098)
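The 0-d result above isn't random: DataArray's @ dispatches to xr.dot, which contracts over every shared dimension name, and here both names are shared. Assuming that contraction semantics, the NumPy sketch below shows the difference between positional matmul and a full contraction over both named dims:

```python
import numpy as np

a = np.random.random_sample((10, 5))  # dims ("dim1", "dim2")
b = np.random.random_sample((5, 10))  # dims ("dim2", "dim1")

# positional matmul keeps the outer axes:
outer = a @ b  # shape (10, 10)

# contracting over *both* shared dim names sums everything out;
# this full contraction equals trace(a @ b)
full = np.einsum("ij,ji->", a, b)
```

So the scalar is the (arguably surprising, but deterministic) consequence of name-based contraction when every dim name is shared.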
Why not allow multiple dimensions with the same name? They can be disambiguated with positional indexing for when it matters.
I'm not sure it's that simple... What would you suggest the behaviour of da.isel(dim='ambiguous_dim') or da.mean(dim='ambiguous_dim') be?
This has also come up over in DimensionalData.jl, which I think is going for behavior I like. What I think would happen:

da.isel(dim='ambiguous_dim'): the selection is applied over all dimensions of that name.

da.mean(dim='ambiguous_dim'): since the command is to reduce over dimensions of that name, the reduction is performed over all dimensions with that name.
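Under that proposal, a reduction naming a repeated dim would apply to every matching axis. A minimal NumPy sketch of that semantics (reduce_named is a hypothetical helper, not a real API):

```python
import numpy as np

def reduce_named(data, dims, name, op=np.mean):
    # apply the reduction over *all* axes whose dim name matches
    axes = tuple(i for i, d in enumerate(dims) if d == name)
    if not axes:
        raise KeyError(f"no dimension named {name!r}")
    return op(data, axis=axes)

m = np.arange(12).reshape(2, 2, 3)
out = reduce_named(m, ("ni", "ni", "shh"), "ni")  # reduces axes 0 and 1
```

The result keeps only the "shh" axis, matching the "reduce over all dimensions with that name" reading.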
Ooh, this is a fun one! Came across this issue when we stumbled across a pedantic case writing tests (H/T @brews). I expected this to "fail loudly in the constructor", but it doesn't. Note that, currently, AFAICT you cannot use positional slicing to achieve an intuitive result; the behavior seems more undefined/unpredictable.
# setup
import xarray as xr, pandas as pd, numpy as np
da = xr.DataArray(
    np.arange(8).reshape(2, 2, 2),
    coords=[[0, 1], [0, 1], ['a', 'b']],
    dims=["ni", "ni", "shh"],
)
xarray seems to not know it has a problem:
In [4]: da
Out[4]:
<xarray.DataArray (ni: 2, shh: 2)>
array([[[0, 1],
        [2, 3]],

       [[4, 5],
        [6, 7]]])
Coordinates:
  * ni       (ni) int64 0 1
  * shh      (shh) <U1 'a' 'b'
slicing (somewhat intuitively? slices along both dims):
In [5]: da.sel(ni=0)
Out[5]:
<xarray.DataArray (shh: 2)>
array([0, 1])
Coordinates:
    ni       int64 0
  * shh      (shh) <U1 'a' 'b'
however, positional slicing (and any attempts I've made to handle the repeated dims differently) seems to have undefined behavior:
In [6]: da[0, :, :] # positional slicing along first dim works as expected(?)
Out[6]:
<xarray.DataArray (ni: 2, shh: 2)>
array([[[0, 1],
        [2, 3]],

       [[4, 5],
        [6, 7]]])
Coordinates:
  * ni       (ni) int64 0 1
  * shh      (shh) <U1 'a' 'b'
In [7]: da[:, 0, :] # positional slicing along second dim slices both dims
Out[7]:
<xarray.DataArray (shh: 2)>
array([0, 1])
Coordinates:
    ni       int64 0
  * shh      (shh) <U1 'a' 'b'
FWIW I could definitely see use cases for allowing something like this... I have used cumbersome/ugly workarounds to work with variance-covariance matrices etc. So I'm not weighing in on the "this should raise an error" debate. I got briefly excited when I saw it didn't raise an error, until everything started unraveling 🙃
Are there any thoughts as to how to move forward with this? It seems like a first step might be fixing __getitem__ behavior before moving on. Another option is to place errors everywhere they should be raised and then move forward; but simply changing the behavior (instead of placing errors) does not really seem like a violation of backwards compat, because the current behavior is broken.
In the vein of __getitem__, there's something of an interesting conundrum; from what I can tell from these examples and my own messing around, isel/sel work as intended. The tough part is that indexing via da[sel, sel, ...] does not work, specifically because it calls isel, and this operates on a Mapping type (I am referring broadly to all the isel functions). So it seems to me we could proceed by either:

1. Allowing isel to accept a type like Iterable[tuple[Any, Any]], which would allow repeat names. In other words, we would have something like

def __getitem__(self, key: Any) -> Self:
    if isinstance(key, str):
        return self._getitem_coord(key)
    if self._has_repeated_dim_names:
        # pair each dim with its positional indexer; a Mapping could not
        # hold two entries under the same dim name
        return self.isel(indexers=zip(self.dims, key))
    return self.isel(indexers=self._item_key_to_dict(key))

2. Writing a special case in __getitem__ for this.

At the end of the day, both options will probably involve completely re-writing a function, because isel and the stack it relies on internally operate on dicts of indexers, but in a way that actually means isel works as intended (see e.g. here for an example of this fortunate coincidence). The difference is just the external API. I am in favor of not changing isel's signature.
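One concrete reason a pair-based signature helps: a Mapping physically cannot carry two indexers for the same dim name, because dict keys are unique. A quick sketch:

```python
# a plain dict silently collapses repeated dim names...
pairs = [("ni", 0), ("ni", slice(None)), ("shh", 1)]
as_mapping = dict(pairs)
# ...while the list of (dim, indexer) tuples keeps every entry,
# so repeated dims remain expressible
```

Only the last indexer per name survives the dict conversion, which is why any Mapping-based internal path loses information as soon as dims repeat.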
From the getter methods, we could then move on to "external" functions like concat
. What do we think?
P.S. I have seen the discussion in https://github.com/pydata/xarray/issues/1378 about shared indices/splitting, and I don't see this as mutually exclusive. I think the two could be quite compatible.
Are there any thoughts as to how to move forward with this? It seems like a first step might be fixing __getitem__ behavior before moving on.
The very first step should be raising a loud error in the constructor, to avoid unintended and undefined behaviour afterwards. We should have done that a while ago!
Allowing isel to accept a type like Iterable[tuple[Any, Any]] which would allow repeat names
If you narrow this to Iterable[tuple[Hashable, Hashable]], then it becomes a subtype of Iterable[Hashable], which in theory we have already been attempting to support. However, our support for arbitrary Hashable dimension names is, IIUC, not complete / buggy / raises issues.
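As a sanity check on that subtyping claim: a tuple of hashables is itself hashable, so each (dim, indexer) pair already counts as a Hashable value under the existing annotation:

```python
from collections.abc import Hashable

# each (dim, indexer) pair is a tuple of hashables, hence itself Hashable,
# so Iterable[tuple[Hashable, Hashable]] values also satisfy Iterable[Hashable]
pair = ("dim0", 0)
```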
Beyond that, I think this needs a lot more serious thought as to how hard it would be to support. I wouldn't want to start allowing it on a function-by-function basis until we later realise it doesn't generalize. (Despite your observations about isel being very interesting, @ilan-gold!)
I would have lots of questions, for example:
Doesn't this suggestion from @ivirshup

[the repeated dimensions] can be disambiguated with positional indexing

break xarray's property of not caring about transpositions?

Are we sure that simply applying reductions over all repeated dimensions works in all cases?

Does this break invariants in apply_ufunc? e.g. what if I have repeated core_dims, or two identically-named dimensions where one is chunked and the other isn't?

Is this something that can be explored via generalizing type hints before actually changing code?

If someone wants to experiment with the above questions, it would make sense to start by doing it in the new NamedArray class, rather than expecting things to work on the composite types DataArray or Dataset.
(Another random thought: if we had a hypothesis-based test suite, we could test this idea via a one-line change to the strategy that generates dimension_names to allow them to repeat...)
I've added a check to raise on repeated dimension names in https://github.com/pydata/xarray/pull/8491. Now if you try @ivirshup 's original example you get a nice clear error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[1], line 5
2 import numpy as np
4 sample_idx = xr.IndexVariable("sample_id", ["a", "b", "c"])
----> 5 da = xr.DataArray(np.eye(3), coords=(sample_idx, sample_idx))
File ~/Documents/Work/Code/xarray/xarray/core/dataarray.py:446, in DataArray.__init__(self, data, coords, dims, name, attrs, indexes, fastpath)
444 data = as_compatible_data(data)
445 coords, dims = _infer_coords_and_dims(data.shape, coords, dims)
--> 446 variable = Variable(dims, data, attrs, fastpath=True)
448 if not isinstance(coords, Coordinates):
449 coords = create_coords_with_default_indexes(coords)
File ~/Documents/Work/Code/xarray/xarray/core/variable.py:365, in Variable.__init__(self, dims, data, attrs, encoding, fastpath)
338 def __init__(
339 self,
340 dims,
(...)
344 fastpath=False,
345 ):
346 """
347 Parameters
348 ----------
(...)
363 unrecognized encoding items.
364 """
--> 365 super().__init__(
366 dims=dims, data=as_compatible_data(data, fastpath=fastpath), attrs=attrs
367 )
369 self._encoding = None
370 if encoding is not None:
File ~/Documents/Work/Code/xarray/xarray/namedarray/core.py:252, in NamedArray.__init__(self, dims, data, attrs)
245 def __init__(
246 self,
247 dims: _DimsLike,
248 data: duckarray[Any, _DType_co],
249 attrs: _AttrsLike = None,
250 ):
251 self._data = data
--> 252 self._dims = self._parse_dimensions(dims)
253 self._attrs = dict(attrs) if attrs else None
File ~/Documents/Work/Code/xarray/xarray/namedarray/core.py:486, in NamedArray._parse_dimensions(self, dims)
484 if len(set(dims)) < len(dims):
485 repeated_dims = set([d for d in dims if dims.count(d) > 1])
--> 486 raise ValueError(
487 f"Repeated dimension names are forbidden, but dimensions {repeated_dims} appear more than once in dims={dims}"
488 )
489 return dims
ValueError: Repeated dimension names are forbidden, but dimensions {'sample_id'} appear more than once in dims=('sample_id', 'sample_id')
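The check in the traceback's _parse_dimensions boils down to a few lines; here is a self-contained sketch mirroring that logic (a sketch for illustration, not the actual xarray source):

```python
def parse_dimensions(dims):
    # reject repeated names, mirroring NamedArray._parse_dimensions in #8491
    dims = tuple(dims)
    if len(set(dims)) < len(dims):
        repeated_dims = {d for d in dims if dims.count(d) > 1}
        raise ValueError(
            f"Repeated dimension names are forbidden, but dimensions "
            f"{repeated_dims} appear more than once in dims={dims}"
        )
    return dims
```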
I suggest we leave this issue open as the place to discuss supporting repeated dimension names.
Final thought: we might also consider simply making it clearer which parts of xarray should and shouldn't work with repeated dimension names.
Looking in the code at the moment, repeated dimension names are not forbidden in the constructors, but are explicitly forbidden during broadcasting. This makes a fair amount of sense, but unfortunately this clarity about what is and isn't allowed doesn't currently extend into other parts of the library. Again, NamedArray provides a good opportunity to pull some of these questions out.
Unless the dimensions are interchangeable (e.g. a symmetric matrix), presumably it would be really difficult to support repeated dimension names? We would then need to guarantee order in operations, which is something we explicitly don't do.
I might think we mark that out of scope for the medium term....
FWIW, when I have symmetric matrices, I use foo & foo_, which works well!
Does netCDF support repeated dimension names? Does Zarr?
I am personally not sure about netCDF, but it does appear that many people say there is no restriction forbidding it. As for Zarr, it is agnostic to the names of things in general (so I'm not sure what is being referred to here specifically, @TomNicholas), but in this extension for v3, this issue was explicitly discussed and repeated dimension names are allowed. So I would say, broadly speaking, the answer is "yes" to both.
I will also say that this sort of rationale is why I thought a slow build-in would actually not be such a bad idea (instead of failing loudly): we aren't "breaking" anything, and some stuff actually does work at the moment. Furthermore, xarray can currently read in the netCDF files (as was pointed out in the linked comment), even if the result is not very helpful (see here for another example of where this sort of "broken" behavior occurs).
Can we define broadcasting via dimension name consistently to work with repeated names? How would concatenation work?
I am not sure. I think we definitely could get these to work as expected, but I also think this gets back to the matter of the internal API of DataArray vs. external functions I was mentioning earlier. From the ground up, we'd want getting (i.e., ds[sel, sel, ...]) to work how we want before functions like concat or broadcast. If getting doesn't work, there's probably not much hope for other stuff.
I will definitely think on these more, and fiddle around with NamedArray.
Doesn't this suggestion from @ivirshup
[the repeated dimensions] can be disambiguated with positional indexing
break xarray's property of not caring about transpositions?
Could you link to this? I am not sure what you mean.
Are we sure that simply applying reductions over all repeated dimensions works in all cases?
Not sure! I will experiment with NamedArray, or perhaps with a modified version of DataArray, to find out what works if we were to have a DataArray whose behavior works as expected "internally" for repeated dimensions.
Does this break invariants in apply_ufunc? e.g. what if I have repeated core_dims, or two identically-named dimensions where one is chunked and the other isn't?
Good question! I will say that I think the point of repeated dimensions is that they are "the same", so I can't say exactly how one could be chunked and the other not (but I am still a newbie with xarray).
Is this something that can be explored via generalizing type hints before actually changing code?
I would very much be in favor of this. Thanks for linking to the relevant work on this.
Problem Description
When coordinates are shared between dimensions (as would happen if a pairwise measurement is taken) indexing behaves strangely. It looks like indexing into the initial indices doesn't do anything, while indexing into the last index applies the selection across all dimensions.