Open chrisroat opened 4 years ago
Hi @chrisroat, thanks for the clear bug report!
It indeed be nice if squeeze
followed by expand_dims
preserved the original inputs, but I don't think that is possible in general -- the squeeze
operation removes information.
For example, this array does currently satisfy your desired property, but wouldn't if we made the change you request:
arr1 = xr.DataArray(np.zeros((1,5)), dims=['y', 'x'], coords={'e': 10})
arr2 = arr1.squeeze('y').expand_dims('y')
xr.testing.assert_identical(arr1, arr2) # passes
I suspect our best option for achieving this behavior would be to add another optional argument to expand_dims
, e.g., perhaps
expand_dims(..., expand_coords=False)
: don't expand coordinates (default behavior)expand_dims(..., expand_coords=True)
: expand all coordinatesexpand_dims(..., expand_coords=['e'])
: only expand the coordinate 'e'
I'm not a huge fan of adding arguments for a case that rarely comes up (I presume).
One difference in your example is that the 'e' coord is never based on 'y', so I would not want it expanded -- so I'd still like that test to pass.
The case I'm interested in is where the non-dimension coords are based on existing dimension coords that gets squeezed.
So in this example:
import numpy as np
import xarray as xr
arr1 = xr.DataArray(np.zeros((1,5)), dims=['y', 'x'], coords={'y': [42], 'e': ('y', [10])})
arr1.squeeze()
The squeezed array looks like:
<xarray.DataArray (x: 5)>
array([0., 0., 0., 0., 0.])
Coordinates:
y int64 42
e int64 10
Dimensions without coordinates: x
What I think would be more useful:
<xarray.DataArray (x: 5)>
array([0., 0., 0., 0., 0.])
Coordinates:
y int64 42
e (y) int64 10 <---- Note the (y)
Dimensions without coordinates: x
What I think would be more useful:
<xarray.DataArray (x: 5)> array([0., 0., 0., 0., 0.]) Coordinates: y int64 42 e (y) int64 10 <---- Note the (y) Dimensions without coordinates: x
Thanks for clarifying!
One problem with this -- at least for now -- is that xarray currently doesn't allow coordinates on DataArray objects to have dimensions that don't appear on the DataArray itself.
It might also be surprising that this would make squeeze('y')
inconsistent with isel(y=0)
One problem with this -- at least for now -- is that xarray currently doesn't allow coordinates on DataArray objects to have dimensions that don't appear on the DataArray itself.
Ah, then that would be the desire here.
It might also be surprising that this would make
squeeze('y')
inconsistent withisel(y=0)
The suggestion here is that both of these would behave the same. The MCVE was just for the squeeze case, but I expect that isel
and sel
would both allow non-dim coords to maintain the reference to their original dim (even if it becomes a non-dim coord itself).
If we changed isel()
to only modify data variables, then we would be in trouble with something like isel(x=slice(3))
. The coordinate system would be inconsistent if we slice only the data variables but not the coordinates.
There's a bit of a conflict here between two desirable properties:
My mental model of what's happening may not be correct. I did want sel()
, isel()
, and squeeze()
to all operate the same way (and maybe someday even work on non-dim coordinates!). Replacing squeeze()
with isel()
in my initial example gives the same failure, which I would want it to work:
import numpy as np
import xarray as xr
arr1 = xr.DataArray(np.zeros((1,5)), dims=['y', 'x'], coords={'e': ('y', [10])})
arr2 = arr1.isel(y=0).expand_dims('y')
xr.testing.assert_identical(arr1, arr2)
AssertionError: Left and right DataArray objects are not identical
Differing coordinates:
L e (y) int64 10
R e int64 10
The non-dim coordinate e
has forgotten that it was associated with y
. I'd prefer that this association remained.
Where it gets really interesting is in the following example where the non-dim coordinate moves from one dim to another. I understand the logic here (since the isel()
were done in a way that correlates 'y' and 'z'). In my proposal, this would not happen without explicit user intervention -- which may actually be desired here (it's sorta surprising):
import numpy as np
import xarray as xr
arr = xr.DataArray(np.zeros((2, 2, 5)), dims=['z', 'y', 'x'], coords={'e': ('y', [10, 20])})
print(arr.coords)
print()
arr0 = arr.isel(z=0,y=0)
arr1 = arr.isel(z=1,y=1)
arr_concat = xr.concat([arr0, arr1], 'z')
print(arr_concat.coords)
Coordinates:
e (y) int64 10 20
Coordinates:
e (z) int64 10 20
How does the result of arr1.squeeze()
know that it has e
as a scalar coordinate? Whatever that mechanism, shouldn't the act of squeezing propagate down to arr1["e"]
so that it too knows it has a scalar coordinate y
?
The representation of arr1.squeeze("y")
that OP thought would be more useful captures the idea (but needs some tweaking):
<xarray.DataArray (x: 5)>
array([0., 0., 0., 0., 0.])
Coordinates:
y int64 42
e (y) int64 10 <---- Note the (y)
Dimensions without coordinates: x
The (y)
in there would only be right if y
were still a length-one coordinate, which it isn't. I guess I would expect arr1.squeeze("y")["e"]
to become
<xarray.DataArray 'e' ()>
array(10)
Coordinates:
e int64 10
y int64 42
Whoa. I just realized arr1["e"]["e"]["e"]["e"]...
is valid. It's turtles all the way down.
I'm not sure if this is a bug or a feature, to be honest. I noted that expanded scalar coordinates would retain their scalar values, and so thought that non-dimension coordinates might keep their references to each other and do the same.
What happened:
When a dimension is squeezed or selected with a scalar, the associated non-dimension coordinates are unassociated from the dimension. When the dimension is expanded later, the previously associated non-dimension coordinates are not expanded.
What you expected to happen:
The previously associated non-dimension coordinates should be expanded.
Minimal Complete Verifiable Example:
Error:
Taken another way, I would desire these statements to be possible:
Environment:
Output of xr.show_versions()
INSTALLED VERSIONS ------------------ commit: None python: 3.8.5 (default, Jul 28 2020, 12:59:40) [GCC 9.3.0] python-bits: 64 OS: Linux OS-release: 5.4.0-48-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 libhdf5: 1.10.4 libnetcdf: None xarray: 0.16.1 pandas: 1.1.3 numpy: 1.19.2 scipy: 1.5.2 netCDF4: None pydap: None h5netcdf: None h5py: 2.10.0 Nio: None zarr: 2.5.0 cftime: None nc_time_axis: None PseudoNetCDF: None rasterio: None cfgrib: None iris: None bottleneck: None dask: 2.30.0 distributed: 2.30.0 matplotlib: 3.3.2 cartopy: None seaborn: 0.11.0 numbagg: None pint: None setuptools: 50.3.0 pip: 20.2.3 conda: None pytest: 6.1.1 IPython: 7.18.1 sphinx: 3.2.1