Scalar non-dimension coords forget their heritage

pydata / xarray

N-D labeled arrays and datasets in Python

https://xarray.dev

Apache License 2.0

3.65k stars 1.09k forks source link

Scalar non-dimension coords forget their heritage #4501

Open chrisroat opened 4 years ago

chrisroat commented 4 years ago

I'm not sure if this is a bug or a feature, to be honest. I noted that expanded scalar coordinates would retain their scalar values, and so thought that non-dimension coordinates might keep their references to each other and do the same.

What happened:

When a dimension is squeezed or selected with a scalar, the associated non-dimension coordinates are unassociated from the dimension. When the dimension is expanded later, the previously associated non-dimension coordinates are not expanded.

What you expected to happen:

The previously associated non-dimension coordinates should be expanded.

Minimal Complete Verifiable Example:

import numpy as np
import xarray as xr

arr1 = xr.DataArray(np.zeros((1,5)), dims=['y', 'x'], coords={'e': ('y', [10])})
arr2 = arr1.squeeze('y').expand_dims('y')

xr.testing.assert_identical(arr1, arr2)

Error:

AssertionError: Left and right DataArray objects are not identical

Differing coordinates:
L   e        (y) int64 10
R   e        int64 10

Taken another way, I would desire these statements to be possible:

xr.DataArray(np.zeros(5), dims=['x'], coords={'y': 0, 'a': ('y', 1)})
xr.DataArray(np.zeros((0, 5)), dims=['y', 'x'], coords={'e': ('y', 10)})

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 3.8.5 (default, Jul 28 2020, 12:59:40) [GCC 9.3.0] python-bits: 64 OS: Linux OS-release: 5.4.0-48-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 libhdf5: 1.10.4 libnetcdf: None xarray: 0.16.1 pandas: 1.1.3 numpy: 1.19.2 scipy: 1.5.2 netCDF4: None pydap: None h5netcdf: None h5py: 2.10.0 Nio: None zarr: 2.5.0 cftime: None nc_time_axis: None PseudoNetCDF: None rasterio: None cfgrib: None iris: None bottleneck: None dask: 2.30.0 distributed: 2.30.0 matplotlib: 3.3.2 cartopy: None seaborn: 0.11.0 numbagg: None pint: None setuptools: 50.3.0 pip: 20.2.3 conda: None pytest: 6.1.1 IPython: 7.18.1 sphinx: 3.2.1

shoyer commented 4 years ago

Hi @chrisroat, thanks for the clear bug report!

It indeed be nice if squeeze followed by expand_dims preserved the original inputs, but I don't think that is possible in general -- the squeeze operation removes information.

For example, this array does currently satisfy your desired property, but wouldn't if we made the change you request:

arr1 = xr.DataArray(np.zeros((1,5)), dims=['y', 'x'], coords={'e': 10})
arr2 = arr1.squeeze('y').expand_dims('y')
xr.testing.assert_identical(arr1, arr2)  # passes

I suspect our best option for achieving this behavior would be to add another optional argument to expand_dims, e.g., perhaps

expand_dims(..., expand_coords=False): don't expand coordinates (default behavior)
expand_dims(..., expand_coords=True): expand all coordinates
expand_dims(..., expand_coords=['e']): only expand the coordinate 'e'

chrisroat commented 4 years ago

I'm not a huge fan of adding arguments for a case that rarely comes up (I presume).

One difference in your example is that the 'e' coord is never based on 'y', so I would not want it expanded -- so I'd still like that test to pass.

The case I'm interested in is where the non-dimension coords are based on existing dimension coords that gets squeezed.

So in this example:

import numpy as np
import xarray as xr

arr1 = xr.DataArray(np.zeros((1,5)), dims=['y', 'x'], coords={'y': [42], 'e': ('y', [10])})
arr1.squeeze()

The squeezed array looks like:

<xarray.DataArray (x: 5)>
array([0., 0., 0., 0., 0.])
Coordinates:
    y        int64 42
    e        int64 10
Dimensions without coordinates: x

What I think would be more useful:

<xarray.DataArray (x: 5)>
array([0., 0., 0., 0., 0.])
Coordinates:
    y        int64 42
    e        (y)  int64 10     <---- Note the (y)
Dimensions without coordinates: x

shoyer commented 4 years ago

What I think would be more useful:

<xarray.DataArray (x: 5)>
array([0., 0., 0., 0., 0.])
Coordinates:
    y        int64 42
    e        (y)  int64 10     <---- Note the (y)
Dimensions without coordinates: x

Thanks for clarifying!

One problem with this -- at least for now -- is that xarray currently doesn't allow coordinates on DataArray objects to have dimensions that don't appear on the DataArray itself.

It might also be surprising that this would make squeeze('y') inconsistent with isel(y=0)

chrisroat commented 4 years ago

One problem with this -- at least for now -- is that xarray currently doesn't allow coordinates on DataArray objects to have dimensions that don't appear on the DataArray itself.

Ah, then that would be the desire here.

It might also be surprising that this would make squeeze('y') inconsistent with isel(y=0)

The suggestion here is that both of these would behave the same. The MCVE was just for the squeeze case, but I expect that isel and sel would both allow non-dim coords to maintain the reference to their original dim (even if it becomes a non-dim coord itself).

shoyer commented 4 years ago

If we changed isel() to only modify data variables, then we would be in trouble with something like isel(x=slice(3)). The coordinate system would be inconsistent if we slice only the data variables but not the coordinates.

There's a bit of a conflict here between two desirable properties:

Methods on a Dataset only modify data variables, leaving coordinates unchanged
Methods on a Dataset keep the entire coordinate system for the Dataset consistent, including coordinates

chrisroat commented 4 years ago

My mental model of what's happening may not be correct. I did want sel(), isel(), and squeeze() to all operate the same way (and maybe someday even work on non-dim coordinates!). Replacing squeeze() with isel() in my initial example gives the same failure, which I would want it to work:

import numpy as np
import xarray as xr

arr1 = xr.DataArray(np.zeros((1,5)), dims=['y', 'x'], coords={'e': ('y', [10])})
arr2 = arr1.isel(y=0).expand_dims('y')
xr.testing.assert_identical(arr1, arr2)

AssertionError: Left and right DataArray objects are not identical

Differing coordinates:
L   e        (y) int64 10
R   e        int64 10

The non-dim coordinate e has forgotten that it was associated with y. I'd prefer that this association remained.

Where it gets really interesting is in the following example where the non-dim coordinate moves from one dim to another. I understand the logic here (since the isel() were done in a way that correlates 'y' and 'z'). In my proposal, this would not happen without explicit user intervention -- which may actually be desired here (it's sorta surprising):

import numpy as np
import xarray as xr

arr = xr.DataArray(np.zeros((2, 2, 5)), dims=['z', 'y', 'x'], coords={'e': ('y', [10, 20])})
print(arr.coords)
print()

arr0 = arr.isel(z=0,y=0)
arr1 = arr.isel(z=1,y=1)

arr_concat = xr.concat([arr0, arr1], 'z')
print(arr_concat.coords)

Coordinates:
    e        (y) int64 10 20

Coordinates:
    e        (z) int64 10 20

itcarroll commented 1 year ago

How does the result of arr1.squeeze() know that it has e as a scalar coordinate? Whatever that mechanism, shouldn't the act of squeezing propagate down to arr1["e"] so that it too knows it has a scalar coordinate y?

The representation of arr1.squeeze("y") that OP thought would be more useful captures the idea (but needs some tweaking):

<xarray.DataArray (x: 5)>
array([0., 0., 0., 0., 0.])
Coordinates:
    y        int64 42
    e        (y)  int64 10     <---- Note the (y)
Dimensions without coordinates: x

The (y) in there would only be right if y were still a length-one coordinate, which it isn't. I guess I would expect arr1.squeeze("y")["e"] to become

<xarray.DataArray 'e' ()>
array(10)
Coordinates:
    e        int64 10
    y        int64 42

Whoa. I just realized arr1["e"]["e"]["e"]["e"]... is valid. It's turtles all the way down.