pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0
3.55k stars 1.07k forks source link

assign_coords' behavior depends on input DataArrays #8180

Open yichechang opened 12 months ago

yichechang commented 12 months ago

What happened?

I'm trying to compute masks (from DataArray's data itself) and assign them as coordinates, but it appears that depending on the combination of coords/dims of the computed masks, sometimes .assign_coords will fail.

It seems like

It's a bit hard to describe as I don't know the xarray internal itself, but my self-contained minimal example below should demonstrate the issue much clearer.

What did you expect to happen?

No response

Minimal Complete Verifiable Example

import xarray as xr

data = xr.DataArray(
    data=[
        [0, 1, 2], 
        [0, 1, 2]
    ], 
    coords={
        'd1': ['m', 'n'], 
        'd2': ['a', 'b', 'c']
    }
)

# this will fail:
data.assign_coords({'mask_d1_m': data.sel(d1='m')==0})
# ValueError: dimension 'd1' already exists as a scalar variable

# this will fail too:
data.assign_coords({'mask_d1_n': data.sel(d1='n')==0})
# ValueError: dimension 'd1' already exists as a scalar variable

# but this will work:
data.assign_coords(
    {
        'mask_d1_m': data.sel(d1='m')==0,
        'mask_d1_n': data.sel(d1='n')==0
    }
)
# <xarray.DataArray (d1: 2, d2: 3)>
# array([[0, 1, 2],
#        [0, 1, 2]])
# Coordinates:
#   * d1         (d1) <U1 'm' 'n'
#   * d2         (d2) <U1 'a' 'b' 'c'
#     mask_d1_m  (d2) bool True False False
#     mask_d1_n  (d2) bool True False False

MVCE confirmation

Relevant log output

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[27], line 1
----> 1 data.assign_coords({'mask_d1_n': data.sel(d1='n')==0})

File ~/mambaforge/envs/quickquant/lib/python3.8/site-packages/xarray/core/common.py:615, in DataWithCoords.assign_coords(self, coords, **coords_kwargs)
    613 data = self.copy(deep=False)
    614 results: dict[Hashable, Any] = self._calc_assign_results(coords_combined)
--> 615 data.coords.update(results)
    616 return data

File ~/mambaforge/envs/quickquant/lib/python3.8/site-packages/xarray/core/coordinates.py:177, in Coordinates.update(self, other)
    173 self._maybe_drop_multiindex_coords(set(other_vars))
    174 coords, indexes = merge_coords(
    175     [self.variables, other_vars], priority_arg=1, indexes=self.xindexes
    176 )
--> 177 self._update_coords(coords, indexes)

File ~/mambaforge/envs/quickquant/lib/python3.8/site-packages/xarray/core/coordinates.py:393, in DataArrayCoordinates._update_coords(self, coords, indexes)
    391 coords_plus_data = coords.copy()
    392 coords_plus_data[_THIS_ARRAY] = self._data.variable
--> 393 dims = calculate_dimensions(coords_plus_data)
    394 if not set(dims) <= set(self.dims):
    395     raise ValueError(
    396         "cannot add coordinates with new dimensions to a DataArray"
    397     )

File ~/mambaforge/envs/quickquant/lib/python3.8/site-packages/xarray/core/variable.py:3209, in calculate_dimensions(variables)
   3207 for dim, size in zip(var.dims, var.shape):
   3208     if dim in scalar_vars:
-> 3209         raise ValueError(
   3210             f"dimension {dim!r} already exists as a scalar variable"
   3211         )
   3212     if dim not in dims:
   3213         dims[dim] = size

ValueError: dimension 'd1' already exists as a scalar variable

Anything else we need to know?

No response

Environment

~/mambaforge/envs/quickquant/lib/python3.8/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils. warnings.warn("Setuptools is replacing distutils.") INSTALLED VERSIONS ------------------ commit: None python: 3.8.17 | packaged by conda-forge | (default, Jun 16 2023, 07:11:32) [Clang 14.0.6 ] python-bits: 64 OS: Darwin OS-release: 22.6.0 machine: arm64 processor: arm byteorder: little LC_ALL: en_US.UTF-8 LANG: None LOCALE: ('en_US', 'UTF-8') libhdf5: None libnetcdf: None xarray: 2023.1.0 pandas: 1.5.3 numpy: 1.24.0 scipy: 1.10.1 netCDF4: None pydap: None h5netcdf: None h5py: None Nio: None zarr: 2.15.0 cftime: None nc_time_axis: None PseudoNetCDF: None rasterio: None cfgrib: None iris: None bottleneck: None dask: 2023.5.0 distributed: 2023.5.0 matplotlib: 3.7.2 cartopy: None seaborn: 0.12.2 numbagg: None fsspec: 2023.9.0 cupy: None pint: 0.21 sparse: None flox: None numpy_groupies: None setuptools: 68.1.2 pip: 23.2.1 conda: 23.7.3 pytest: 7.4.1 mypy: None IPython: 8.12.2 sphinx: 4.5.0
welcome[bot] commented 12 months ago

Thanks for opening your first issue here at xarray! Be sure to follow the issue template! If you have an idea for a solution, we would really welcome a Pull Request with proposed changes. See the Contributing Guide for more. It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better. Thank you!

max-sixty commented 12 months ago

Thanks for the issue. What would you expect the output to be?

It does seem surprising that passing two arguments succeeds while passing each of them alone succeeds...

A partial look — adding .drop_vars('d1') makes it succeed, because the argument has a d1 dimension.

data.sel(d1='m')==0

<xarray.DataArray (d2: 3)>
array([ True, False, False])
Coordinates:
    d1       <U1 'm'
  * d2       (d2) <U1 'a' 'b' 'c'
[nav] In [36]: data.assign_coords(
          ...:     {
          ...:         'mask_d1_m': (data.sel(d1='n')==0).drop_vars('d1')
          ...: #        'mask_d1_n': data.sel(d1='n')==0
          ...:     }
          ...: )

...though I'm not sure why it succeeds when there are two arguments.

yichechang commented 12 months ago

Hi - Thanks for the reply and testing!

I guess I would expect that all my examples would all fail, or all succeed.

Sorry I didn't include what I would expect... because I didn't really know what to expect since I haven't thought about why it failed. (But it seems like it should fail, as now it became clearer to me that my mask_d1_m should not have dimension d1?). From the error message, I would guess that doing .drop_vars('d1') would fix the problem (which is true as you demonstrated above!), but I couldn't wrap my head around why a DataArray of dim ('d1', 'd2') cannot be passed as coordinates for DataArray of dim ('d1', 'd2')?

Edit: Also just wanted to point out that it's not just having two arguments that prevents it from failing, it has to be that d1 dim has both d1='m' and d1='n'.

import xarray as xr

data = xr.DataArray(
    data=[
        [0, 1, 2], 
        [0, 1, 2]
    ], 
    coords={
        'd1': ['m', 'n'], 
        'd2': ['a', 'b', 'c']
    }
)

# so this will FAIL
data.assign_coords(
    {
        'mask_d1_m': data.sel(d1='m')==0,
#                                ^^^
        'mask_d1_n': data.sel(d1='m')==1,
#                                ^^^
    }
)

# but this will SUCCEED
data.assign_coords(
    {
        'mask_d1_m': data.sel(d1='m')==0,
#                                ^^^
        'mask_d1_n': data.sel(d1='n')==1,
#                                ^^^
    }
)
yichechang commented 11 months ago

Just wanted to give an update: For my own application (computing masks for a DataArray using its own data, and then attach resulting masks to the DataArray as its new coords), I should make sure my masks contain only labels for its dimensions, in the first place. So, something like

# compute mask based on my data
# mask = data.sel(d1=...) > whatever

# remove labels for extra dimensions
mask = mask.drop_vars([v for v in mask.coords if v not in mask.dims]

# assign mask as coords for the original DataArray
data = data.assign_coords({'my_mask': mask})

But the inconsistent behaviors still seem like a bug, or there's some magic happening during the process of combining multiple coords assignments, that are not explicitly documented. It is not too much of an issue, as

  1. the error messages kind of give enough hint on how to do things correctly, and
  2. it's more so just a surprise that I can quickly tell myself that "okay, I'm doing something wrong, but there was some magic that hide this when I assign coords in a certain way. fix it and move on."

Therefore, I'll leave this issue open but please feel free to close it if this isn't something to be fixed. From my point of view, the inconsistency is surprising, but not a major issue. Thank you for taking the time testing!

keewis commented 11 months ago

as far as I can tell, the cause for the surprising/inconsistent behavior is that xr.core.coordinates.create_coords_with_default_indexes (or xr.core.merge.merge_coordinates_without_align, I didn't step through create_coords_with_default_indexes) drops scalar coordinates with conflicting values. cc @benbovy

In your case, I think the easiest way to work around this is to use .sel(..., drop=True) which will not keep the scalar coordinate.

yichechang commented 11 months ago

Hi @keewis - Thank you for the explanation!

Yeah, I was digging into the codebase a little bit, but unfortunately -- as it is probably evident that I'm not that familiar with xarray's internals -- I was a bit lost. Not that I completely understand now, but I am grateful for an explanation so I know I'm not crazy 🤣

Also thank you for the recommended route with the drop kwarg in DataArray.sel method. This is more succinct and can convey my intention more explicitly. Good to learn how to do things correctly... I think I need to sit down and re-learn xarray. I've been using it for a long time, but usually I can get it to do what I want while knowing there must be something I'm still missing.

Really appreciate all of your help : )

benbovy commented 11 months ago

Since we are relaxing the constraints that are related to dimension coordinates (e.g., #7989), I'm wondering if we couldn't also relax the case where a scalar coordinate has the same name as a dimension.

I don't think that this would help much here, though. Using .assign_coords with a dictionary of DataArray objects extracts all their coordinates and tries to merge them with the current Dataset or DataArray. In many cases this is good but sometimes this could have unwanted side effects. Using drop may help, or alternatively we can ignore all the coordinates (and keep the dimension names) by extracting the variable from the DataArray:

data.assign_coords({'mask_d1_m': (data.sel(d1='m')==0).variable})