pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

unstack confusing re `Variable` / `IndexVariable` #9190

Closed brianpm closed 3 months ago

brianpm commented 5 months ago

What happened?

Using unstack on a DataArray generated with the .dt.daysinmonth accessor, with time as a MultiIndex, fails with a ValueError. The mysterious part is that when I build an "identical" DataArray starting from the .data of that same array, it works as expected (see output of example code).

I asked a colleague for help with this, and she said the attached code worked with older versions of xarray but seems to have been broken since 2023.5.0.

What did you expect to happen?

Expected to get a DataArray (days0) with dimensions ('year', 'month') with sizes (2, 12), which is what I get with the alternate DataArray (called days).

Minimal Complete Verifiable Example

import sys
print(f"python {sys.version}")
import xarray as xr
import numpy as np
import cftime
print(f"numpy: {np.__version__}, xarray: {xr.__version__}, cftime: {cftime.__version__}")
t = np.array([cftime.DatetimeGregorian(1979, 1, 1, 0, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(1979, 2, 1, 0, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(1979, 3, 1, 0, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(1979, 4, 1, 0, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(1979, 5, 1, 0, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(1979, 6, 1, 0, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(1979, 7, 1, 0, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(1979, 8, 1, 0, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(1979, 9, 1, 0, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(1979, 10, 1, 0, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(1979, 11, 1, 0, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(1979, 12, 1, 0, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(1980, 1, 1, 0, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(1980, 2, 1, 0, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(1980, 3, 1, 0, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(1980, 4, 1, 0, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(1980, 5, 1, 0, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(1980, 6, 1, 0, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(1980, 7, 1, 0, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(1980, 8, 1, 0, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(1980, 9, 1, 0, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(1980, 10, 1, 0, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(1980, 11, 1, 0, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(1980, 12, 1, 0, 0, 0, 0, has_year_zero=False)])
dss = xr.DataArray(t, dims=['time'], coords={"time":t})

# TWO VERSIONS OF "days":
days0 = dss['time'].dt.daysinmonth

days = xr.DataArray(dss['time'].dt.daysinmonth.data, dims=['time'], coords={'time':dss['time']}, attrs=days0.attrs, name='days_in_month')

print(f"IDENTICAL: {days.identical(days0)}")

year = dss['time'].dt.year.data
month = dss['time'].dt.month.data

# REPEAT SAME STEPS FOR days and days0:
days = days.assign_coords(year=("time", year), month=("time", month))
days = days.set_index(time=['year', 'month'])

days0 = days0.assign_coords(year=("time", year), month=("time", month))
days0 = days0.set_index(time=['year', 'month'])

print(f"IDENTICAL: {days.identical(days0)}")

days = days.unstack('time') # THIS WORKS
print(f"{days.dims = }")
#
days0 = days0.unstack('time') # THIS FAILS
print(f"{days0.dims = }")

Relevant log output

python 3.12.0 | packaged by conda-forge | (main, Oct  3 2023, 08:36:57) [Clang 15.0.7 ]
numpy: 1.26.4, xarray: 2024.5.0, cftime: 1.6.3
IDENTICAL: True
IDENTICAL: True
days.dims = ('year', 'month')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[3], line 55
     53 print(f"{days.dims = }")
     54 #
---> 55 days0 = days0.unstack('time') # THIS FAILS
     56 print(f"{days0.dims = }")

File ~/opt/miniconda3/envs/p12/lib/python3.12/site-packages/xarray/util/deprecation_helpers.py:115, in _deprecate_positional_args.<locals>._decorator.<locals>.inner(*args, **kwargs)
    111     kwargs.update({name: arg for name, arg in zip_args})
    113     return func(*args[:-n_extra_args], **kwargs)
--> 115 return func(*args, **kwargs)

File ~/opt/miniconda3/envs/p12/lib/python3.12/site-packages/xarray/core/dataarray.py:2950, in DataArray.unstack(self, dim, fill_value, sparse)
   2888 @_deprecate_positional_args("v2023.10.0")
   2889 def unstack(
   2890     self,
   (...)
   2894     sparse: bool = False,
   2895 ) -> Self:
   2896     """
   2897     Unstack existing dimensions corresponding to MultiIndexes into
   2898     multiple new dimensions.
   (...)
   2948     DataArray.stack
   2949     """
-> 2950     ds = self._to_temp_dataset().unstack(dim, fill_value=fill_value, sparse=sparse)
   2951     return self._from_temp_dataset(ds)

File ~/opt/miniconda3/envs/p12/lib/python3.12/site-packages/xarray/util/deprecation_helpers.py:115, in _deprecate_positional_args.<locals>._decorator.<locals>.inner(*args, **kwargs)
    111     kwargs.update({name: arg for name, arg in zip_args})
    113     return func(*args[:-n_extra_args], **kwargs)
--> 115 return func(*args, **kwargs)

File ~/opt/miniconda3/envs/p12/lib/python3.12/site-packages/xarray/core/dataset.py:5663, in Dataset.unstack(self, dim, fill_value, sparse)
   5659         result = result._unstack_full_reindex(
   5660             d, stacked_indexes[d], fill_value, sparse
   5661         )
   5662     else:
-> 5663         result = result._unstack_once(d, stacked_indexes[d], fill_value, sparse)
   5664 return result

File ~/opt/miniconda3/envs/p12/lib/python3.12/site-packages/xarray/core/dataset.py:5496, in Dataset._unstack_once(self, dim, index_and_vars, fill_value, sparse)
   5493     else:
   5494         fill_value_ = fill_value
-> 5496     variables[name] = var._unstack_once(
   5497         index=clean_index,
   5498         dim=dim,
   5499         fill_value=fill_value_,
   5500         sparse=sparse,
   5501     )
   5502 else:
   5503     variables[name] = var

File ~/opt/miniconda3/envs/p12/lib/python3.12/site-packages/xarray/core/variable.py:1552, in Variable._unstack_once(self, index, dim, fill_value, sparse)
   1547     # Indexer is a list of lists of locations. Each list is the locations
   1548     # on the new dimension. This is robust to the data being sparse; in that
   1549     # case the destinations will be NaN / zero.
   1550     data[(..., *indexer)] = reordered
-> 1552 return self._replace(dims=new_dims, data=data)

File ~/opt/miniconda3/envs/p12/lib/python3.12/site-packages/xarray/core/variable.py:957, in Variable._replace(self, dims, data, attrs, encoding)
    955 if encoding is _default:
    956     encoding = copy.copy(self._encoding)
--> 957 return type(self)(dims, data, attrs, encoding, fastpath=True)

File ~/opt/miniconda3/envs/p12/lib/python3.12/site-packages/xarray/core/variable.py:2625, in IndexVariable.__init__(self, dims, data, attrs, encoding, fastpath)
   2623 super().__init__(dims, data, attrs, encoding, fastpath)
   2624 if self.ndim != 1:
-> 2625     raise ValueError(f"{type(self).__name__} objects must be 1-dimensional")
   2627 # Unlike in Variable, always eagerly load values into memory
   2628 if not isinstance(self._data, PandasIndexingAdapter):

ValueError: IndexVariable objects must be 1-dimensional

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.12.0 | packaged by conda-forge | (main, Oct 3 2023, 08:36:57) [Clang 15.0.7 ]
python-bits: 64
OS: Darwin
OS-release: 23.5.0
machine: arm64
processor: arm
byteorder: little
LC_ALL: None
LANG: None
LOCALE: (None, 'UTF-8')
libhdf5: 1.14.3
libnetcdf: 4.9.2

xarray: 2024.5.0
pandas: 2.2.2
numpy: 1.26.4
scipy: 1.13.0
netCDF4: 1.6.5
pydap: None
h5netcdf: 1.3.0
h5py: 3.11.0
zarr: None
cftime: 1.6.3
nc_time_axis: 1.4.1
iris: None
bottleneck: 1.3.8
dask: 2024.5.0
distributed: 2024.5.0
matplotlib: 3.8.4
cartopy: 0.23.0
seaborn: None
numbagg: None
fsspec: 2024.5.0
cupy: None
pint: 0.24.1
sparse: 0.15.1
flox: None
numpy_groupies: None
setuptools: 69.5.1
pip: 24.0
conda: None
pytest: None
mypy: None
IPython: 8.24.0
sphinx: None

max-sixty commented 5 months ago

Can we strip off much more from the example?

I see days and days0 are quite different; can we make them any more similar and still see the failure? Does it require using cftime?

spencerkclark commented 5 months ago

This is maybe a more minimal example—it does not require cftime or times in general:

source = xr.DataArray(range(2), dims=["x"], coords=[["a", "b"]])
da = source.x
da = da.assign_coords(y=("x", ["c", "d"]), z=("x", ["e", "f"]))
da = da.set_index(x=["y", "z"])
da.unstack("x")

I think the issue relates to the fact that da.variable is an IndexVariable instead of a Variable. I'd have to do more digging to see if there was a time that this worked.

The v2023.5.0 breakpoint in the original example is maybe a bit of a red herring in that it appears that dss.time.dt.daysinmonth switched from returning a Variable-backed DataArray to an IndexVariable-backed DataArray at that time.

brianpm commented 5 months ago

Thanks @spencerkclark -- that's a better minimal example and diagnosis. I couldn't figure out how to tell the difference between days and days0.
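For anyone else trying to tell two such arrays apart, here is a minimal sketch based on the smaller example above. It inspects the type of each array's underlying `.variable`. The names `from_coord` and `rebuilt` are made up for illustration; `rebuilt` mirrors how `days` was constructed in the original report.

```python
import xarray as xr

source = xr.DataArray(range(2), dims=["x"], coords=[["a", "b"]])

# Taking the coordinate directly keeps its index-backed variable.
from_coord = source.x

# Rebuilding from the raw values (as "days" does in the original report)
# wraps them in a plain Variable instead.
rebuilt = xr.DataArray(
    from_coord.data, dims=["x"], coords={"x": source.x}, name="x"
)

# The two arrays differ in the class of their underlying variable.
print(type(from_coord.variable).__name__)
print(type(rebuilt.variable).__name__)

# identical() compares names, values, coordinates and attributes,
# not the variable class.
print(rebuilt.identical(from_coord))
```

As the report's output suggests, `identical` does not look at the variable class, so checking `type(da.variable)` is the more direct test here.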

kafitzgerald commented 5 months ago

Just to confirm, I went through and tested this out quickly and Spencer's example does indeed fail in an older version as well.

It looks like, in the particular case Brian presented, dss.time was an IndexVariable, and in older (pre-v2023.5.0) versions dt.daysinmonth returned it as a Variable. That allowed the code to work previously, whereas now the result is an IndexVariable and unstack fails because of that. So this may be a case of it accidentally working before.

spencerkclark commented 5 months ago

Yup, this is consistent with what I found. I should clarify, I'm not sure if the current behavior is intentional—it would be nice if the minimal example (and your more real-world use-case) worked.
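In the meantime, a possible workaround, sketched here on the assumption that the failure comes only from the array being IndexVariable-backed, is to rebuild the array from its `.data` before setting the MultiIndex. This mirrors the `days` path that works in the original report. The 2x2 layout and the names `plain` and `"values"` are illustrative choices, not part of the report.

```python
import xarray as xr

source = xr.DataArray(range(4), dims=["x"], coords=[[0, 1, 2, 3]])

# Rebuild from the raw values so the array is backed by a plain Variable
# rather than the coordinate's IndexVariable.
plain = xr.DataArray(
    source.x.data, dims=["x"], coords={"x": source.x}, name="values"
)

# Same steps as in the examples above: attach the two level coordinates,
# turn them into a MultiIndex on "x", then unstack.
plain = plain.assign_coords(
    y=("x", ["c", "c", "d", "d"]),
    z=("x", ["e", "f", "e", "f"]),
)
plain = plain.set_index(x=["y", "z"])
result = plain.unstack("x")

print(result.dims)
```

This only sidesteps the problem on the user side; it does not address whether unstack itself should accept an IndexVariable-backed array.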

max-sixty commented 5 months ago

Thanks @spencerkclark !


I updated the title — feel free to refine a bit more