pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Concatenate using Multiindex cannot be unstacked anymore #7148

Open lpilz opened 1 year ago

lpilz commented 1 year ago

What happened?

When I concatenate datasets along a pandas MultiIndex and then unstack the result to get two independent dimensions (e.g. for varying different parameters in a simulation), unstack raises an error. I have seen different errors with different data: the MCVE fails with ValueError: IndexVariable objects must be 1-dimensional, while my own data fails with ValueError: cannot re-index or align objects with conflicting indexes found for the following dimensions: 'concat_dim' (2 conflicting indexes).

One hint at the bug might be that conc._indexes shows more indexes than display(conc).
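A rough way to look for that kind of mismatch through the public API is to compare the names in xindexes with the names in coords. This is only a sketch: the helper name index_names_without_coords is hypothetical, and it assumes the stray entries seen in the private _indexes mapping would also appear in xindexes, which the report does not confirm.

```python
import numpy as np
import xarray as xr


def index_names_without_coords(obj):
    """Hypothetical helper: names that carry an index but have no coordinate."""
    return sorted(set(obj.xindexes) - set(obj.coords))


# Same dataset as in the example from this report.
ds = xr.Dataset(
    data_vars={"a": (("dim1", "dim2"), np.arange(16).reshape(4, 4))},
    coords={"dim1": list(range(4)), "dim2": list(range(2, 6))},
)

# On a consistent object every indexed name is also a coordinate,
# so the helper returns an empty list.
print(list(ds.xindexes))
print(list(ds.coords))
print(index_names_without_coords(ds))
```

Calling the helper on the object returned by xr.concat(dslist, dim=mindex) would list any index names that have no matching coordinate on an affected version.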

What did you expect to happen?

Originally (I think it was v2022.3.0), it unstacked neatly into the two levels of the MultiIndex as separate dimensions.

Minimal Complete Verifiable Example

import xarray as xr
import numpy as np
import pandas as pd

ds = xr.Dataset(data_vars={"a": (("dim1", "dim2"), np.arange(16).reshape(4,4))}, coords={"dim1": list(range(4)), "dim2": list(range(2,6))})
dslist = [ds for i in range(6)]

arrays = [
    ["bar", "bar", "baz", "baz", "foo", "foo"],
    ["one", "two", "one", "two", "one", "two"],
]
mindex = pd.MultiIndex.from_tuples(list(zip(*arrays)), names=["first", "second"])

conc = xr.concat(dslist, dim=mindex)
conc.unstack("concat_dim") # this errors

conc = xr.concat(dslist, dim='concat_dim')
conc = conc.assign_coords(dict(concat_dim=mindex)).unstack("concat_dim") # this does not

MVCE confirmation

Relevant log output

No response

Anything else we need to know?

No response

Environment

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [24], line 15
     12 mindex = pd.MultiIndex.from_tuples(list(zip(*arrays)), names=["first", "second"])
     14 conc = xr.concat(dslist, dim=mindex)
---> 15 conc.unstack("concat_dim")

File ~/.conda/envs/xwrf-dev/lib/python3.10/site-packages/xarray/core/dataset.py:4870, in Dataset.unstack(self, dim, fill_value, sparse)
   4866     result = result._unstack_full_reindex(
   4867         d, stacked_indexes[d], fill_value, sparse
   4868     )
   4869 else:
-> 4870     result = result._unstack_once(d, stacked_indexes[d], fill_value, sparse)
   4871 return result

File ~/.conda/envs/xwrf-dev/lib/python3.10/site-packages/xarray/core/dataset.py:4706, in Dataset._unstack_once(self, dim, index_and_vars, fill_value, sparse)
   4703 else:
   4704     fill_value_ = fill_value
-> 4706 variables[name] = var._unstack_once(
   4707     index=clean_index,
   4708     dim=dim,
   4709     fill_value=fill_value_,
   4710     sparse=sparse,
   4711 )
   4712 else:
   4713     variables[name] = var

File ~/.conda/envs/xwrf-dev/lib/python3.10/site-packages/xarray/core/variable.py:1764, in Variable._unstack_once(self, index, dim, fill_value, sparse)
   1759 # Indexer is a list of lists of locations. Each list is the locations
   1760 # on the new dimension. This is robust to the data being sparse; in that
   1761 # case the destinations will be NaN / zero.
   1762 data[(..., *indexer)] = reordered
-> 1764 return self._replace(dims=new_dims, data=data)

File ~/.conda/envs/xwrf-dev/lib/python3.10/site-packages/xarray/core/variable.py:1017, in Variable._replace(self, dims, data, attrs, encoding)
   1015 if encoding is _default:
   1016     encoding = copy.copy(self._encoding)
-> 1017 return type(self)(dims, data, attrs, encoding, fastpath=True)

File ~/.conda/envs/xwrf-dev/lib/python3.10/site-packages/xarray/core/variable.py:2776, in IndexVariable.__init__(self, dims, data, attrs, encoding, fastpath)
   2774 super().__init__(dims, data, attrs, encoding, fastpath)
   2775 if self.ndim != 1:
-> 2776     raise ValueError(f"{type(self).__name__} objects must be 1-dimensional")
   2778 # Unlike in Variable, always eagerly load values into memory
   2779 if not isinstance(self._data, PandasIndexingAdapter):

ValueError: IndexVariable objects must be 1-dimensional

conc = xr.concat(dslist, dim='concat_dim')
conc = conc.assign_coords(dict(concat_dim=index)).unstack("concat_dim")
conc

xarray.Dataset
Dimensions: (first: 3, second: 2, dim1: 4, dim2: 4)
Coordinates:
    first (first) object 'bar' 'baz' 'foo'
    second (second) object 'one' 'two'
    dim1 (dim1) int64 0 1 2 3
    dim2 (dim2) int64 2 3 4 5
Data variables:
    a (dim1, dim2, first, second) int64 0 0 0 0 0 0 1 ... 15 15 15 15 15 15
Attributes: (0)

xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:35:26) [GCC 10.4.0]
python-bits: 64
OS: Linux
OS-release: 4.18.0-305.25.1.el8_4.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.8.1

xarray: 2022.9.0
pandas: 1.5.0
numpy: 1.23.3
scipy: 1.9.1
netCDF4: 1.6.1
pydap: None
h5netcdf: 1.0.2
h5py: 3.7.0
Nio: None
zarr: None
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2022.9.2
distributed: 2022.9.2
matplotlib: 3.6.0
cartopy: 0.21.0
seaborn: None
numbagg: None
fsspec: 2022.8.2
cupy: None
pint: 0.19.2
sparse: None
flox: None
numpy_groupies: None
setuptools: 65.4.1
pip: 22.2.2
conda: None
pytest: None
IPython: 8.5.0
sphinx: None

mathause commented 1 year ago

Thanks for the report. There was a recent release (2022.09) with many index-related bugfixes (including MultiIndex). Could you test with that version?

lpilz commented 1 year ago

Thanks for the quick response. This is actually also an issue on 2022.9.0; I just had the wrong kernel selected. I have updated the show_versions output.

benbovy commented 1 year ago

Looks like passing a pandas.MultiIndex object as the dim argument to concat was forgotten during the explicit indexes refactor. While this can be fixed (it could be tricky), we should deprecate it: it is convenient but probably too neat now that multi-index levels have their own "real" coordinates (see https://github.com/pydata/xarray/issues/6293#issuecomment-1259228475). It is preferable to explicitly chain concat with assign_coords (and set_index), like the last line in your example.
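For reference, the explicit chain suggested above could look roughly like this. It is a sketch that uses the set_index route instead of passing a pandas.MultiIndex to assign_coords; the level values come from the example in this report, and the variable names first, second and unstacked are only for illustration.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    data_vars={"a": (("dim1", "dim2"), np.arange(16).reshape(4, 4))},
    coords={"dim1": list(range(4)), "dim2": list(range(2, 6))},
)
dslist = [ds for _ in range(6)]

# One entry per concatenated dataset, for each level of the multi-index.
first = ["bar", "bar", "baz", "baz", "foo", "foo"]
second = ["one", "two", "one", "two", "one", "two"]

# 1. Concatenate along a new, plain dimension.
conc = xr.concat(dslist, dim="concat_dim")

# 2. Attach the level values as ordinary coordinates on that dimension.
conc = conc.assign_coords(
    first=("concat_dim", first),
    second=("concat_dim", second),
)

# 3. Build the multi-index from those coordinates, then unstack it.
conc = conc.set_index(concat_dim=["first", "second"])
unstacked = conc.unstack("concat_dim")

print(unstacked.sizes)
```

After unstacking, first and second are separate dimensions, and selecting one combination such as unstacked["a"].sel(first="baz", second="two") gives back the values of the corresponding input dataset.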