pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0
3.61k stars 1.08k forks source link

groupby(multi-index level) not working correctly on a multi-indexed DataArray or DataSet #6836

Closed emmaai closed 1 year ago

emmaai commented 2 years ago

What happened?

run the code block below with 2022.6.0

midx = pd.MultiIndex.from_product([list("abc"), [0, 1]], names=("one", "two"))

mda = xr.DataArray(np.random.rand(6, 3), [("x", midx), ("y", range(3))])

mda.groupby("one").groups

output:

In [15]: mda.groupby("one").groups
Out[15]: 
{('a', 0): [0],
 ('a', 1): [1],
 ('b', 0): [2],
 ('b', 1): [3],
 ('c', 0): [4],
 ('c', 1): [5]}

What did you expect to happen?

as it was with 2022.3.0

In [6]: mda.groupby("one").groups
Out[6]: {'a': [0, 1], 'b': [2, 3], 'c': [4, 5]}

Minimal Complete Verifiable Example

import pandas as pd
import numpy as np
import xarray as XR

midx = pd.MultiIndex.from_product([list("abc"), [0, 1]], names=("one", "two"))

mda = xr.DataArray(np.random.rand(6, 3), [("x", midx), ("y", range(3))])

mda.groupby("one").groups

MVCE confirmation

Relevant log output

N/A

Anything else we need to know?

N/A

Environment

INSTALLED VERSIONS ------------------ commit: None python: 3.8.10 (default, Mar 15 2022, 12:22:08) [GCC 9.4.0] python-bits: 64 OS: Linux OS-release: 5.11.0-1025-aws machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: C.UTF-8 LOCALE: ('en_US', 'UTF-8') libhdf5: 1.12.0 libnetcdf: 4.7.4 xarray: 2022.6.0 pandas: 1.4.3 numpy: 1.22.4 scipy: 1.7.3 netCDF4: 1.5.8 pydap: None h5netcdf: None h5py: None Nio: None zarr: None cftime: 1.5.1.1 nc_time_axis: None PseudoNetCDF: None rasterio: 1.2.10 cfgrib: None iris: None bottleneck: 1.3.2 dask: 2022.04.1 distributed: 2022.4.1 matplotlib: 3.5.1 cartopy: 0.20.3 seaborn: 0.11.2 numbagg: None fsspec: 2022.01.0 cupy: None pint: None sparse: None flox: None numpy_groupies: None setuptools: 45.2.0 pip: 22.2 conda: None pytest: 7.1.2 IPython: 7.31.0 sphinx: None
dcherian commented 2 years ago

@benbovy I tracked this down to

>>> mda.one.to_index()
# v2022.06.0
MultiIndex([('a', 0),
            ('a', 1),
            ('b', 0),
            ('b', 1),
            ('c', 0),
            ('c', 1)],
           names=['one', 'two'])

# v2022.03.0
Index(['a', 'a', 'b', 'b', 'c', 'c'], dtype='object', name='x')

We call to_index here in safe_cast_to_index: https://github.com/pydata/xarray/blob/f8fee902360f2330ab8c002d54480d357365c172/xarray/core/utils.py#L115-L140

Not sure if the fix should be only in the GroupBy specifically or more generally in safe_cast_to_index

The GroupBy context is https://github.com/pydata/xarray/blob/f8fee902360f2330ab8c002d54480d357365c172/xarray/core/groupby.py#L434

FabianHofmann commented 2 years ago

After trying to dig down further into the code, I saw that grouping over levels seems to be broken generally (up-to-date main branch at time of writing), i.e.

import pandas as pd
import numpy as np
import xarray as xr

midx = pd.MultiIndex.from_product([list("abc"), [0, 1]], names=("one", "two"))
mda = xr.DataArray(np.random.rand(6, 3), [("x", midx), ("y", range(3))])
mda.groupby("one").sum()

raises:


  File ".../xarray/xarray/core/_reductions.py", line 5055, in sum
    return self.reduce(

  File ".../xarray/xarray/core/groupby.py", line 1191, in reduce
    return self.map(reduce_array, shortcut=shortcut)

  File ".../xarray/xarray/core/groupby.py", line 1095, in map
    return self._combine(applied, shortcut=shortcut)

  File ".../xarray/xarray/core/groupby.py", line 1127, in _combine
    index, index_vars = create_default_index_implicit(coord)

  File ".../xarray/xarray/core/indexes.py", line 974, in create_default_index_implicit
    index = PandasMultiIndex(array, name)

  File ".../xarray/xarray/core/indexes.py", line 552, in __init__
    raise ValueError(

ValueError: conflicting multi-index level name 'one' with dimension 'one'

in the function create_default_index_implicit. I am still a bit puzzled how to approach this. Any idea @benbovy?

benbovy commented 2 years ago

Thanks @emmaai for the issue report and thanks @dcherian and @FabianHofmann for tracking it down.

There is a lot of complexity related to pandas.MultiIndex special cases and it's been difficult to avoid new issues arising during the index refactor.

create_default_index_implicit has some hacks to create xarray objects directly from pandas.MultiIndex instances (e.g., xr.Dataset(coords={"x": pd_midx})) or even from xarray objects wrapping multi-indexes. The error raised here suggests that the issue should fixed before this call... Probably in safe_cast_to_index indeed.

We should probably avoid using .to_index() internally, or should we even deprecate it? The fact that mda.one.to_index() (in v2022.3.0) doesn't return the same result than mda.indexes["one"] adds more confusion than it adds value. Actually, in the long-term I'd be for deprecating all pandas.MultiIndex special cases in Xarray.

benbovy commented 1 year ago

From #7282 it looks like we need to convert the multi-index level to a single index when casting the group to an index. And from #7105 we can fix that in safe_cast_to_index() (sometimes the full multi-index is expected) so we probably need a special case in groupby.

benbovy commented 1 year ago

we can fix that in safe_cast_to_index()

...we cannot fix that in safe_cast_to_index() (or we can add a parameter to specify the desired result).

mschrimpf commented 1 year ago

Is there hope for groupby working on multi-indexed DataArrays again in the future? We -- and from the issue history it looks like others too -- are currently pinning xarray<2022.6 even though we would love to use newer versions.

dcherian commented 1 year ago

I think we could special-case extracting a multiindex level here: https://github.com/pydata/xarray/blob/d4db16699f30ad1dc3e6861601247abf4ac96567/xarray/core/groupby.py#L469

group at that stage should have values

['a', 'a', 'b', 'b', 'c', 'c']

@mschrimpf Can you try that and send in a PR?

benbovy commented 1 year ago

A special-case sounds reasonable to me as well as a temporary fix before looking into if/how we can refactor groupby so that it works with multiple kinds of built-in and/or custom indexes.

emmaai commented 1 year ago

I solved it temporarily by reset_index to groupby and set_xindex after, if anyone is looking.

carynbear commented 2 months ago

This is happening again :(

max-sixty commented 2 months ago

@carynbear please could you post a reproducible example? A new bug report would be ideal...