pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

IndexError when using multi-variable BinGrouper #9630

Open phil-blain opened 4 days ago

phil-blain commented 4 days ago

What happened?

I tried using the new multi-dimensional grouping added in #9372, with one BinGrouper per dimension. I'm using version 2024.09.0. If I construct the BinGrouper such that some bins end up empty, I get an IndexError:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[9], line 1
----> 1 ds.groupby(x=BinGrouper(np.arange(0,13,4)), y=BinGrouper(bins=np.arange(0,16,2)))

File /home/me/.conda/envs/xarray_2024.09/lib/python3.12/site-packages/xarray/util/deprecation_helpers.py:118, in _deprecate_positional_args.<locals>._decorator.<locals>.inner(*args, **kwargs)
    114     kwargs.update({name: arg for name, arg in zip_args})
    116     return func(*args[:-n_extra_args], **kwargs)
--> 118 return func(*args, **kwargs)

File /home/me/.conda/envs/xarray_2024.09/lib/python3.12/site-packages/xarray/core/dataset.py:10444, in Dataset.groupby(self, group, squeeze, restore_coord_dims, **groupers)
  10441 _validate_groupby_squeeze(squeeze)
  10442 rgroupers = _parse_group_and_groupers(self, group, groupers)
> 10444 return DatasetGroupBy(self, rgroupers, restore_coord_dims=restore_coord_dims)

File /home/me/.conda/envs/xarray_2024.09/lib/python3.12/site-packages/xarray/core/groupby.py:581, in GroupBy.__init__(self, obj, groupers, restore_coord_dims)
    573     if any(
    574         isinstance(obj._indexes.get(grouper.name, None), PandasMultiIndex)
    575         for grouper in groupers
    576     ):
    577         raise NotImplementedError(
    578             "Grouping by multiple variables, one of which "
    579             "wraps a Pandas MultiIndex, is not supported yet."
    580         )
--> 581     self.encoded = ComposedGrouper(groupers).factorize()
    583 # specification for the groupby operation
    584 # TODO: handle obj having variables that are not present on any of the groupers
    585 #       simple broadcasting fails for ExtensionArrays.
    586 (self.group1d, self._obj, self._stacked_dim, self._inserted_dims) = _ensure_1d(
    587     group=self.encoded.codes, obj=obj
    588 )

File /home/me/.conda/envs/xarray_2024.09/lib/python3.12/site-packages/xarray/core/groupby.py:470, in ComposedGrouper.factorize(self)
    464 midx = pd.MultiIndex.from_product(
    465     (grouper.unique_coord.data for grouper in groupers),
    466     names=tuple(grouper.name for grouper in groupers),
    467 )
    468 # Constructing an index from the product is wrong when there are missing groups
    469 # (e.g. binning, resampling). Account for that now.
--> 470 midx = midx[np.sort(pd.unique(_flatcodes[~mask]))]
    472 full_index = pd.MultiIndex.from_product(
    473     (grouper.full_index.values for grouper in groupers),
    474     names=tuple(grouper.name for grouper in groupers),
    475 )
    476 dim_name = "stacked_" + "_".join(str(grouper.name) for grouper in groupers)

File /home/me/.conda/envs/xarray_2024.09/lib/python3.12/site-packages/pandas/core/indexes/multi.py:2207, in MultiIndex.__getitem__(self, key)
   2204 elif isinstance(key, Index):
   2205     key = np.asarray(key)
-> 2207 new_codes = [level_codes[key] for level_codes in self.codes]
   2209 return MultiIndex(
   2210     levels=self.levels,
   2211     codes=new_codes,
   (...)
   2214     verify_integrity=False,
   2215 )

IndexError: index 18 is out of bounds for axis 0 with size 18

What did you expect to happen?

Grouping should succeed even when some bins are empty, as it already does when grouping by a single variable.

Minimal Complete Verifiable Example

import numpy as np
import xarray as xr
from xarray.groupers import BinGrouper

ds = xr.Dataset(
    {"foo": (("z"), np.random.random_sample(12))},
    coords={"x": ("z", np.arange(12)), "y": ("z", np.arange(12))},
)
# Raises IndexError on 2024.09.0 (some of the y bins end up empty)
ds.groupby(x=BinGrouper(np.arange(0,13,4)), y=BinGrouper(bins=np.arange(0,16,2)))
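To see which bins are empty for the data above, here is a small pandas-only check. It assumes that binning with `pd.cut` and its default right-closed intervals gives the same bins as `BinGrouper` with default options; the names `values`, `bin_edges` and `empty_bins` are only for illustration.

```python
import numpy as np
import pandas as pd

# Same values as the x and y coordinates in the example above.
values = np.arange(12)

# Bin edges passed to the two BinGroupers in the failing call.
bin_edges = {"x": np.arange(0, 13, 4), "y": np.arange(0, 16, 2)}

empty_bins = {}
for name, edges in bin_edges.items():
    # value_counts on categorical data also reports bins with zero members.
    counts = pd.Series(pd.cut(values, edges)).value_counts(sort=False)
    empty_bins[name] = [interval for interval, n in counts.items() if n == 0]
    print(name, "empty bins:", [str(interval) for interval in empty_bins[name]])
```

Under that assumption, no x bin is empty and the only empty y bin is (12, 14], so 6 of the 7 y bins are populated.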

Anything else we need to know?

If we make sure that no bins are empty, it works, e.g.

ds.groupby(x=BinGrouper(np.arange(0,13,4)), y=BinGrouper(bins=np.arange(0,16,4)))

Using the same y bins as in the failing call, but grouping by that single variable only, also works:

ds.groupby(y=BinGrouper(bins=np.arange(0,16,2)))
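My guess at the cause, sketched with pandas and NumPy only: the traceback shows `midx` being built from the product of the observed bins (`unique_coord`) and then indexed with flattened codes. The sketch below assumes those flattened codes are computed against the full number of bins per variable; that part is not visible in the traceback, so treat it as a hypothesis rather than a description of the xarray internals.

```python
import numpy as np
import pandas as pd

x = np.arange(12)
y = np.arange(12)

# Bin both variables with right-closed intervals (the pd.cut default).
x_cat = pd.cut(x, np.arange(0, 13, 4))  # 3 bins
y_cat = pd.cut(y, np.arange(0, 16, 2))  # 7 bins, one of them empty

x_codes = np.asarray(x_cat.codes).astype(np.intp)
y_codes = np.asarray(y_cat.codes).astype(np.intp)
mask = (x_codes == -1) | (y_codes == -1)  # values outside every bin

# ASSUMPTION: flat codes are raveled against the full bin counts, here (3, 7).
full_shape = (len(x_cat.categories), len(y_cat.categories))
flat = np.ravel_multi_index((x_codes[~mask], y_codes[~mask]), full_shape)

# Product of only the bins that actually occur, like `midx` in the traceback.
x_seen = x_cat.categories[np.unique(x_codes[~mask])]
y_seen = y_cat.categories[np.unique(y_codes[~mask])]
midx = pd.MultiIndex.from_product([x_seen, y_seen], names=["x", "y"])

print("len(midx):", len(midx))
print("flat codes:", np.sort(pd.unique(flat)))
```

With these inputs the product index has 3 * 6 = 18 entries, while the flat codes reach 18 and 19, which would match the reported "index 18 is out of bounds for axis 0 with size 18". When no bin is empty the two sizes agree, which would explain why the variant with `np.arange(0,16,4)` works.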

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.12.7 | packaged by conda-forge | (main, Oct 4 2024, 16:05:46) [GCC 13.3.0]
python-bits: 64
OS: Linux
OS-release: 4.18.0-372.9.1.el8.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.4
libnetcdf: 4.9.2

xarray: 2024.9.0
pandas: 2.2.3
numpy: 2.1.2
scipy: 1.14.1
netCDF4: 1.7.1
pydap: None
h5netcdf: None
h5py: None
zarr: None
cftime: 1.6.4
nc_time_axis: None
iris: None
bottleneck: None
dask: 2024.9.1
distributed: 2024.9.1
matplotlib: 3.9.2
cartopy: None
seaborn: None
numbagg: None
fsspec: 2024.9.0
cupy: None
pint: None
sparse: None
flox: 0.9.12
numpy_groupies: 0.11.2
setuptools: 75.1.0
pip: 24.2
conda: None
pytest: None
mypy: None
IPython: 8.28.0
sphinx: None

phil-blain commented 1 day ago

@dcherian