Describe the bug
When attempting to use the cumulative aggregation functions on a groupby of a frame or series with nulls (e.g. df.groupby("x").cumsum()), we get a ValueError.
Steps/Code to reproduce bug
Note we would see the same traceback if we used cumcount, cummax, or cummin:
ValueError Traceback (most recent call last)
Cell In [3], line 16
11 gdf = gdf.mask(cp.random.choice([True, False], size=gdf.shape))
13 gdf_grouped = gdf.groupby("xx")
---> 16 gdf_grouped.xx.cumsum()
17 gdf_grouped.cumcount()
18 gdf_grouped.cummax()
File /raid/charlesb/mambaforge/envs/checkout-groupby-layers/lib/python3.9/site-packages/cudf/core/mixins/mixin_factory.py:11, in _partialmethod.<locals>.wrapper(self, *args2, **kwargs2)
10 def wrapper(self, *args2, **kwargs2):
---> 11 return method(self, *args1, *args2, **kwargs1, **kwargs2)
File /raid/charlesb/mambaforge/envs/checkout-groupby-layers/lib/python3.9/site-packages/cudf/core/groupby/groupby.py:536, in GroupBy._scan(self, op, *args, **kwargs)
534 def _scan(self, op: str, *args, **kwargs):
535 """{op_name} for each group."""
--> 536 return self.agg(op)
File /raid/charlesb/mambaforge/envs/checkout-groupby-layers/lib/python3.9/site-packages/cudf/core/groupby/groupby.py:1749, in SeriesGroupBy.agg(self, func)
1748 def agg(self, func):
-> 1749 result = super().agg(func)
1751 # downcast the result to a Series:
1752 if len(result._data):
File /raid/charlesb/mambaforge/envs/checkout-groupby-layers/lib/python3.9/contextlib.py:79, in ContextDecorator.__call__.<locals>.inner(*args, **kwds)
76 @wraps(func)
77 def inner(*args, **kwds):
78 with self._recreate_cm():
---> 79 return func(*args, **kwds)
File /raid/charlesb/mambaforge/envs/checkout-groupby-layers/lib/python3.9/site-packages/cudf/core/groupby/groupby.py:490, in GroupBy.agg(self, func)
487 result = result.reset_index()
488 if libgroupby._is_all_scan_aggregate(normalized_aggs):
489 # Scan aggregations return rows in original index order
--> 490 return self._mimic_pandas_order(result)
492 return result
File /raid/charlesb/mambaforge/envs/checkout-groupby-layers/lib/python3.9/site-packages/cudf/core/groupby/groupby.py:1724, in GroupBy._mimic_pandas_order(self, result)
1722 gather_map = order_cols[0].argsort()
1723 result = result.take(gather_map)
-> 1724 result.index = self.obj.index
1725 return result
File /raid/charlesb/mambaforge/envs/checkout-groupby-layers/lib/python3.9/site-packages/cudf/core/dataframe.py:1091, in DataFrame.__setattr__(self, key, col)
1089 super().__setattr__(key, col)
1090 else:
-> 1091 super().__setattr__(key, col)
File /raid/charlesb/mambaforge/envs/checkout-groupby-layers/lib/python3.9/site-packages/cudf/core/indexed_frame.py:533, in IndexedFrame.index(self, value)
531 # A DataFrame with 0 columns can have an index of arbitrary length.
532 if len(self._data) > 0 and new_length != old_length:
--> 533 raise ValueError(
534 f"Length mismatch: Expected axis has {old_length} elements, "
535 f"new values have {len(value)} elements"
536 )
537 self._index = Index(value)
ValueError: Length mismatch: Expected axis has 5032 elements, new values have 10000 elements
Expected behavior
I would expect these operations to succeed and give me something roughly similar to the output of pandas:
In [2]: gdf.to_pandas().groupby("xx").cumsum()
Out[2]:
x y
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
... ... ...
9995 15.674543 13.537169
9996 -22.231675 NaN
9997 NaN -19.599722
9998 NaN NaN
9999 NaN NaN
[10000 rows x 2 columns]
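To make the expected semantics concrete, here is a minimal pure-Python sketch of what the pandas output above appears to do. It is not cudf or pandas code. The helper name grouped_cumsum and the sample data are hypothetical, and it assumes that a row with a null key belongs to no group and comes back as null, that a null value is skipped without resetting the running total, and that the result keeps the original row order and length.

```python
from collections import defaultdict


def grouped_cumsum(keys, values):
    """Per-group cumulative sum, returned in the original row order.

    Hypothetical helper for illustration only; None stands in for a null.
    """
    running = defaultdict(float)  # running total for each non-null key
    out = []
    for key, value in zip(keys, values):
        if key is None or value is None:
            # Null key: the row is in no group. Null value: skipped.
            # Either way the output row is null, but the row is kept.
            out.append(None)
            continue
        running[key] += value
        out.append(running[key])
    return out


keys = ["a", None, "a", "b", "a", "b"]
values = [1.0, 2.0, None, 4.0, 5.0, 6.0]
result = grouped_cumsum(keys, values)
print(result)
```

The point is that the output has one entry per input row, so assigning the original index back onto it cannot hit a length mismatch.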
Environment overview (please complete the following information)
Environment details
Additional context
Ran into this issue while adding null testing to dask-cudf's groupby tests in https://github.com/rapidsai/cudf/pull/10853.