rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[BUG] Groupby rolling window aggregations on time windows fail for SeriesGroupby #10175

Open beckernick opened 2 years ago

beckernick commented 2 years ago

With a time-based window (for example '1D'), grouped rolling window aggregations currently fail on SeriesGroupBy objects but succeed on DataFrameGroupBy objects.

import dask
import cudf

df = cudf.datasets.timeseries()

print(df.groupby("name").rolling('1D').mean().head())  # DataFrameGroupBy: works

print(df.groupby("name").x.rolling('1D').mean().head())  # SeriesGroupBy: raises ValueError
                                    id         x         y
name  timestamp                                           
Alice 2000-01-01 00:00:36   992.000000 -0.875947  0.624249
      2000-01-01 00:01:51  1008.000000 -0.380349  0.473271
      2000-01-01 00:02:30   998.333333 -0.472583  0.249997
      2000-01-01 00:02:59   994.750000 -0.396048  0.408755
      2000-01-01 00:03:02   998.400000 -0.483499  0.170219
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [28], in <module>
      4 df = cudf.datasets.timeseries()
      6 print(df.groupby("name").rolling('1D').mean().head())
----> 7 print(df.groupby("name").x.rolling('1D').mean().head())

File ~/conda/envs/cudf-22.04/lib/python3.8/site-packages/cudf/core/groupby/groupby.py:729, in GroupBy.rolling(self, *args, **kwargs)
    720 def rolling(self, *args, **kwargs):
    721     """
    722     Returns a `RollingGroupby` object that enables rolling window
    723     calculations on the groups.
   (...)
    727     cudf.core.window.Rolling
    728     """
--> 729     return cudf.core.window.rolling.RollingGroupby(self, *args, **kwargs)

File ~/conda/envs/cudf-22.04/lib/python3.8/site-packages/cudf/core/window/rolling.py:440, in RollingGroupby.__init__(self, groupby, window, min_periods, center)
    435 gb_size = groupby.size().sort_index()
    436 self._group_starts = (
    437     gb_size.cumsum().shift(1).fillna(0).repeat(gb_size)
    438 )
--> 440 super().__init__(obj, window, min_periods=min_periods, center=center)

File ~/conda/envs/cudf-22.04/lib/python3.8/site-packages/cudf/core/window/rolling.py:179, in Rolling.__init__(self, obj, window, min_periods, center, axis, win_type)
    177 self.min_periods = min_periods
    178 self.center = center
--> 179 self._normalize()
    180 self.agg_params = {}
    181 if axis != 0:

File ~/conda/envs/cudf-22.04/lib/python3.8/site-packages/cudf/core/window/rolling.py:375, in Rolling._normalize(self)
    372     return
    374 if not isinstance(self.obj.index, cudf.core.index.DatetimeIndex):
--> 375     raise ValueError(
    376         "window must be an integer for non datetime index"
    377     )
    379 self._time_window = True
    381 try:

ValueError: window must be an integer for non datetime index

The error does not occur for grouped rolling window aggregations that use an integer window instead of a time window.

import dask
import cudf

df = cudf.datasets.timeseries().reset_index(drop=True)
df.groupby("name").rolling(window=3).x.sum().head()  # integer window: works
48            <NA>
71            <NA>
86     1.067738088
99     1.142080475
103    0.702573636
Name: x, dtype: float64
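
For reference, the result the failing time-window call would be expected to produce can be sketched in plain Python. This is only an illustration of the semantics, not cuDF's implementation: the helper name grouped_time_rolling_mean and the sample rows are hypothetical, and it assumes the pandas convention that an offset window covers the interval (t - window, t], closed on the right, with rows sharing a timestamp all counted.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def grouped_time_rolling_mean(rows, window):
    """Reference semantics for a grouped, time-based rolling mean.

    rows: iterable of (group_key, timestamp, value) tuples.
    window: timedelta; each row averages the values of its own group whose
        timestamps fall in the interval (t - window, t] (assumed convention).
    Returns a dict mapping group_key -> list of (timestamp, mean), sorted
    by timestamp within each group.
    """
    by_group = defaultdict(list)
    for key, ts, value in rows:
        by_group[key].append((ts, value))

    result = {}
    for key, series in by_group.items():
        series.sort(key=lambda pair: pair[0])
        means = []
        for ts, _ in series:
            # Only rows from the same group inside the time window contribute.
            in_window = [v for t, v in series if ts - window < t <= ts]
            means.append((ts, sum(in_window) / len(in_window)))
        result[key] = means
    return result


# Hypothetical sample data, loosely shaped like cudf.datasets.timeseries().
rows = [
    ("Alice", datetime(2000, 1, 1, 0, 0), 1.0),
    ("Bob", datetime(2000, 1, 1, 0, 0), 10.0),
    ("Alice", datetime(2000, 1, 1, 12, 0), 3.0),
    ("Alice", datetime(2000, 1, 2, 6, 0), 5.0),
]
out = grouped_time_rolling_mean(rows, timedelta(days=1))
```

With a one-day window, the third Alice row averages only the two later values, because the first row is more than a day older; Bob's rows never mix with Alice's. The quadratic scan is for clarity only and is not meant to reflect how a GPU implementation would compute the windows.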

Env:

conda list | grep "rapids\|dask"
cudf         22.04.00a220131  cuda_11_py38_gc25d35b361_93      rapidsai-nightly
dask         2022.1.0         pyhd8ed1ab_0                     conda-forge
dask-core    2022.1.0         pyhd8ed1ab_0                     conda-forge
dask-cudf    22.04.00a220131  cuda_11_py38_gc25d35b361_93      rapidsai-nightly
libcudf      22.04.00a220131  cuda11_gc25d35b361_93            rapidsai-nightly
librmm       22.04.00a220131  cuda11_g81d523a_15               rapidsai-nightly
ptxcompiler  0.2.0            py38h98f4b32_0                   rapidsai-nightly
rmm          22.04.00a220131  cuda11_py38_g81d523a_15_has_cma  rapidsai-nightly

Perhaps relevant to https://github.com/rapidsai/cudf/issues/10173

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.