rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0
7.98k stars 866 forks source link

[BUG] Python groupby rolling aggregations return index inconsistent with pandas #10249

Open beckernick opened 2 years ago

beckernick commented 2 years ago

Python groupby rolling aggregations return a single Index that corresponds to the original row position of the element, but in pandas return a MultiIndex that includes both the groupby key(s) and original row position.

This is not currently blocking any behavior with Dask + cuDF, as grouped rolling operations are blocked by #10173

import pandas as pd
import cudf
import numpy as np
​
df = cudf.datasets.randomdata(nrows=100000)
pdf = df.to_pandas()
​
print(pdf.groupby(['id']).rolling(window=3).x.mean().head())
print(df.groupby(['id']).rolling(window=3).x.mean().head())
id        
879  43605   NaN
881  3941    NaN
882  29855   NaN
884  14616   NaN
     70864   NaN
Name: x, dtype: float64
43605    <NA>
3941     <NA>
29855    <NA>
14616    <NA>
70864    <NA>
Name: x, dtype: float64
github-actions[bot] commented 2 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

beckernick commented 11 months ago

Took a quick look at this issue again. I suspect that the reason this occurs is due to the __getitem__ call (.x). RollingGroupby inherits from Rolling, which has the getitem implementation.

This implementation returns a standard Rolling object:

https://github.com/rapidsai/cudf/blob/3bacb12deca71667646057d1790f0786c20c2d53/python/cudf/cudf/core/window/rolling.py#L204-L211

So the call to mean() goes down the standard Rolling._apply_agg path rather than the RollingGroupby._apply_agg. This path doesn't create a MultiIndex:

https://github.com/rapidsai/cudf/blob/3bacb12deca71667646057d1790f0786c20c2d53/python/cudf/cudf/core/window/rolling.py#L264-L284

Unlike the RollingGroupby._apply_agg path:

https://github.com/rapidsai/cudf/blob/3bacb12deca71667646057d1790f0786c20c2d53/python/cudf/cudf/core/window/rolling.py#L545-L559

beckernick commented 11 months ago

To make this concrete:

import cudf

df = cudf.datasets.randomdata()

print("getitem call on the RollingGroupby object")
print(df.groupby(["id"]).rolling(window=3).x.mean().head())

print("getitem call before the RollingGroupby is created")
print(df.groupby(["id"]).x.rolling(window=3).mean().head())
getitem call on the RollingGroupby object
3    <NA>
8    <NA>
6    <NA>
7    <NA>
9    <NA>
Name: x, dtype: float64
getitem call before the RollingGroupby is created
id    
918  3    <NA>
963  8    <NA>
967  6    <NA>
990  7    <NA>
     9    <NA>
Name: x, dtype: float64
beckernick commented 1 month ago

Ran into this today while using cudf.pandas.

cc @galipremsagar @vyasr , would this be a good first issue for a new contributor?

vyasr commented 1 month ago

Yes, I think so!