rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[FEA]: Support groupby.resample #12935

Open mroeschke opened 1 year ago

mroeschke commented 1 year ago

Is your feature request related to a problem? Please describe.

While working on a pandas-to-cudf workflow comparison, I noticed that groupby(...).resample(...) has not been implemented in cudf yet.

Describe the solution you'd like

In [26]: from datetime import datetime

In [27]: import pandas as pd

In [28]: import cudf

In [29]: data = {"group": list("abab"), "values": range(4), "ts": [datetime(2023,
    ...:  1, 1), datetime(2023, 1, 2), datetime(2023, 1, 3), datetime(2023, 1, 4)
    ...: ]}

In [30]: df = pd.DataFrame(data)

In [31]: df.groupby("group").resample("D", on="ts")["values"].mean()
Out[31]:
group  ts
a      2023-01-01    0.0
       2023-01-02    NaN
       2023-01-03    2.0
b      2023-01-02    1.0
       2023-01-03    NaN
       2023-01-04    3.0
Name: values, dtype: float64

In [32]: cu_df = cudf.DataFrame(data)

In [33]: cu_df.groupby("group").resample("D", on="ts")["values"].mean()
KeyError: 'resample'

During handling of the above exception, another exception occurred:

AttributeError: DataFrameGroupBy object has no attribute resample

Describe alternatives you've considered

A for loop over the groups of cu_df.groupby("group") can call resample on each group individually.
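The per-group loop described above could look roughly like the sketch below. It is written against pandas so that it runs without a GPU; swapping in cudf relies on the assumption that iterating over a cudf groupby and calling DataFrame.resample(..., on=...) behave the same way there. The helper name resample_per_group is hypothetical and not part of either library.

```python
from datetime import datetime

import pandas as pd


def resample_per_group(df, key, rule, on, column):
    """Hypothetical helper: resample each group on its own, then stitch the pieces."""
    pieces = {}
    for name, group in df.groupby(key):
        # Resample only this group's rows; bins with no rows come out as NaN.
        pieces[name] = group.resample(rule, on=on)[column].mean()
    # The dict keys become the outer index level, the timestamps the inner one.
    return pd.concat(pieces, names=[key, on])


data = {
    "group": list("abab"),
    "values": range(4),
    "ts": [datetime(2023, 1, d) for d in (1, 2, 3, 4)],
}
df = pd.DataFrame(data)
result = resample_per_group(df, "group", "D", "ts", "values")
print(result)
```

On the example data this should give the same six rows as the pandas output above, including the NaN entries for the empty daily bins. The Python-level loop is likely to be slow when there are many groups, which is part of the motivation for native support.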

Additional context

https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.resample.html

shwina commented 1 year ago

Perhaps related: https://github.com/rapidsai/cudf/pull/12882

wence- commented 1 year ago

Resampling is somewhat different, I think. One possible implementation might be to produce the grouped dataframe and then call DataFrame.resample on the whole thing (since I think that resampling commutes with grouping).
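One way to read the "commutes with grouping" idea, offered only as an illustration and not as the proposed cudf implementation: assign every row to its time bin first, then group by the key and the bin together in a single pass. The sketch uses pandas and Series.dt.floor; that the same calls work unchanged on a cudf DataFrame is an assumption.

```python
from datetime import datetime

import pandas as pd

data = {
    "group": list("abab"),
    "values": range(4),
    "ts": [datetime(2023, 1, d) for d in (1, 2, 3, 4)],
}
df = pd.DataFrame(data)

# Snap each timestamp to the start of its daily bin, then group on
# (group, bin) in one pass instead of looping over the groups.
binned = df.assign(ts=df["ts"].dt.floor("D"))
flat = binned.groupby(["group", "ts"])["values"].mean()
print(flat)
```

Unlike groupby(...).resample(...), this yields rows only for bins that actually contain data, so the NaN rows for empty days are missing. Matching the pandas output exactly would need an extra step that reindexes each group over its full date range.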