rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[FEA] dask-cudf groupby with quantile and median methods #4706

Open rnyak opened 4 years ago

rnyak commented 4 years ago

Is your feature request related to a problem? Please describe. I'd like to calculate the median and/or a quantile of a column after grouping a dask-cudf DataFrame.

EDIT 5/10/2024: median is now implemented

Describe the solution you'd like I want the following code to work and generate correct results:

import cudf
import dask_cudf as dcu

cdf = cudf.DataFrame({'id4': 4*list(range(6)), 'id5': 4*list(reversed(range(6))), 'v3': 6*list(range(4))})
ddf = dcu.from_cudf(cdf, npartitions=1)

ddf.dtypes
id4    int64
id5    int64
v3     int64
dtype: object

ddf.head()

   id4  id5  v3
0    0    5   0
1    1    4   1
2    2    3   2
3    3    2   3
4    4    1   0

# These groupby operations do not work:
ans = ddf.groupby(['id4', 'id5'])[['v3']].median().compute()

# or

ans = ddf.groupby(['id4', 'id5'])[['v3']].quantile(q=0.5).compute()

Additional context I am using the RAPIDS 0.13 nightly release in a conda environment, with dask 2.12.0.

rnyak commented 4 years ago

@taureandyernv fyi.

randerzander commented 3 years ago

Still a desired [FEA]

beckernick commented 2 years ago

Ideally, this would also be enabled by an implementation in dask.dataframe, so that grouped quantile/percentile works on both CPUs and GPUs.

vyasr commented 4 months ago

It looks like median is now implemented:

In [27]: ddf.groupby(['id4', 'id5'])[['v3']].median().compute()
Out[27]: 
          v3
id4 id5     
0   5    1.0
1   4    2.0
2   3    1.0
3   2    2.0
4   1    1.0
5   0    2.0

quantile remains unimplemented.
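
Until a grouped quantile exists, one possible approach is to make sure every group sits entirely inside a single partition and then take the quantile partition by partition. The sketch below only illustrates that idea in plain pandas: the partitions are simulated with a hash of the key columns (standing in for a distributed shuffle), and `partitioned_group_quantile` and `npartitions` are hypothetical names, not part of dask or dask-cudf. It is not how either library implements anything, and it has not been run on a GPU.

```python
import pandas as pd


def partitioned_group_quantile(df, keys, value_cols, q, npartitions=3):
    """Exact per-group quantile, computed one simulated partition at a time."""
    # Hash the key columns so that all rows of a group land in the same bucket.
    # This stands in for the shuffle a distributed implementation would need.
    bucket = pd.util.hash_pandas_object(df[keys], index=False) % npartitions
    parts = [part for _, part in df.groupby(bucket.to_numpy())]
    # Each partition holds only complete groups, so a local quantile is exact.
    results = [part.groupby(keys)[value_cols].quantile(q) for part in parts]
    return pd.concat(results).sort_index()


# Same data as in the issue description, built with pandas instead of cudf.
df = pd.DataFrame({
    "id4": 4 * list(range(6)),
    "id5": 4 * list(reversed(range(6))),
    "v3": 6 * list(range(4)),
})
keys = ["id4", "id5"]

result = partitioned_group_quantile(df, keys, ["v3"], q=0.5)
expected = df.groupby(keys)[["v3"]].quantile(0.5)
print(result)
```

With `q=0.5` the result should agree with the median output shown above. The trade-off is that moving whole groups between partitions can be expensive, and a few very large groups can overload a single partition.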