rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[BUG] DataFrame.groupby.describe differs between cudf and pandas #12263

Open mattf opened 1 year ago

mattf commented 1 year ago

Describe the bug

>>> import pandas as pd
>>> import cudf
>>> cudf.__version__
'22.12.00a+281.gcc4b4dd27c'
>>> data = {'a': ['b'], 'p': ['q'], 'n': [0]}
>>> pd.DataFrame(data).groupby('a').describe()
      n                                  
  count mean std  min  25%  50%  75%  max
a                                        
b   1.0  0.0 NaN  0.0  0.0  0.0  0.0  0.0
>>> cudf.DataFrame(data).groupby('a').describe()
      p             n                                  
  count min max count mean   std min  25%  50%  75% max
a                                                      
b     1   q   q     1  0.0  <NA>   0  0.0  0.0  0.0   0

Environment overview

rapidsai/rapidsai-nightly:22.12-cuda11.5-runtime-rockylinux8-py3.9 on 29 Nov 2022

wence- commented 1 year ago

So pandas' behaviour with groupby-agg is to drop any column from the dataframe for which any requested agg is not supported. In contrast, cudf only drops the column if all aggs are unsupported. So it is probably easy to mimic the pandas behaviour (assuming that my description of their treatment is complete).
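
To make the difference between the two policies concrete, here is a minimal pure-Python sketch. It is not cudf's or pandas' actual dispatch code: the `SUPPORTED` table, the `kept_columns` helper and the policy names are all hypothetical, and the set of aggregations assumed to work on string columns is inferred only from the cudf output in the report above.

```python
# Hypothetical model of the two column-dropping policies described above.
# Nothing here is real pandas or cudf API; it only illustrates the rule.

# The aggregations that describe() requests, as seen in the output above.
DESCRIBE_AGGS = ["count", "mean", "std", "min", "25%", "50%", "75%", "max"]

# Assumption: which aggregations each kind of column supports.
# The string entry mirrors the count/min/max that cudf reported for column 'p'.
SUPPORTED = {
    "numeric": set(DESCRIBE_AGGS),
    "string": {"count", "min", "max"},
}


def kept_columns(column_kinds, aggs, policy):
    """Return the names of the columns that survive, in input order.

    policy="any_unsupported": drop a column if ANY requested agg is
        unsupported (the pandas behaviour as described in this thread).
    policy="all_unsupported": drop a column only if ALL requested aggs
        are unsupported (the cudf behaviour as described in this thread).
    """
    kept = []
    for name, kind in column_kinds.items():
        n_supported = sum(agg in SUPPORTED[kind] for agg in aggs)
        if policy == "any_unsupported":
            keep = n_supported == len(aggs)
        elif policy == "all_unsupported":
            keep = n_supported > 0
        else:
            raise ValueError(f"unknown policy: {policy!r}")
        if keep:
            kept.append(name)
    return kept


# Value columns from the report: 'p' holds strings, 'n' holds integers.
columns = {"p": "string", "n": "numeric"}

pandas_like = kept_columns(columns, DESCRIBE_AGGS, "any_unsupported")
cudf_like = kept_columns(columns, DESCRIBE_AGGS, "all_unsupported")

print(pandas_like)  # ['n']
print(cudf_like)    # ['p', 'n']
```

Under these assumptions the "any unsupported" policy keeps only `n`, matching the pandas output in the report, while the "all unsupported" policy keeps both `p` and `n`, matching the cudf output.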