rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[BUG] Series/Single Column DataFrame Groupby value_counts fails (DataFrame Groupby value_counts succeeds) #15696

Open beckernick opened 6 months ago

beckernick commented 6 months ago

Groupby value_counts fails when a single column is selected from the grouped DataFrame (either as a Series or as a one-column DataFrame), but succeeds when run on the entire grouped DataFrame.

import pandas as pd
import cudf

gdf = cudf.datasets.randomdata(dtypes={"id": int, "x": int})
pdf = gdf.to_pandas()

print(pdf.groupby("id").x.value_counts().head())
print(gdf.groupby("id").x.value_counts())
id   x   
942  988     1
961  1026    1
965  1062    1
984  981     1
993  999     1
Name: count, dtype: int64
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File /nvme/0/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/cudf/core/groupby/groupby.py:2783, in _Grouping._handle_by_or_level(self, by, level)
   2782 try:
-> 2783     self._handle_label(by)
   2784 except (KeyError, TypeError):

File /nvme/0/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/cudf/core/groupby/groupby.py:2845, in _Grouping._handle_label(self, by)
   2844     else:
-> 2845         raise e
   2846 self.names.append(by)

File /nvme/0/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/cudf/core/groupby/groupby.py:2839, in _Grouping._handle_label(self, by)
   2838 try:
-> 2839     self._key_columns.append(self._obj._data[by])
   2840 except KeyError as e:
   2841     # `by` can be index name(label) too.

File /nvme/0/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/cudf/core/column_accessor.py:155, in ColumnAccessor.__getitem__(self, key)
    154 def __getitem__(self, key: Any) -> ColumnBase:
--> 155     return self._data[key]

KeyError: 'id'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
Cell In[31], line 8
      5 pdf = gdf.to_pandas()
      7 print(pdf.groupby("id").x.value_counts().head())
----> 8 print(gdf.groupby("id").x.value_counts())

File /nvme/0/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/cudf/core/groupby/groupby.py:2598, in GroupBy.value_counts(self, subset, normalize, sort, ascending, dropna)
   2591     raise ValueError(
   2592         f"Keys {set(subset) & set(groupings)} in subset "
   2593         "cannot be in the groupby column keys."
   2594     )
   2596 df["__placeholder"] = 1
   2597 result = (
-> 2598     df.groupby(groupings + list(subset), dropna=dropna)[
   2599         "__placeholder"
   2600     ]
   2601     .count()
   2602     .sort_index()
   2603     .astype(np.int64)
   2604 )
   2606 if normalize:
   2607     levels = list(range(len(groupings), result.index.nlevels))

File /nvme/0/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/nvtx/nvtx.py:116, in annotate.__call__.<locals>.inner(*args, **kwargs)
    113 @wraps(func)
    114 def inner(*args, **kwargs):
    115     libnvtx_push_range(self.attributes, self.domain.handle)
--> 116     result = func(*args, **kwargs)
    117     libnvtx_pop_range(self.domain.handle)
    118     return result

File /nvme/0/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/cudf/core/series.py:3426, in Series.groupby(self, by, axis, level, as_index, sort, group_keys, squeeze, observed, dropna)
   3400 @_cudf_nvtx_annotate
   3401 @docutils.doc_apply(
   3402     groupby_doc_template.format(
   (...)
   3424     dropna=True,
   3425 ):
-> 3426     return super().groupby(
   3427         by,
   3428         axis,
   3429         level,
   3430         as_index,
   3431         sort,
   3432         group_keys,
   3433         squeeze,
   3434         observed,
   3435         dropna,
   3436     )

File /nvme/0/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/nvtx/nvtx.py:116, in annotate.__call__.<locals>.inner(*args, **kwargs)
    113 @wraps(func)
    114 def inner(*args, **kwargs):
    115     libnvtx_push_range(self.attributes, self.domain.handle)
--> 116     result = func(*args, **kwargs)
    117     libnvtx_pop_range(self.domain.handle)
    118     return result

File /nvme/0/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/cudf/core/indexed_frame.py:5337, in IndexedFrame.groupby(self, by, axis, level, as_index, sort, group_keys, squeeze, observed, dropna)
   5331 if group_keys is None:
   5332     group_keys = False
   5334 return (
   5335     self.__class__._resampler(self, by=by)
   5336     if isinstance(by, cudf.Grouper) and by.freq
-> 5337     else self.__class__._groupby(
   5338         self,
   5339         by=by,
   5340         level=level,
   5341         as_index=as_index,
   5342         dropna=dropna,
   5343         sort=sort,
   5344         group_keys=group_keys,
   5345     )
   5346 )

File /nvme/0/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/cudf/core/groupby/groupby.py:283, in GroupBy.__init__(self, obj, by, level, sort, as_index, dropna, group_keys)
    281     self.grouping = self._by
    282 else:
--> 283     self.grouping = _Grouping(obj, self._by, level)

File /nvme/0/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/cudf/core/groupby/groupby.py:2751, in _Grouping.__init__(self, obj, by, level)
   2748 # Need to keep track of named key columns
   2749 # to support `as_index=False` correctly
   2750 self._named_columns = []
-> 2751 self._handle_by_or_level(by, level)
   2753 if len(obj) and not len(self._key_columns):
   2754     raise ValueError("No group keys passed")

File /nvme/0/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/cudf/core/groupby/groupby.py:2785, in _Grouping._handle_by_or_level(self, by, level)
   2783     self._handle_label(by)
   2784 except (KeyError, TypeError):
-> 2785     self._handle_misc(by)

File /nvme/0/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/cudf/core/groupby/groupby.py:2868, in _Grouping._handle_misc(self, by)
   2866 by = cudf.core.column.as_column(by)
   2867 if len(by) != len(self._obj):
-> 2868     raise ValueError("Grouper and object must have same length")
   2869 self._key_columns.append(by)
   2870 self.names.append(None)

ValueError: Grouper and object must have same length
print(gdf.groupby("id").value_counts()) # succeeds
# print(gdf.groupby("id")[["x"]].value_counts()) # same error as above
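
Until the Series path works, a possible workaround is to group on both columns and take the group sizes, which is roughly what the traceback shows value_counts doing internally (a groupby over the keys plus the subset, followed by a count). The sketch below only checks the equivalence in pandas on a small made-up frame. The cuDF counterpart would presumably be gdf.groupby(["id", "x"]).size(), but that is an assumption and was not run here.

```python
import pandas as pd

# Small hypothetical frame standing in for the random data in the report.
pdf = pd.DataFrame({"id": [1, 1, 2, 2, 2], "x": [10, 10, 20, 30, 30]})

# The call that fails on the cuDF side, run here in pandas.
# value_counts orders by count within each group, so sort by index to compare.
expected = pdf.groupby("id").x.value_counts().sort_index()

# Workaround: group on both columns and count the rows in each group.
workaround = pdf.groupby(["id", "x"]).size()

print(expected)
print(workaround)
```

The two results carry the same index and counts, but the ordering and the name of the returned Series differ from value_counts, so this is not a drop-in replacement.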

mroeschke commented 6 months ago

It appears that groupby.value_counts is only properly implemented for DataFrameGroupby. In pandas, value_counts has a different signature depending on whether the grouped object wraps a Series or a DataFrame: the DataFrame version takes a subset argument, while the Series version takes bins instead.

# DataFrameGroupby
    def value_counts(
        self,
        subset: Sequence[Hashable] | None = None,
        normalize: bool = False,
        sort: bool = True,
        ascending: bool = False,
        dropna: bool = True,
    ) -> DataFrame | Series:
# SeriesGroupby
    def value_counts(
        self,
        normalize: bool = False,
        sort: bool = True,
        ascending: bool = False,
        bins=None,
        dropna: bool = True,
    ) -> Series | DataFrame:
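
The difference between the two signatures can be checked directly against an installed pandas. This is a minimal sketch on a made-up frame, assuming a pandas version recent enough to provide value_counts on a grouped DataFrame. It says nothing about how cuDF should structure its own implementation.

```python
import inspect

import pandas as pd

# Hypothetical data; only the types of the grouped objects matter here.
pdf = pd.DataFrame({"id": [1, 1, 2, 2], "x": [10, 10, 20, 30]})

df_grouped = pdf.groupby("id")    # grouped DataFrame
sr_grouped = pdf.groupby("id").x  # selecting one column gives a grouped Series

# Parameter names of each bound value_counts method (self is excluded).
df_params = list(inspect.signature(df_grouped.value_counts).parameters)
sr_params = list(inspect.signature(sr_grouped.value_counts).parameters)

print(type(df_grouped).__name__, df_params)
print(type(sr_grouped).__name__, sr_params)
```

Selecting a column changes the type of the grouped object, so a single value_counts written against the DataFrame signature will not cover the Series case.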