rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[FEA] Nth element support in dask cudf #6170

Open Salonijain27 opened 4 years ago

Salonijain27 commented 4 years ago

Requesting `nth` support in dask_cudf after a groupby. Example:

from cudf import DataFrame
import dask_cudf
df = DataFrame()
df['key'] = [1, 1, 1, 1, 2, 2, 2]
df['val_0']= [13, 15, 20, 27, 60, 17, 90]
df['val_1'] = [5, 1, 4, 9, 2, 7, 8]
meta_format = DataFrame()
ddf = dask_cudf.from_cudf(df, npartitions=1)
groups = ddf.groupby(['key']).nth(0)

Currently it fails with:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/miniconda3/envs/branch15/lib/python3.8/site-packages/dask/dataframe/groupby.py in __getattr__(self, key)
   1749         try:
-> 1750             return self[key]
   1751         except KeyError as e:

~/miniconda3/envs/branch15/lib/python3.8/site-packages/dask/dataframe/groupby.py in __getitem__(self, key)
   1735         # error is raised from pandas
-> 1736         g._meta = g._meta[key]
   1737         return g

~/miniconda3/envs/branch15/lib/python3.8/site-packages/cudf/core/groupby/groupby.py in __getitem__(self, key)
    623     def __getitem__(self, key):
--> 624         return self.obj[key].groupby(self.grouping, dropna=self._dropna)
    625

~/miniconda3/envs/branch15/lib/python3.8/contextlib.py in inner(*args, **kwds)
     74             with self._recreate_cm():
---> 75                 return func(*args, **kwds)
     76         return inner

~/miniconda3/envs/branch15/lib/python3.8/site-packages/cudf/core/dataframe.py in __getitem__(self, arg)
    640         if is_scalar(arg) or isinstance(arg, tuple):
--> 641             return self._get_columns_by_label(arg, downcast=True)
    642

~/miniconda3/envs/branch15/lib/python3.8/site-packages/cudf/core/frame.py in _get_columns_by_label(self, labels, downcast)
    466         """
--> 467         new_data = self._data.select_by_label(labels)
    468         if downcast:

~/miniconda3/envs/branch15/lib/python3.8/site-packages/cudf/core/column_accessor.py in select_by_label(self, key)
    216                     return self._select_by_label_with_wildcard(key)
--> 217             return self._select_by_label_grouped(key)
    218

~/miniconda3/envs/branch15/lib/python3.8/site-packages/cudf/core/column_accessor.py in _select_by_label_grouped(self, key)
    264     def _select_by_label_grouped(self, key):
--> 265         result = self._grouped_data[key]
    266         if isinstance(result, cudf.core.column.ColumnBase):

KeyError: 'nth'

The above exception was the direct cause of the following exception:

AttributeError                            Traceback (most recent call last)
<ipython-input-2-9e488bd207ec> in <module>
      7 meta_format = DataFrame()
      8 ddf = dask_cudf.from_cudf(df, npartitions=1)
----> 9 groups = ddf.groupby(['key']).nth(1)

~/miniconda3/envs/branch15/lib/python3.8/site-packages/dask/dataframe/groupby.py in __getattr__(self, key)
   1750             return self[key]
   1751         except KeyError as e:
-> 1752             raise AttributeError(e) from e
   1753
   1754     @derived_from(pd.core.groupby.DataFrameGroupBy)

AttributeError: 'nth'

Once this functionality is implemented, it should return a dask_cudf DataFrame containing the nth row of each group. For `nth(0)` on the example data, that is the first row of each group:

     val_0  val_1
key
1       13      5
2       60      2
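
For reference, the expected result can be reproduced on the CPU with plain pandas. This is only a sketch of the requested semantics, not the dask_cudf implementation: `nth_per_group` is a hypothetical helper, it assumes rows keep their input order within each group, and it only handles non-negative `n`.

```python
import pandas as pd

# Same data as the cudf example above, built with pandas instead.
df = pd.DataFrame({
    "key": [1, 1, 1, 1, 2, 2, 2],
    "val_0": [13, 15, 20, 27, 60, 17, 90],
    "val_1": [5, 1, 4, 9, 2, 7, 8],
})

def nth_per_group(frame, by, n):
    """Hypothetical helper: keep the n-th row (0-based) of each group."""
    # cumcount() numbers the rows of each group 0, 1, 2, ... in input order.
    position = frame.groupby(by).cumcount()
    # Keep the rows at position n and index the result by the group key.
    return frame[position == n].set_index(by)

expected = nth_per_group(df, "key", 0)
print(expected)
```

Groups with fewer than `n + 1` rows are simply absent from the result.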

kkraus14 commented 4 years ago

Is the expectation here that the order of rows within a group matches the order of rows from the input DataFrame? I'm not sure if we're able to make that guarantee currently.
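
To illustrate why the ordering guarantee matters, here is a small pure-Python sketch (not cudf or dask code; `nth_in_input_order` is a hypothetical helper). The same rows produce a different `nth(0)` result when they arrive in a different order, so the result is only well defined if within-group order is preserved.

```python
from collections import defaultdict

def nth_in_input_order(rows, n):
    """Hypothetical helper: n-th value (0-based) per key, counting in input order."""
    seen = defaultdict(int)  # rows seen so far for each key
    picked = {}
    for key, value in rows:
        if seen[key] == n:
            picked[key] = value
        seen[key] += 1
    return picked

# (key, val_0) pairs from the example in the issue description.
rows = [(1, 13), (1, 15), (1, 20), (1, 27), (2, 60), (2, 17), (2, 90)]

in_order = nth_in_input_order(rows, 0)
reordered = nth_in_input_order(list(reversed(rows)), 0)

print(in_order)
print(reordered)
```

If a groupby implementation is free to reorder rows within a group, either of these results could come back for the same input.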

github-actions[bot] commented 3 years ago

This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.