rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0
8.34k stars 888 forks source link

[FEA] Support Polars `top_k` expression (group_by) #16222

Open beckernick opened 3 months ago

beckernick commented 3 months ago

The top_k expression can be used as a reduction on primitive types, but is often used alongside group_by and will return a nested type.

import polars as pl
from functools import partial
from cudf_polars.callback import execute_with_cudf

use_cudf = partial(execute_with_cudf, raise_on_fail=True) # for testing

df = pl.LazyFrame(
    {
        'a': [1,1,2,2,3,3],
        'b':[1,2,3,4,5,6],
    }
)

print(df.group_by("a").agg(pl.col("b").top_k(1)).collect())
print(df.group_by("a").agg(pl.col("b").top_k(1)).collect(post_opt_callback=use_cudf))
shape: (3, 2)
┌─────┬───────────┐
│ a   ┆ b         │
│ --- ┆ ---       │
│ i64 ┆ list[i64] │
╞═════╪═══════════╡
│ 1   ┆ [2]       │
│ 3   ┆ [6]       │
│ 2   ┆ [4]       │
└─────┴───────────┘
---------------------------------------------------------------------------
ComputeError                              Traceback (most recent call last)
Cell In[31], line 15
      7 df = pl.LazyFrame(
      8     {
      9         'a': [1,1,2,2,3,3],
     10         'b':[1,2,3,4,5,6],
     11     }
     12 )
     14 print(df.group_by("a").agg(pl.col("b").top_k(1)).collect())
---> 15 print(df.group_by("a").agg(pl.col("b").top_k(1)).collect(post_opt_callback=use_cudf))

File [/raid/nicholasb/miniconda3/envs/all_cuda-122_arch-x86_64/lib/python3.11/site-packages/polars/lazyframe/frame.py:1942](http://10.117.23.184:8882/lab/tree/raid/nicholasb/raid/nicholasb/miniconda3/envs/all_cuda-122_arch-x86_64/lib/python3.11/site-packages/polars/lazyframe/frame.py#line=1941), in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, no_optimization, streaming, background, _eager, **_kwargs)
   1939 # Only for testing purposes atm.
   1940 callback = _kwargs.get("post_opt_callback")
-> 1942 return wrap_df(ldf.collect(callback))

ComputeError: 'cuda' conversion failed: NotImplementedError: Unary function name='top_k'
wence- commented 3 months ago

WIP implementation in libcudf here: https://github.com/rapidsai/cudf/pull/15272