rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[FEA] Support Polars pearson correlation expression #16220

Open beckernick opened 2 months ago

beckernick commented 2 months ago

We should support Pearson correlation in the Polars executor for both columns and grouped objects.

import polars as pl
from functools import partial
from cudf_polars.callback import execute_with_cudf

use_cudf = partial(execute_with_cudf, raise_on_fail=True) # for testing

df = pl.LazyFrame({'a': [1, 1, 2, 2], 'b': [1, 2, 3, 4], 'c': [1, 2, -3, -4]})

print(df.select(pl.corr(pl.col('b'), pl.col('c'))).collect())
print(df.select(pl.corr(pl.col('b'), pl.col('c'))).collect(post_opt_callback=use_cudf))

shape: (1, 1)
┌───────────┐
│ b         │
│ ---       │
│ f64       │
╞═══════════╡
│ -0.877058 │
└───────────┘
---------------------------------------------------------------------------
ComputeError                              Traceback (most recent call last)
Cell In[21], line 10
      7 df = pl.LazyFrame({'a': [1, 1, 2, 2], 'b': [1, 2, 3, 4], 'c': [1, 2, -3, -4]})
      9 print(df.select(pl.corr(pl.col('b'), pl.col('c'))).collect())
---> 10 print(df.select(pl.corr(pl.col('b'), pl.col('c'))).collect(post_opt_callback=use_cudf))

File /raid/nicholasb/miniconda3/envs/all_cuda-122_arch-x86_64/lib/python3.11/site-packages/polars/lazyframe/frame.py:1942, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, no_optimization, streaming, background, _eager, **_kwargs)
   1939 # Only for testing purposes atm.
   1940 callback = _kwargs.get("post_opt_callback")
-> 1942 return wrap_df(ldf.collect(callback))

ComputeError: 'cuda' conversion failed: NotImplementedError: corr
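
For reference, the expected value can be checked without Polars or cudf_polars. The following is a minimal sketch in plain Python, assuming the standard Pearson formula (covariance divided by the product of standard deviations). `pearson` and `grouped_pearson` are hypothetical helpers written for illustration, not part of either library, and the grouped variant only shows what a per-group result over column 'a' would look like mathematically.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    # The 1/n (or 1/(n-1)) factors cancel, so sums of products are enough.
    cov = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    return cov / sqrt(var_x * var_y)


def grouped_pearson(keys, xs, ys):
    """Apply pearson() separately to the rows sharing each key (hypothetical helper)."""
    groups = defaultdict(lambda: ([], []))
    for k, x, y in zip(keys, xs, ys):
        groups[k][0].append(x)
        groups[k][1].append(y)
    return {k: pearson(gx, gy) for k, (gx, gy) in groups.items()}


# Same data as the LazyFrame in the reproducer above.
data = {'a': [1, 1, 2, 2], 'b': [1, 2, 3, 4], 'c': [1, 2, -3, -4]}

print(round(pearson(data['b'], data['c']), 6))            # -0.877058
print(grouped_pearson(data['a'], data['b'], data['c']))   # {1: 1.0, 2: -1.0}
```

The ungrouped value agrees with the CPU result printed above, so it can serve as a reference when testing a GPU implementation.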

lithomas1 commented 2 months ago

Part of this needs to be upstreamed to Polars, since the lowering here https://github.com/pola-rs/polars/blob/3629ea28dda72f5e08e0891fd591b11c92e3fe7c/py-polars/src/lazyframe/visitor/expr_nodes.rs#L1189-L1191 is not implemented.

beckernick commented 2 months ago

Makes sense. Would you be able to file an issue upstream?

lithomas1 commented 2 months ago

> Makes sense. Would you be able to file an issue upstream?

Since we're the only consumers of the Python version of the IR, it's probably best for someone on the cudf side to make the changes, so we can make sure the new IR works for cudf_polars.

So I think tracking it here is probably better.