rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[FEA] Support Polars `strip_chars` expression #16225

Closed: beckernick closed this issue 1 month ago

beckernick commented 3 months ago

`str.strip_chars` (`str.strip` in pandas) is a common operation used during data cleaning, but it currently fails when executed through cudf-polars:

import polars as pl
from functools import partial
from cudf_polars.callback import execute_with_cudf

use_cudf = partial(execute_with_cudf, raise_on_fail=True) # for testing

df = pl.LazyFrame({"foo": [" hello", "\nworld"]})

print(df.with_columns(foo_stripped=pl.col("foo").str.strip_chars()).collect())
print(df.with_columns(foo_stripped=pl.col("foo").str.strip_chars()).collect(post_opt_callback=use_cudf))
shape: (2, 2)
┌────────┬──────────────┐
│ foo    ┆ foo_stripped │
│ ---    ┆ ---          │
│ str    ┆ str          │
╞════════╪══════════════╡
│  hello ┆ hello        │
│        ┆ world        │
│ world  ┆              │
└────────┴──────────────┘
---------------------------------------------------------------------------
ComputeError                              Traceback (most recent call last)
Cell In[38], line 10
      7 df = pl.LazyFrame({"foo": [" hello", "\nworld"]})
      9 print(df.with_columns(foo_stripped=pl.col("foo").str.strip_chars()).collect())
---> 10 print(df.with_columns(foo_stripped=pl.col("foo").str.strip_chars()).collect(post_opt_callback=use_cudf))

File /raid/nicholasb/miniconda3/envs/all_cuda-122_arch-x86_64/lib/python3.11/site-packages/polars/lazyframe/frame.py:1942, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, no_optimization, streaming, background, _eager, **_kwargs)
   1939 # Only for testing purposes atm.
   1940 callback = _kwargs.get("post_opt_callback")
-> 1942 return wrap_df(ldf.collect(callback))

ComputeError: 'cuda' conversion failed: NotImplementedError: String function StringFunction.StripChars
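
For reference, here is a rough sketch of the result the GPU path would need to reproduce. It assumes `strip_chars` behaves like Python's built-in `str.strip`, which is consistent with the CPU output above (the exact set of characters treated as whitespace may differ). The `strip_chars` helper below is hypothetical and uses only the standard library; it is not the polars or cudf implementation.

```python
def strip_chars(values, characters=None):
    """Hypothetical helper: strip leading/trailing characters from each string.

    With characters=None, leading and trailing whitespace is removed,
    following Python's str.strip. Otherwise `characters` is treated as a
    set of characters to remove from both ends. None entries (nulls) are
    passed through unchanged.
    """
    return [None if v is None else v.strip(characters) for v in values]


# Same input column as in the reproducer above.
foo = [" hello", "\nworld"]
foo_stripped = strip_chars(foo)
print(foo_stripped)
```

Under that assumption, both the leading space and the leading newline are removed, matching the `foo_stripped` column that the CPU engine prints.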

wence- commented 1 month ago

Done in #16504