rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[BUG] Support string to datetime conversion in Polars engine #16174

Open beckernick opened 1 week ago

beckernick commented 1 week ago

It looks like our string-to-datetime conversion throws an error in the Polars engine. This is a fairly common step when cleaning datasets, so it would be nice to support it:


from functools import partial

import polars as pl
from cudf_polars.callback import execute_with_cudf

# raise_on_fail=True surfaces unsupported operations as errors.
use_cudf = partial(execute_with_cudf, raise_on_fail=True)

ldf = pl.DataFrame({
    "date": ['2015-09-11', '2017-02-08', '2015-08-01', '2019-03-16', '2015-05-15'],
    "val": [1, 2, 3, 4, 5]
}).lazy()

# Default CPU engine: the conversion succeeds.
print(ldf.with_columns(pl.col("date").str.to_datetime()).collect())
# cuDF callback: the conversion raises ComputeError.
print(ldf.with_columns(pl.col("date").str.to_datetime()).collect(post_opt_callback=use_cudf))
shape: (5, 2)
┌─────────────────────┬─────┐
│ date                ┆ val │
│ ---                 ┆ --- │
│ datetime[μs]        ┆ i64 │
╞═════════════════════╪═════╡
│ 2015-09-11 00:00:00 ┆ 1   │
│ 2017-02-08 00:00:00 ┆ 2   │
│ 2015-08-01 00:00:00 ┆ 3   │
│ 2019-03-16 00:00:00 ┆ 4   │
│ 2015-05-15 00:00:00 ┆ 5   │
└─────────────────────┴─────┘
---------------------------------------------------------------------------
ComputeError                              Traceback (most recent call last)
Cell In[141], line 14
     12 ldf.sink_parquet("test.parquet")
     13 print(ldf.with_columns(pl.col("date").str.to_datetime()).collect())
---> 14 print(ldf.with_columns(pl.col("date").str.to_datetime()).collect(post_opt_callback=use_cudf))

File /raid/nicholasb/miniconda3/envs/all_cuda-122_arch-x86_64/lib/python3.11/site-packages/polars/lazyframe/frame.py:1942, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, no_optimization, streaming, background, _eager, **_kwargs)
   1939 # Only for testing purposes atm.
   1940 callback = _kwargs.get("post_opt_callback")
-> 1942 return wrap_df(ldf.collect(callback))

ComputeError: 'cuda' conversion failed: NotImplementedError: String function StringFunction.Strptime
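
For reference, the values the GPU path would need to reproduce can be checked independently with Python's standard library. This is only a rough sketch: the `%Y-%m-%d` format and the helper name `parse_dates` are assumptions made here to match the sample data, not anything taken from Polars or cudf_polars, and it says nothing about how `StringFunction.Strptime` should be implemented.

```python
from datetime import datetime

# Same sample strings as in the reproducer above.
dates = ["2015-09-11", "2017-02-08", "2015-08-01", "2019-03-16", "2015-05-15"]

# Assumed format, chosen to match the sample data (not stated in the report).
ASSUMED_FORMAT = "%Y-%m-%d"


def parse_dates(values, fmt=ASSUMED_FORMAT):
    """Parse date strings into naive datetimes at midnight (hypothetical helper)."""
    return [datetime.strptime(value, fmt) for value in values]


parsed = parse_dates(dates)
for value in parsed:
    print(value)
```

Each printed value should line up with the `date` column in the CPU output above, which gives a simple baseline to compare against once the conversion is supported.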

beckernick commented 1 week ago

Ah, the MRE does fail. I had a typo. Editing the issue to make it clear.