rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[FEA] Add support for `datetime32[D]` in cuDF python #16027

Open GregoryKimball opened 5 months ago

GregoryKimball commented 5 months ago

Is your feature request related to a problem? Please describe.
libcudf supports five timestamp types, but cuDF Python supports only four.

| libcudf type name | libcudf physical type | cuDF type | Arrow type |
| --- | --- | --- | --- |
| TIMESTAMP_DAYS | int32 | n/a | Date32Type |
| TIMESTAMP_SECONDS | int64 | datetime64[s] | TimeUnit::SECOND |
| TIMESTAMP_MILLISECONDS | int64 | datetime64[ms] | TimeUnit::MILLI |
| TIMESTAMP_MICROSECONDS | int64 | datetime64[us] | TimeUnit::MICRO |
| TIMESTAMP_NANOSECONDS | int64 | datetime64[ns] | TimeUnit::NANO |
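
To make the storage difference in the table concrete, here is a minimal standard-library sketch of how the same calendar date would be encoded as a day count (int32) versus a second count (int64). It assumes both values count from the Unix epoch, as Arrow's Date32 does. The helper names `to_timestamp_days` and `to_timestamp_seconds` and the sample date are illustrative only; they are not part of libcudf or cuDF.

```python
import struct
from datetime import date, datetime, timezone

# Assumption: day- and second-resolution values both count from the Unix epoch.
EPOCH = date(1970, 1, 1)


def to_timestamp_days(d: date) -> int:
    """Whole days since the epoch: the value a 32-bit day-resolution column stores."""
    return (d - EPOCH).days


def to_timestamp_seconds(d: date) -> int:
    """Seconds since the epoch at UTC midnight: the 64-bit second-resolution value."""
    midnight = datetime(d.year, d.month, d.day, tzinfo=timezone.utc)
    return int(midnight.timestamp())


ship_date = date(1995, 1, 1)  # illustrative value, not taken from the issue
days = to_timestamp_days(ship_date)
seconds = to_timestamp_seconds(ship_date)

# Pack with the physical widths listed in the table: int32 for days, int64 for seconds.
packed_days = struct.pack("<i", days)
packed_seconds = struct.pack("<q", seconds)

print(days, len(packed_days))
print(seconds, len(packed_seconds))
```

Beyond taking half the width, the day count carries no time-of-day component, so a consumer such as Velox can map it to a DATE rather than a TIMESTAMP.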

Describe the solution you'd like
Adjust cuDF Python type support to cover all of the available libcudf types.

Additional context
I was generating TPC-H files for Velox benchmarking, and cuDF Python cannot write files with the int32 timestamp-days type. As a result, Velox fails to execute scalar functions unless the queries are changed to cast from Velox TIMESTAMP to Velox DATE.

Error:
 terminate called after throwing an instance of 'facebook::velox::VeloxUserError'
  what():  Exception: VeloxUserError
Error Source: USER
Error Code: INVALID_ARGUMENT
Reason: Scalar function signature is not supported: between(TIMESTAMP, DATE, DATE). Supported signatures: (timestamp,timestamp,timestamp) -> boolean, (date,date,date) -> boolean, (real,real,real) -> boolean, (double,double,double) -> boolean, (bigint,bigint,bigint) -> boolean, (decimal(i1,i5),decimal(i1,i5),decimal(i1,i5)) -> boolean, (varchar,varchar,varchar) -> boolean, (integer,integer,integer) -> boolean, (smallint,smallint,smallint) -> boolean, (tinyint,tinyint,tinyint) -> boolean.
Retriable: False
Function: resolveScalarFunctionType
File: /nfs/repo/velox2/velox/parse/TypeResolver.cpp
Line: 99

mroeschke commented 5 months ago

Just noting that pandas does not support datetime32/64[D], and I don't think it plans to (because daily-resolution logic gets murky with time zone data and DST), but this should still be feasible for cuDF Python.
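
One way to see the DST concern is that a local calendar day is not always 24 hours long, so a "day" unit stops being a fixed duration once time zones are involved. The sketch below uses only the standard library and hard-codes the US Eastern offsets around the 2024-03-10 spring-forward transition (UTC-5 before, UTC-4 after) as an assumption, rather than consulting a time zone database. It illustrates the general issue and is not a description of pandas internals.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed offsets for US Eastern time on either side of the
# 2024-03-10 DST transition (EST = UTC-5, EDT = UTC-4).
EST = timezone(timedelta(hours=-5), "EST")
EDT = timezone(timedelta(hours=-4), "EDT")

# Local midnight at the start of the transition day and of the following day.
start_of_day = datetime(2024, 3, 10, 0, 0, tzinfo=EST)
start_of_next_day = datetime(2024, 3, 11, 0, 0, tzinfo=EDT)

# Subtracting aware datetimes with different tzinfo compares them in UTC,
# so this is the true elapsed time covered by that local calendar day.
local_day_length = start_of_next_day - start_of_day

print(local_day_length)
```

A time-zone-naive day count such as TIMESTAMP_DAYS avoids this ambiguity because it never refers to a local wall clock.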

vyasr commented 5 months ago

That suggests to me that we want this support in pylibcudf, but not in cuDF Python. That wouldn't be as convenient for the original use case that @GregoryKimball was interested in, though.