Closed: galipremsagar closed this issue 2 months ago.
A possibly related PR that previously worked on a similar issue: https://github.com/rapidsai/cudf/pull/11854
This seems to be a problem where libcudf is not writing datetime64[s] or timedelta64[s] correctly. My testing shows that libcudf is also not roundtripping these types faithfully:
import pyarrow as pa
import pyarrow.parquet  # pa.parquet is not available unless the submodule is imported
import cudf

for dtype in [
    'timedelta64[s]',
    'timedelta64[ms]',
    'timedelta64[us]',
    'timedelta64[ns]',
    'datetime64[s]',
    'datetime64[ms]',
    'datetime64[us]',
    'datetime64[ns]',
]:
    df = cudf.DataFrame({"s": cudf.Series([1234, 3456, 32442], dtype=dtype)})
    df.to_parquet("a")
    df2 = cudf.read_parquet("a")
    df3 = pa.parquet.read_table("a")
    # written dtype, dtype as read back by cudf, type as read back by pyarrow
    print(df['s'].dtype, df2['s'].dtype, df3['s'].type)
output (written dtype, cudf read-back, pyarrow read-back):
timedelta64[s]   timedelta64[ms]  time32[ms]
timedelta64[ms]  timedelta64[ms]  time32[ms]
timedelta64[us]  timedelta64[us]  time64[us]
timedelta64[ns]  timedelta64[ns]  time64[ns]
datetime64[s]    datetime64[ms]   timestamp[ms]
datetime64[ms]   datetime64[ms]   timestamp[ms]
datetime64[us]   datetime64[us]   timestamp[us]
datetime64[ns]   datetime64[ns]   timestamp[ns]
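Side note on the time32/time64 results above: cudf's Arrow interop maps timedelta64 to Arrow's duration type, not a time type (expanded on in the investigation notes below). A minimal sketch of the mapping:

import cudf

s = cudf.Series([1234, 3456], dtype="timedelta64[s]")
# cudf represents timedelta64 as Arrow duration, not time32/time64
print(s.to_arrow().type)  # duration[s]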
Investigation Notes:
- SECONDS is not a valid TimeUnit in Parquet and is hence converted to milliseconds by both cudf and arrow.
- Tried adding SECONDS to our TimeUnit enum class. It round-trips correctly with cudf but produces an error when read with pyarrow's parquet reader (invalid unit).
- timedelta actually corresponds to Arrow's duration type rather than its time type, as seen with cudf's to_arrow and from_arrow functions. However, it is not yet possible to convert between timedelta64 and duration using only the Parquet spec: Arrow writes duration as int64 in parquet instead of a TimeType, and recovers it by also writing a serialized arrow schema with the parquet files (see https://github.com/apache/arrow/issues/23117 and https://github.com/apache/arrow/pull/12449/, and the sketch after the update below).

Update: Support for duration[s]/timedelta64[s] types has been added via arrow:schema support in the cuDF PQ reader and writer, and these types now roundtrip faithfully.
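Here is a minimal sketch of the mechanism the update refers to, using pyarrow alone (it assumes a pyarrow recent enough to include the duration-in-Parquet support from the Arrow PR linked above). Arrow stores the column as plain int64 and recovers the logical duration type from the serialized schema embedded in the file's key/value metadata under the ARROW:SCHEMA key; cuDF's arrow:schema support uses the same metadata.

from io import BytesIO

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"d": pa.array([1234, 3456], type=pa.duration("s"))})
buf = BytesIO()
pq.write_table(table, buf)

# The serialized Arrow schema lives in the Parquet key/value metadata.
print(b"ARROW:SCHEMA" in pq.read_metadata(buf).metadata)  # True
# The logical duration[s] type is recovered from that schema on read;
# without it, the column would surface as a plain int64.
print(pq.read_table(buf)["d"].type)  # duration[s]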
For datetime64[s]/timestamp[s], both cuDF and Arrow convert [s] units to [ms] when writing Parquet, and they interop/roundtrip faithfully regardless of the unit. Though Arrow does not use arrow:schema to correct units, we could do so in cuDF if needed.
The question is: should we do that, or leave it be? The notion of unit in timestamp columns seems arbitrary (in both cuDF and Arrow), since the data are treated, displayed, and interpreted in terms of absolute values since the epoch (e.g. 1970-01-01 00:00:01.234) regardless of the unit. Example:
from io import BytesIO

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import cudf


def datetime_interop():
    for ts_type in [
        "timestamp[s]",
        "timestamp[ms]",
        "timestamp[us]",
    ]:
        times = pa.array([1234, 3456, 32442], type=ts_type)
        names = ["d"]
        pa_table = pa.Table.from_arrays([times], names=names)
        buf = BytesIO()
        pq.write_table(pa_table, buf)
        df2 = cudf.read_parquet(buf)
        df3 = pq.read_table(buf)
        # prints the same values (ignore units)
        print("Original table (pa)\n", pa_table)
        print("cudf read parquet\n", df2)
        print("pyarrow read parquet\n", df3)
        # convert all to pd.Timestamp without caring about column units
        value1 = pd.Timestamp(pa_table["d"][0].as_py())
        value2 = pd.Timestamp(df2["d"][0])
        value3 = pd.Timestamp(df3["d"][0].as_py())
        # check equality
        assert value1 == value2
        assert value1 == value3
        # redundant, but anyway
        assert value2 == value3


datetime_interop()
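As a companion to the value-level checks above, a dtype-level check (reusing the imports from the example) shows the unit being widened on the roundtrip while the encoded instant is unchanged, consistent with the note above that Arrow does not use arrow:schema to correct timestamp units:

t = pa.table({"d": pa.array([1234], type=pa.timestamp("s"))})
buf = BytesIO()
pq.write_table(t, buf)
col = pq.read_table(buf)["d"]
print(col.type)                      # timestamp[ms]: [s] is widened on write
print(pd.Timestamp(col[0].as_py()))  # 1970-01-01 00:20:34, the same instant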
Closing this issue for now, as units are effectively meaningless for timestamp types: the data are treated and displayed as absolute values. Please see the last comment for updates.
Describe the bug
Only when a column has the timedelta64[s] dtype, the parquet writer seems to write it as a timedelta64[ms] column, which leads both the cudf and pyarrow parquet readers to pick up the column type incorrectly.
Steps/Code to reproduce bug
See the reproduction script at the top of this thread.
Expected behavior
We write all other timedelta resolutions (ns, ms, us) correctly; the problem is seen only with s. We should be able to round-trip this type correctly once the writer writes it correctly.