rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[BUG] Data loss for extraction of year from far future date and datetime types #16196

Closed wence- closed 1 week ago

wence- commented 1 month ago

Describe the bug

libcudf's cudf::datetime::extract_year returns an INT16 column, which can lose information for years of large magnitude, positive or negative.

The date32 type is:

signed 32 bit number of days since the unix epoch

The timestamp types are (for resolutions milli-, micro-, and nano-seconds):

signed 64 bit number of RESOLUTION ticks since the unix epoch

The most positive year representable by the date32 type is (approximately) $1970 + (2^{31} - 1)/365 \approx 5885486 \gg 2^{15} - 1$.

Similarly, the most positive years representable by the timestamp64[ms] and timestamp64[us] types are approximately 292473178 and 294441 respectively, both of which are again larger than $2^{15} - 1$.
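The estimates above can be reproduced with a few lines of integer arithmetic. This is only a sketch of the back-of-the-envelope calculation: it assumes the same 365-day year as the formula above (no leap years), and the helper names are made up for illustration, not part of cudf.

```python
# Approximate the largest representable year for each type, using the
# same 365-day-year approximation as the estimate in the text.
SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365  # approximation: ignores leap years
INT16_MAX = 2**15 - 1


def max_year_from_days(max_days):
    """Largest year reachable by a day count since the unix epoch."""
    return 1970 + max_days // DAYS_PER_YEAR


def max_year_from_ticks(max_ticks, ticks_per_second):
    """Largest year reachable by a tick count since the unix epoch."""
    ticks_per_year = ticks_per_second * SECONDS_PER_DAY * DAYS_PER_YEAR
    return 1970 + max_ticks // ticks_per_year


approx_max_years = {
    "date32": max_year_from_days(2**31 - 1),
    "timestamp64[ms]": max_year_from_ticks(2**63 - 1, 10**3),
    "timestamp64[us]": max_year_from_ticks(2**63 - 1, 10**6),
}

for name, year in approx_max_years.items():
    print(f"{name}: ~{year}, fits in INT16: {year <= INT16_MAX}")
```

None of the three results fits in a signed 16-bit integer.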

Steps/Code to reproduce bug

import cudf

s = cudf.Series([2**63 - 1], dtype="datetime64[us]")

cudf_year = s.dt.year[0]

pandas_year = s.to_pandas().dt.year[0]

print(cudf_year) # 32103, incorrect
print(pandas_year) # 294247, correct, depending on how much the earth's rotation speed changes over the next few millennia
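The two numbers in the reproducer are consistent with a plain 16-bit wrap-around: 32103 is what remains of 294247 after reduction to a signed 16-bit value. The sketch below checks that arithmetic in pure Python. The helper `wrap_to_int16` is hypothetical, and the issue does not show where inside libcudf the narrowing happens, so this only illustrates the symptom.

```python
def wrap_to_int16(value):
    """Reinterpret an integer as a two's-complement signed 16-bit value."""
    return (value + 2**15) % 2**16 - 2**15


true_year = 294247  # the year pandas reports for 2**63 - 1 microseconds
wrapped_year = wrap_to_int16(true_year)
print(wrapped_year)  # 32103, the value cudf reports
```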

Expected behavior

We should produce the right answer. This might be doable by returning an INT32 column for year extraction.
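For reference, the expected year can be computed on the host with integers wider than 16 bits. The sketch below uses the well-known days-to-civil-date arithmetic for the proleptic Gregorian calendar, written in plain Python because the standard `datetime` module stops at year 9999. It is not libcudf's implementation, and the function names are hypothetical. It only shows that the correct value is reachable once the year is not confined to 16 bits.

```python
MICROS_PER_DAY = 86_400 * 10**6


def year_from_days(days_since_epoch):
    """Proleptic Gregorian year for a day count since 1970-01-01."""
    z = days_since_epoch + 719468           # shift epoch to 0000-03-01
    era = z // 146097                       # 400-year cycles (floor division)
    doe = z - era * 146097                  # day of era, in [0, 146096]
    yoe = (doe - doe // 1460 + doe // 36524 - doe // 146096) // 365
    year = yoe + era * 400                  # year, counted from 1 March
    doy = doe - (365 * yoe + yoe // 4 - yoe // 100)
    mp = (5 * doy + 2) // 153               # month index, with March == 0
    month = mp + 3 if mp < 10 else mp - 9
    return year + 1 if month <= 2 else year  # Jan/Feb belong to the next year


def year_from_micros(micros_since_epoch):
    """Year for a datetime64[us]-style tick count since the unix epoch."""
    return year_from_days(micros_since_epoch // MICROS_PER_DAY)


print(year_from_micros(2**63 - 1))  # 294247, matching pandas
```

The result exceeds the INT16 range but fits comfortably in INT32, which is why a wider output column is the natural candidate for a fix.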

wence- commented 1 month ago

This is a bit fiddly, since std::chrono specifies that the minimum and maximum representable years are $-(2^{15} - 1)$ and $2^{15} - 1$ respectively. Given that the manipulations rely on cuda::std::chrono, this may not be fixable.

davidwendt commented 3 weeks ago

libcudf is unlikely to support date/time functions beyond what std::chrono or cuda::std::chrono supports. Also, I believe leap seconds are supposed to account for the speed up of the earth's rotation.

wence- commented 3 weeks ago

I'm happy to wontfix this one.

[Aside: Yes, leap seconds do fix this, but we don't know about them until they happen, so the answer pandas returns is correct "now", but the answer obtained now might be wrong once that date rolls round :)]

vyasr commented 1 week ago

Closing based on the above discussion.