pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Panic when constructing Series with dtype `Duration('ms')` with large `timedelta` objects #16042

Closed stinodego closed 2 weeks ago

stinodego commented 2 weeks ago

Reproducible example

from datetime import timedelta
import polars as pl

v = timedelta.max
s = pl.Series([v], dtype=pl.Duration("ms"))
print(s)

Log output

thread '<unnamed>' panicked at py-polars/src/conversion/any_value.rs:211:41:
called `Result::unwrap()` on an `Err` value: PyErr { type: <class 'OverflowError'>, value: OverflowError('Python int too large to convert to C long'), traceback: None }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/home/stijn/code/polars/py-polars/repro.py", line 7, in <module>
    s = pl.Series([v], dtype=pl.Duration("ms"))
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stijn/code/polars/py-polars/polars/series/series.py", line 315, in __init__
    self._s = sequence_to_pyseries(
              ^^^^^^^^^^^^^^^^^^^^^
  File "/home/stijn/code/polars/py-polars/polars/_utils/construction/series.py", line 194, in sequence_to_pyseries
    py_series = PySeries.new_from_any_values(name, values, strict)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: PyErr { type: <class 'OverflowError'>, value: OverflowError('Python int too large to convert to C long'), traceback: None }

Issue description

The issue is that the AnyValue conversion py_object_to_any_value does not have the dtype information. It first tries to parse the timedelta as a microseconds value, which overflows for timedelta.max even though the same value fits when expressed in milliseconds.
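The overflow can be checked with plain integer arithmetic. The sketch below assumes the duration is stored as a signed 64-bit integer; the helper name `total_units` is hypothetical and not part of Polars.

```python
from datetime import timedelta

I64_MAX = 2**63 - 1  # largest value a signed 64-bit integer can hold


def total_units(td: timedelta, units_per_second: int) -> int:
    """Exact integer count of time units in td (hypothetical helper)."""
    # Work in integer microseconds to avoid float rounding.
    total_us = (td.days * 86_400 + td.seconds) * 1_000_000 + td.microseconds
    return total_us * units_per_second // 1_000_000


v = timedelta.max
as_us = total_units(v, 1_000_000)  # microseconds
as_ms = total_units(v, 1_000)      # milliseconds

print(as_us > I64_MAX)  # True: does not fit as microseconds
print(as_ms > I64_MAX)  # False: fits as milliseconds
```

This matches the report: the microsecond representation is what overflows, not the requested millisecond one.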

The fix is to create a new conversion util py_object_and_dtype_to_any_value, which takes a data type in addition to the object. Then we can parse the value with the correct time unit. It would also allow skipping type inference so there would be a minor performance benefit.
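As a rough illustration of the idea, here is a pure-Python sketch of a unit-aware conversion. It is not the Rust implementation; the function name `timedelta_to_any_value`, the unit table, and the 64-bit range check are assumptions made for this example.

```python
from datetime import timedelta

I64_MIN, I64_MAX = -(2**63), 2**63 - 1
UNITS_PER_SECOND = {"ns": 1_000_000_000, "us": 1_000_000, "ms": 1_000}


def timedelta_to_any_value(td: timedelta, time_unit: str) -> int:
    """Hypothetical dtype-aware conversion: scale straight to the requested unit."""
    total_us = (td.days * 86_400 + td.seconds) * 1_000_000 + td.microseconds
    value = total_us * UNITS_PER_SECOND[time_unit] // 1_000_000
    # Report an out-of-range value as an ordinary error instead of panicking.
    if not I64_MIN <= value <= I64_MAX:
        raise OverflowError(f"{td!r} does not fit in Duration({time_unit!r})")
    return value


ms_value = timedelta_to_any_value(timedelta.max, "ms")  # succeeds

try:
    timedelta_to_any_value(timedelta.max, "us")
    us_overflowed = False
except OverflowError:
    us_overflowed = True  # microseconds genuinely cannot hold this value
```

Because the time unit is known up front, the value is never routed through microseconds, and a genuinely unrepresentable value surfaces as a catchable exception.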

To show that the value is representable as a millisecond Duration, converting it to an integer first and then casting works fine:

from datetime import timedelta
import polars as pl
from polars._utils.convert import timedelta_to_int

v = timedelta.max
v_int = timedelta_to_int(v, "ms")
s = pl.Series([v_int]).cast(pl.Duration("ms"))
print(s)

shape: (1,)
Series: '' [duration[ms]]
[
        999999999d 23h 59m 59s 999ms
]

Expected behavior

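Constructing the Series should succeed and produce the same result as the cast-based example above, rather than panicking.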

Installed versions

main