pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Polars assumes microseconds instead of reading numpy timedelta units #16029

Open hixan opened 2 weeks ago

hixan commented 2 weeks ago

Reproducible example

import numpy as np
import polars as pl
print(pl.DataFrame([[np.timedelta64(1, 'ns')]]))
>>> shape: (1, 1)
┌──────────────┐
│ column_0     │
│ ---          │
│ duration[μs] │
╞══════════════╡
│ 1µs          │
└──────────────┘

Log output

No response

Issue description

Polars does not seem to read the time unit attached to a numpy timedelta64 value and instead assumes microseconds, so a value of 1 ns is shown as 1 µs.

Expected behavior

The returned column should have the same time unit as the numpy object (in this case, ns).
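
For illustration, the unit is recoverable from the numpy scalar itself. The sketch below uses only numpy; `timedelta_unit` is a hypothetical helper name, and this is not a claim about how Polars reads the value internally.

```python
import numpy as np


def timedelta_unit(value: np.timedelta64) -> str:
    """Return the time unit string (e.g. 'ns', 'us', 'ms') of a numpy timedelta64 scalar.

    Hypothetical helper for illustration only.
    """
    # np.datetime_data returns a (unit, count) tuple for datetime64/timedelta64 dtypes.
    unit, _count = np.datetime_data(value.dtype)
    return unit


ns_unit = timedelta_unit(np.timedelta64(1, "ns"))
ms_unit = timedelta_unit(np.timedelta64(1, "ms"))
```

Here `ns_unit` is `"ns"` and `ms_unit` is `"ms"`, which is the information the expected behavior relies on.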

Installed versions

```
--------Version info---------
Polars:               0.20.23
Index type:           UInt32
Platform:             Linux-5.14.0-284.18.1.el9_2.x86_64-x86_64-with-glibc2.28
Python:               3.11.8 (main, Feb 16 2024, 19:42:16) [GCC 8.5.0 20210514 (Red Hat 8.5.0-20)]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:          3.0.0
connectorx:
deltalake:
fastexcel:
fsspec:               2024.3.1
gevent:
hvplot:
matplotlib:           3.8.4
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:             3.1.1
pandas:               1.5.3
pyarrow:              11.0.0
pydantic:             1.10.15
pyiceberg:
pyxlsb:
sqlalchemy:           1.4.52
xlsx2csv:
xlsxwriter:
```

datenzauberai commented 1 day ago

Handling of time units and proper casting is missing when the numpy values are inside a Python list instead of a numpy array:

pl.DataFrame({
    # ok
    "np_array_ns": np.array([np.timedelta64(1000000000, "ns"), np.timedelta64(1000000000, "ns")]),
    # ok
    "np_array_us": np.array([np.timedelta64(1000000, "us"),    np.timedelta64(1000000, "us")]),
    # ok
    "np_array_ms": np.array([np.timedelta64(1000, "ms"),    np.timedelta64(1000, "ms")]),
    # ignores time unit
    "py_list_ns": [np.timedelta64(1000000000, "ns"), np.timedelta64(1000000000, "ns")],
    # TypeError
    "py_list_us": [np.timedelta64(1000000, "us"),    np.timedelta64(1000000, "us")],
    # TypeError
    "py_list_ms": [np.timedelta64(1000, "ms"),    np.timedelta64(1000, "ms")],
})
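
Since the numpy-array columns above keep their unit, one possible workaround (a sketch, not a fix in Polars itself) is to convert a Python list of `np.timedelta64` values into a numpy array with an explicit unit before building the DataFrame. The helper name `to_timedelta_array` and the default target unit `"ns"` are assumptions made for this example; the snippet is numpy-only and does not call Polars.

```python
import numpy as np


def to_timedelta_array(values, unit="ns"):
    """Convert a list of np.timedelta64 scalars to an array with an explicit unit.

    Hypothetical helper for illustration only.
    """
    # np.array infers a timedelta64 dtype from the scalars;
    # astype then rescales the values to the requested unit.
    return np.array(values).astype(f"timedelta64[{unit}]")


py_list_ms = [np.timedelta64(1000, "ms"), np.timedelta64(1000, "ms")]
arr = to_timedelta_array(py_list_ms)
# arr has dtype timedelta64[ns]; each element represents one second.
```

The resulting array could then be used as a column value in place of the plain list, for example `pl.DataFrame({"py_list_ms": arr})`, on the assumption that the numpy-array path behaves as reported in the "ok" cases above.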