pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Evaluation of `polars.datetime` returns null when dealing with nanoseconds #16124

Open marenwestermann opened 1 month ago

marenwestermann commented 1 month ago

Reproducible example

>>> import polars as pl
>>> a = pl.datetime(2024, 1, 1, 2, 2, 2, 123456789)
>>> pl.select(a)
shape: (1, 1)
┌──────────────┐
│ datetime     │
│ ---          │
│ datetime[μs] │
╞══════════════╡
│ null         │
└──────────────┘
>>> b = pl.datetime(2024, 1, 1, 2, 2, 2, 123456789, time_unit='ns')
>>> pl.select(b)
shape: (1, 1)
┌──────────────┐
│ datetime     │
│ ---          │
│ datetime[ns] │
╞══════════════╡
│ null         │
└──────────────┘

Issue description

It is possible to include nanoseconds when creating an expression with polars.datetime. However, when the expression gets evaluated, the result is null (see examples above).

Expected behavior

A warning should be raised that polars.datetime cannot be evaluated if nanoseconds are included. Additionally, the option "ns" might need to be removed from the documentation of the parameter time_unit.

Installed versions

```
--------Version info---------
Polars:               0.20.25
Index type:           UInt32
Platform:             macOS-14.4.1-arm64-arm-64bit
Python:               3.12.3 (v3.12.3:f6650f9ad7, Apr 9 2024, 08:18:47) [Clang 13.0.0 (clang-1300.0.29.30)]

----Optional dependencies----
adbc_driver_manager:  0.11.0
cloudpickle:          3.0.0
connectorx:
deltalake:            0.17.3
fastexcel:            0.10.4
fsspec:               2023.12.2
gevent:               24.2.1
hvplot:               0.10.0
matplotlib:           3.8.4
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:             3.1.2
pandas:               2.2.2
pyarrow:              16.0.0
pydantic:             2.7.1
pyiceberg:            0.6.1
pyxlsb:               1.0.10
sqlalchemy:           2.0.30
torch:
xlsx2csv:             0.8.2
xlsxwriter:           3.2.0
```

marenwestermann commented 1 month ago

ping @MarcoGorelli

datenzauberai commented 1 month ago

The problem here is that time_unit determines the internal representation of the polars.datetime and not the unit of the seventh parameter (which is always microseconds).

ms = pl.datetime(1990, 12, 31, 10, 0, 59, 999999, time_unit="ms").alias("ms")
ns = pl.datetime(1990, 12, 31, 10, 0, 59, 999999, time_unit="ns").alias("ns")
pl.select(ms, ns)
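
To make the unit mismatch concrete: the value 123456789 from the original report is outside the microsecond range (0 to 999999), so it would have to be split before being passed as the seventh argument. A minimal sketch using only the standard library; `split_subsecond_ns` is a hypothetical helper for illustration, not a polars function:

```python
# Hypothetical helper (not part of polars): split a sub-second value given in
# nanoseconds into a whole-microsecond part and the leftover nanoseconds.
def split_subsecond_ns(nanoseconds: int) -> tuple[int, int]:
    if not 0 <= nanoseconds < 1_000_000_000:
        raise ValueError("nanoseconds must be in 0..999_999_999")
    # 1 microsecond = 1_000 nanoseconds
    return divmod(nanoseconds, 1_000)


microsecond, leftover_ns = split_subsecond_ns(123_456_789)
print(microsecond, leftover_ns)  # 123456 789
```

The first value fits the microsecond argument. The leftover nanoseconds would have to be added separately (for example as a duration), which is not shown here.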

However, I think this should not silently return null or overflow when microseconds > 999999. At the moment it does one or the other, and not consistently.

# overflows to 1991-01-01 00:00:00.200
pl.select(pl.datetime(1990, 12, 31, 23, 59, 59, 1200000, time_unit="ns").alias("ns"))
# returns null
pl.select(pl.datetime(1990, 12, 31, 23, 59, 58, 1200000, time_unit="ns").alias("ns"))

It should consistently do one of these: fail, return null, or overflow?!
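
For comparison, here is a small sketch of what the "overflow" and "fail" options look like with Python's standard library `datetime`. This only illustrates the two semantics. It is not how polars implements `pl.datetime`, and it makes no claim about which option polars should choose.

```python
from datetime import datetime, timedelta

# "Overflow" semantics: carry the excess microseconds into the next second.
base = datetime(1990, 12, 31, 23, 59, 59)
overflowed = base + timedelta(microseconds=1_200_000)
print(overflowed)  # 1991-01-01 00:00:00.200000

# "Fail" semantics: the standard library constructor rejects an
# out-of-range microsecond value with a ValueError.
try:
    datetime(1990, 12, 31, 23, 59, 59, 1_200_000)
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True
```

The overflowed value matches the 1991-01-01 00:00:00.200 result reported for the first call above.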