pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
28.89k stars 1.82k forks source link

Date arithmetic is overly permissive #12330

Open Wainberg opened 9 months ago

Wainberg commented 9 months ago

Checks

Reproducible example

>>> import datetime
>>> import polars as pl
>>> df = pl.DataFrame({
...     'Date': [datetime.date(2023, 1, 1)],
...     'Datetime': [datetime.datetime(2023, 1, 1, 0, 0, 0)],
...     'Duration': [datetime.timedelta(hours=1)],
...     'Time': [datetime.time(0, 0, 0)]})
>>> df.select(pl.col.Date + 2)
shape: (1, 1)
┌───────┐
│ Date  │
│ ---   │
│ i32   │
╞═══════╡
│ 19360 │
└───────┘
>>> df.select(2 * pl.col.Datetime)
shape: (1, 1)
┌──────────────────┐
│ literal          │
│ ---              │
│ i64              │
╞══════════════════╡
│ 3345062400000000 │
└──────────────────┘
>>> df.select(1 - pl.col.Duration)
shape: (1, 1)
┌─────────────┐
│ literal     │
│ ---         │
│ i64         │
╞═════════════╡
│ -3599999999 │
└─────────────┘
>>> df.select(6 / pl.col.Time)
shape: (1, 1)
┌─────────┐
│ literal │
│ ---     │
│ f64     │
╞═════════╡
│ inf     │
└─────────┘

Log output

No response

Issue description

All arithmetic involving dates/times and numerics is allowed, and interprets the dates as their underlying numeric representations.

Expected behavior

Only the following should be allowed (let me know if I'm missing any cases):

Date - Date -> Duration Date + Duration -> Date Date - Duration -> Date

Datetime - Datetime -> Duration Datetime + Duration -> Datetime Datetime - Duration -> Datetime

Duration + Date -> Date Duration + Datetime -> Datetime Duration + Duration -> Duration Duration - Duration -> Duration Duration numeric -> Duration numeric Duration -> Duration Duration / numeric -> Duration Duration // numeric -> Duration

Arithmetic on a date should never return a number.

Installed versions

``` --------Version info--------- Polars: 0.19.12 Index type: UInt32 Platform: Linux-4.4.0-22621-Microsoft-x86_64-with-glibc2.35 Python: 3.11.6 | packaged by conda-forge | (main, Oct 3 2023, 10:40:35) [GCC 12.3.0] ----Optional dependencies---- adbc_driver_sqlite: cloudpickle: connectorx: deltalake: fsspec: gevent: matplotlib: 3.8.1 numpy: 1.26.0 openpyxl: 3.1.2 pandas: 2.1.2 pyarrow: 13.0.0 pydantic: pyiceberg: pyxlsb: sqlalchemy: xlsx2csv: 0.8.1 xlsxwriter: 3.1.7 ```
MarcoGorelli commented 9 months ago

yup, thanks for the report

Wainberg commented 9 months ago

I also noticed that the types of these don't match - shouldn't they both be in μs?

>>> df.select(pl.col.Date - pl.col.Date)
shape: (1, 1)
┌──────────────┐
│ Date         │
│ ---          │
│ duration[ms] │
╞══════════════╡
│ 0ms          │
└──────────────┘
>>> df.select(pl.col.Datetime - pl.col.Datetime)
shape: (1, 1)
┌──────────────┐
│ Datetime     │
│ ---          │
│ duration[μs] │
╞══════════════╡
│ 0µs          │
└──────────────┘
deanm0000 commented 9 months ago

Should Date + Duration raise for Durations that aren't multiples of days or should it just give a datetime response?

ie Date + Duration(hours=6)

Wainberg commented 9 months ago

I'd permit that to be consistent with Python:

>>> datetime.date(year=2023, month=6, day=3) + datetime.timedelta(hours=6)
datetime.date(2023, 6, 3)
Julian-J-S commented 9 months ago

Duration / numeric -> Duration Duration // numeric -> Duration

if Duration / numeric -> Duration and Duration is always backed by intergers then Duration // numeric -> Duration (floor division) should be identical, right? Or should the first round somehow?

Wainberg commented 9 months ago

Another way in which date arithmetic is (arguably) overly permissive is that dates can compare to datetimes, but they're only equal at midnight:

>>> df = pl.DataFrame({
...     'Date': [datetime.date(2023, 1, 1)],
...     'Datetime': [datetime.datetime(2023, 1, 1, 0, 0, 0)],
...     'Duration': [datetime.timedelta(hours=1)]})
>>> df.select(pl.col.Date == pl.col.Datetime)
shape: (1, 1)
┌──────┐
│ Date │
│ ---  │
│ bool │
╞══════╡
│ true │
└──────┘
>>> df.select(pl.col.Date == pl.col.Datetime + pl.col.Duration)
shape: (1, 1)
┌───────┐
│ Date  │
│ ---   │
│ bool  │
╞═══════╡
│ false │
└───────┘
>>> df.select(pl.col.Date == pl.col.Datetime + timedelta(hours=0))
shape: (1, 1)
┌──────┐
│ Date │
│ ---  │
│ bool │
╞══════╡
│ true │
└──────┘
>>> df.select(pl.col.Date == pl.col.Datetime + timedelta(hours=1))
shape: (1, 1)
┌───────┐
│ Date  │
│ ---   │
│ bool  │
╞═══════╡
│ false │
└───────┘
deanm0000 commented 9 months ago

I wouldn't want to be unable to check if a datetime is before/after a date.

Wainberg commented 9 months ago

Python doesn't allow any operations between datetimes and dates, except for comparison which always returns false:

>>> import datetime
>>> datetime.date(2023, 1, 1) == datetime.datetime(2023, 1, 1, 0, 0)
False
Wainberg commented 7 months ago

Also an issue with DataFrame and Series operators:

>>> pl.Series([datetime.date(2023, 1, 1)]) == pl.Series([19358])
shape: (1,)
Series: '' [bool]
[
        true
]
>>> pl.DataFrame([datetime.date(2023, 1, 1)]) == pl.DataFrame([19358])
shape: (1, 1)
┌──────────┐
│ column_0 │
│ ---      │
│ bool     │
╞══════════╡
│ true     │
└──────────┘
douglas-raillard-arm commented 1 month ago

The reproducer in the original report now properly raises on all 4 tests:

df.select(pl.col.Date + 2)
df.select(2 * pl.col.Datetime)
df.select(1 - pl.col.Duration)
df.select(6 / pl.col.Time)

Looks like the fix comes from there: https://github.com/pola-rs/polars/pull/16934