pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Simple numerical calculation expressions on Int-columns can give incorrect results. #16019

Closed TobiasDummschat closed 1 week ago

TobiasDummschat commented 2 weeks ago

Reproducible example

import polars as pl

data = {"a": [1]}
expr = (pl.col("a") - 0.5) / 1 * 1
pl.DataFrame(data).select(expr)["a"][0]
# 0.0 instead of 0.5

Log output

No response

Issue description

In the example provided above, the simple calculation (1 - 0.5) / 1 * 1 produces the incorrect result 0.0 instead of 0.5. Judging from the output of LazyFrame.explain, the cause seems to be an unexpected .cast(Int32) applied right after the subtraction. In addition, the literals are typed inconsistently: the … / 1 uses an int, while the … * 1.0 uses a float.

pl.LazyFrame(data).select(expr).explain()
# ' SELECT [[([([(col("a").cast(Float64)) - (0.5)].cast(Int32)) / (1)]) * (1.0)]] FROM\n  DF ["a"]; PROJECT 1/1 COLUMNS; SELECTION: "None"'

Interestingly, very similar expressions give different results and don't seem to cast to int mid-expression.

pl.col("a") - 0.5  # 0.5, correct
(pl.col("a") - 0.5) / 1  # 0.5, correct
(pl.col("a") - 0.5) / 1 * 1  # 0.0, incorrect

In more complex calculations, this can lead to incorrect numbers that seem correct because they're still in the right ballpark. I caught this with (18 - 10.5) / 2 * 15 which should be 56.25, but was 52.5.
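The two numbers can be reproduced in plain Python. This is only a sketch of the arithmetic, not of what Polars does internally: it assumes that the unexpected Int32 cast truncates 7.5 to 7 and that the division afterwards is a true (floating-point) division, which is consistent with the reported 52.5.

```python
# Assumption: the mid-expression cast to Int32 truncates toward zero,
# and the division that follows is a true (floating-point) division.
expected = (18 - 10.5) / 2 * 15   # full float arithmetic

truncated = int(18 - 10.5)        # 7.5 is truncated to 7
observed = truncated / 2 * 15     # 7 / 2 = 3.5, then 3.5 * 15

print(expected)  # 56.25
print(observed)  # 52.5
```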

Replacing pl.col("a") with pl.col("a").cast(float) fixes the issue.

Expected behavior

I'd expect all of these calculations to give the same result I'd get when typing the calculation in for each row. I'd also expect all of them to behave the same way.

Installed versions

```
--------Version info---------
Polars: 0.20.23
Index type: UInt32
Platform: Linux-6.5.0-28-generic-x86_64-with-glibc2.35
Python: 3.11.6 (main, Jan 10 2024, 11:09:44) [GCC 11.4.0]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fastexcel:
fsspec:
gevent:
hvplot:
matplotlib:
nest_asyncio:
numpy:
openpyxl:
pandas:
pyarrow:
pydantic:
pyiceberg:
pyxlsb:
sqlalchemy:
xlsx2csv:
xlsxwriter:
```

NicolasMuellerQC commented 1 week ago

I think this is probably the same underlying problem as in #15951 and #15952

TobiasDummschat commented 1 week ago

This is fixed in 0.20.24