pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Incorrect results from expr skew and kurtosis for constant inputs with a particular "magic" value #18617

Open JackieJin1025 opened 2 months ago

JackieJin1025 commented 2 months ago

Reproducible example

import polars as pl
a = pl.DataFrame({"a": [1.0042855193121334] * 60})
for l in range(60):
    print(l, a.slice(0, l).select(pl.col('a').kurtosis()).item())

output

[image: screenshot of the printed output]

Log output

No log output was captured.

Issue description

I expect to get nan consistently for any size, but I get None when the size is 0 and -2 when the size is >= 22.
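One possible explanation for the -2 (an assumption, not confirmed in this thread): if the computed mean is off from the constant value by even one rounding step, every deviation equals the same tiny nonzero number d, so m4 / m2**2 = d**4 / (d**2)**2 = 1 and the Fisher kurtosis is 1 - 3 = -2. The sketch below is plain Python, not the Polars implementation; `fisher_kurtosis` and the injected one-ulp mean error are hypothetical and only illustrate the arithmetic.

```python
import math


def fisher_kurtosis(values, mean):
    """Biased Fisher kurtosis m4 / m2**2 - 3 around a caller-supplied mean.

    Hypothetical helper for illustration; the mean is passed in so that a
    rounding error in it can be simulated explicitly.
    """
    n = len(values)
    devs = [v - mean for v in values]
    m2 = sum(d ** 2 for d in devs) / n
    m4 = sum(d ** 4 for d in devs) / n
    if m2 == 0.0:
        # Zero variance: the ratio is 0/0, so report nan.
        return math.nan
    return m4 / (m2 * m2) - 3.0


x = 1.0042855193121334
values = [x] * 30

# With the exact mean, every deviation is 0.0 and the result is nan.
exact = fisher_kurtosis(values, x)

# Assumed scenario: the mean is off by one ulp (2**-52 for floats in [1, 2)).
perturbed = fisher_kurtosis(values, x + 2.0 ** -52)

print(exact)      # nan
print(perturbed)  # -2.0
```

Under this reading, the threshold at size 22 would simply be the first length at which the accumulated sum rounds away from an exact multiple of the value, which depends on the summation order used.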

Expected behavior

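import polars as pl
import scipy.stats as st  # assumed: `st` refers to scipy.stats

a = pl.DataFrame({"a": [1.0042855193121334] * 60})
for l in range(60):
    # Same slices as in the reproducible example, evaluated with SciPy
    print(l, st.kurtosis(a.slice(0, l).select(pl.col('a')).to_numpy().flatten()))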

This returns nan consistently.

Installed versions

```
--------Version info---------
Polars:              1.5.0
Index type:          UInt32
Platform:            Linux-3.10.107-1-tlinux2_kvm_guest-0056-x86_64-with-glibc2.28
Python:              3.9.9 (main, Apr 24 2023, 09:37:21) [GCC 10.2.0]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:         2.2.1
connectorx:
deltalake:
fastexcel:
fsspec:              2023.4.0
gevent:              22.10.2
great_tables:
hvplot:
matplotlib:          3.9.2
nest_asyncio:        1.5.6
numpy:               1.24.2
openpyxl:            3.1.2
pandas:              1.5.3
pyarrow:             15.0.2
pydantic:            1.10.18
pyiceberg:
sqlalchemy:          2.0.24
torch:               1.11.0+cu102
xlsx2csv:
xlsxwriter:
```

Related issue: https://github.com/pola-rs/polars/issues/15067

deanm0000 commented 2 months ago

Hmm, I guess the difference between Rust's f64::EPSILON = 2.220446049250313e-16 and NumPy's np.finfo(np.float64).resolution = 1e-15 is relevant here. I'm not sure whether the right fix would be to hard-code 1e-15 instead of using f64::EPSILON, scale f64::EPSILON by the sample size (somehow), check whether all inputs are nearly equal, or something else.
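To make the "check whether all inputs are nearly equal" option concrete, here is a minimal sketch in plain Python. It is not how Polars implements kurtosis; `is_nearly_constant` and `guarded_kurtosis` are hypothetical names, and using 1e-15 as the relative tolerance is just the value quoted above, taken as an assumption rather than a recommendation.

```python
import math
import sys

# Same value as Rust's f64::EPSILON: 2.220446049250313e-16
F64_EPSILON = sys.float_info.epsilon
# Value quoted above for np.finfo(np.float64).resolution
NUMPY_RESOLUTION = 1e-15


def is_nearly_constant(values, rel_tol):
    """Hypothetical check: every value is within rel_tol (relative) of the first."""
    if not values:
        return True
    ref = values[0]
    return all(math.isclose(v, ref, rel_tol=rel_tol, abs_tol=0.0) for v in values)


def guarded_kurtosis(values, rel_tol=NUMPY_RESOLUTION):
    """Biased Fisher kurtosis that returns nan for empty or nearly constant input."""
    n = len(values)
    if n == 0 or is_nearly_constant(values, rel_tol):
        return math.nan
    mean = math.fsum(values) / n
    m2 = math.fsum((v - mean) ** 2 for v in values) / n
    m4 = math.fsum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 * m2) - 3.0


constant = guarded_kurtosis([1.0042855193121334] * 60)
empty = guarded_kurtosis([])
varied = guarded_kurtosis([1.0, 2.0, 3.0, 4.0])
```

A pre-check like this costs an extra pass over the data. Comparing the computed variance against a tolerance scaled by the sample size and the magnitude of the mean would avoid that pass, but choosing that scale is exactly the open question raised above.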