pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Rolling_std for large numbers #9383

Open skswlsaks opened 1 year ago

skswlsaks commented 1 year ago

Polars version checks

Issue description

I recently tested `rolling_std` on a Polars Series.

import polars as pl 
a = pl.Series("a", [1574669000.00,1574669000.00,1976946000.00,2313781000.00,2313781000.00,2295767000.00,2295767000.00,2306270000.00,2306270000.00,2257469000.00,2219556000.00,2219556000.00,2169984000.00,2169984000.00,2820376000.00,2820376000.00,2820376000.00])
a.rolling_std(window_size=3, ddof=1)

This gives me the following result:

shape: (17,)
Series: 'a' [f64]
[
        null
        null
        2.3225e8
        3.7004e8
        1.9447e8
        1.0400e7
        1.0400e7
        6.0639e6
        6.0639e6
        2.8175e7
        4.3471e7
        2.1889e7
        2.8620e7
        2.8620e7
        3.7550e8
        3.7550e8
        45.254834
]

But the last three input values are identical, so the standard deviation of the final window should be exactly 0, not 45.254834.
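As a rough sanity check on that claim, the sketch below uses only the Python standard library, not Polars. It computes the reference value for the last window and shows that 45.254834 is what one rounding step in a sum of squares of this magnitude would produce. That the rolling kernel keeps a running sum of squares is an assumption here; the issue does not confirm it.

```python
import math
import statistics

# The last window of the series from the issue: three identical values.
window = [2820376000.00, 2820376000.00, 2820376000.00]

# Reference: the sample standard deviation (ddof=1) of a constant window is 0.
reference = statistics.stdev(window)

# Size of one rounding step (ULP) for a sum of squares of this magnitude.
sum_sq = sum(x * x for x in window)
ulp = math.ulp(sum_sq)

# ASSUMPTION: if a running sum of squares were off by a single ULP, the ddof=1
# variance would be off by ulp / (n - 1), and the std by its square root.
spurious_std = math.sqrt(ulp / (len(window) - 1))

print(reference)     # reference std for the constant window
print(ulp)           # spacing between adjacent floats near sum_sq
print(spurious_std)  # std that a one-ULP error would produce
```

If that assumption holds, the nonzero result is a floating-point cancellation artifact rather than a logic error in the windowing.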

Reproducible example

import polars as pl 
a = pl.Series("a", [1574669000.00,1574669000.00,1976946000.00,2313781000.00,2313781000.00,2295767000.00,2295767000.00,2306270000.00,2306270000.00,2257469000.00,2219556000.00,2219556000.00,2169984000.00,2169984000.00,2820376000.00,2820376000.00,2820376000.00])
a.rolling_std(window_size=3, ddof=1)

Expected behavior

shape: (17,)
Series: 'a' [f64]
[
        null
        null
        2.3225e8
        3.7004e8
        1.9447e8
        1.0400e7
        1.0400e7
        6.0639e6
        6.0639e6
        2.8175e7
        4.3471e7
        2.1889e7
        2.8620e7
        2.8620e7
        3.7550e8
        3.7550e8
        0
]

Installed versions

```
--------Version info---------
Polars: 0.18.2
Index type: UInt32
Platform: Linux-5.15.0-73-generic-x86_64-with-glibc2.35
Python: 3.10.6 (main, May 29 2023, 11:10:38) [GCC 11.3.0]
----Optional dependencies----
numpy: 1.24.3
pandas: 1.5.2
pyarrow: 10.0.1
connectorx: 0.3.1
deltalake:
fsspec:
matplotlib:
xlsx2csv:
xlsxwriter:
```

avimallu commented 1 year ago

This may be similar to https://github.com/pola-rs/polars/issues/9318, and a slightly clearer case where higher accuracy is needed? The Kahan summation Wikipedia page mentioned also links to similar methods for calculating variance more accurately.

I guess the key takeaway from that thread is that it shouldn't matter for practical data science applications, since 45 << 3.7550e8?
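For illustration, one of the more accurate variance methods alluded to above is Welford's online algorithm. The sketch below is plain Python, not Polars' implementation: `welford_std` and `rolling_std_reference` are hypothetical helper names, and recomputing every window from scratch is a simplification that trades speed for freedom from accumulated drift.

```python
import math


def welford_std(values, ddof=1):
    """Standard deviation via Welford's online update (hypothetical helper)."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    if count - ddof <= 0:
        return None
    return math.sqrt(m2 / (count - ddof))


def rolling_std_reference(values, window_size, ddof=1):
    """Recompute each full window independently; None until the window fills."""
    out = []
    for i in range(len(values)):
        if i + 1 < window_size:
            out.append(None)
        else:
            out.append(welford_std(values[i + 1 - window_size : i + 1], ddof))
    return out


# Tail of the series from the issue, ending in three identical values.
tail = [2169984000.00, 2169984000.00, 2820376000.00, 2820376000.00, 2820376000.00]
result = rolling_std_reference(tail, window_size=3, ddof=1)
print(result)
```

On this input the final window comes out as exactly zero, because every deviation from the running mean is zero, while the earlier windows agree with the roughly 3.7550e8 values reported in the issue.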