pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

inconsistent execution results from multiple runs #14923

Open henghamao opened 6 months ago

henghamao commented 6 months ago

Reproducible example

polars 0.20.1

Log output

No response

Issue description

We use Polars to calculate rolling mean() and std() on data from a pandas DataFrame, and then convert the result back to pandas. Running the same code on the same data multiple times can give different results. Here is the complete code to reproduce the issue.

import pandas as pd
import polars as pl

def calculate_rolling_features(df, columns, windows):

    # Convert from pandas to Polars
    pl_df = pl.from_pandas(df)

    # prepare the operations for each column and window
    expressions = []

    # Loop over each window and column to create the rolling mean and std expressions
    for window in windows:
        for col in columns:
            rolling_diff_mean_expr = (
                pl.col(col).diff(window)
                .rolling_mean(window)
                .alias(f'rolling_diff_mean_{col}_{window}')
            )

            rolling_diff_std_expr = (
                pl.col(col).diff(window)
                .rolling_std(window)
                .alias(f'rolling_diff_std_{col}_{window}')
            ) 

            expressions.append(rolling_diff_mean_expr)
            expressions.append(rolling_diff_std_expr)

    # Run the operations using Polars' lazy API
    lazy_df = pl_df.lazy().with_columns(expressions)

    # Execute the lazy expressions and overwrite the pl_df variable
    pl_df = lazy_df.collect()

    # Convert back to pandas if necessary
    df = pl_df.to_pandas()
    return df

bid_price = [2,3,5,1,2,0,2,3,1,0,3,4,2,1,4,5,2,1,1,2]
ask_price = [3,3,1,4,1,0,1,2,1,3,4,1,2,3,1,2,5,6,7,1]
df = pd.DataFrame({'bid_price':bid_price, 'ask_price':ask_price})
df = calculate_rolling_features(df, ['bid_price', 'ask_price'], [3, 5])
print(df['rolling_diff_mean_bid_price_3'].head(20))

Results from the code 1st run:

0           NaN
1           NaN
2           NaN
3           NaN
4           NaN
5     -2.333333
6     -1.666667
7     -1.000000
8      1.000000
9      0.000000
10    -0.333333
11     0.333333
12     1.666667
13     1.000000
14     0.000000
15     0.333333
16     1.333333
17     0.333333
18    -2.000000
19    -2.333333
Name: rolling_diff_mean_bid_price_3, dtype: float64

Results from the code 2nd run:

0           NaN
1           NaN
2           NaN
3           NaN
4           NaN
5           NaN
6           NaN
7     -1.666667
8     -1.000000
9     -1.333333
10     0.333333
11     1.000000
12     1.333333
13     0.333333
14     1.000000
15     2.000000
16     1.333333
17    -0.333333
18    -1.000000
19    -1.000000
Name: rolling_diff_mean_bid_price_3, dtype: float64

Across multiple runs, we sometimes get wrong results like those of the 2nd run, roughly 1 wrong result in every 20 runs.

Expected behavior

We expect every run to produce the correct results, as in the 1st run.
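
To check which of the two posted outputs is the correct one, the expression can be recomputed by hand. The following is a minimal sketch in plain Python; `diff`, `rolling_mean`, and `matches` are helpers written for this check, not Polars functions, and `None` stands in for the NaN entries of the posted output. Under these assumptions, the 1st run agrees with `diff(3)` followed by `rolling_mean(3)`, while the values posted for the 2nd run coincide with `diff(5)` followed by `rolling_mean(3)`, i.e. the diff from the other window combined with this window's rolling mean. This is only an observation about the numbers, not a confirmed diagnosis.

```python
def diff(values, n):
    """values[i] - values[i - n]; None where there is no value n steps back."""
    return [None if i < n else values[i] - values[i - n] for i in range(len(values))]


def rolling_mean(values, window):
    """Mean of the trailing `window` entries; None unless all of them are present."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        if len(chunk) < window or any(v is None for v in chunk):
            out.append(None)
        else:
            out.append(sum(chunk) / window)
    return out


def matches(computed, posted, tol=1e-6):
    """Element-wise comparison; None on both sides counts as equal."""
    return len(computed) == len(posted) and all(
        (c is None and p is None)
        or (c is not None and p is not None and abs(c - p) <= tol)
        for c, p in zip(computed, posted)
    )


bid_price = [2, 3, 5, 1, 2, 0, 2, 3, 1, 0, 3, 4, 2, 1, 4, 5, 2, 1, 1, 2]

# Values copied from the two outputs posted above (NaN written as None).
first_run = [None] * 5 + [
    -2.333333, -1.666667, -1.0, 1.0, 0.0, -0.333333, 0.333333, 1.666667,
    1.0, 0.0, 0.333333, 1.333333, 0.333333, -2.0, -2.333333,
]
second_run = [None] * 7 + [
    -1.666667, -1.0, -1.333333, 0.333333, 1.0, 1.333333, 0.333333,
    1.0, 2.0, 1.333333, -0.333333, -1.0, -1.0,
]

expected = rolling_mean(diff(bid_price, 3), 3)  # what the expression asks for
mixed = rolling_mean(diff(bid_price, 5), 3)     # diff window taken from the other loop iteration

print("1st run matches diff(3) + rolling_mean(3):", matches(expected, first_run))
print("2nd run matches diff(5) + rolling_mean(3):", matches(mixed, second_run))
```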

Installed versions

polars 0.20.1

henghamao commented 6 months ago

We have tried the latest version, 0.20.14, and still hit the issue.

sahuagin commented 5 months ago

Can you try multiple runs with POLARS_MAX_THREADS=1 and see if the results stabilize?

henghamao commented 5 months ago

I added "POLARS_MAX_THREADS=1" to the code, but the results are still not stable.

import pandas as pd
import polars as pl

def calculate_rolling_features(df, columns, windows):

    # Convert from pandas to Polars
    pl_df = pl.from_pandas(df)

    # prepare the operations for each column and window
    expressions = []

    # Loop over each window and column to create the rolling mean and std expressions
    for window in windows:
        for col in columns:
            rolling_diff_mean_expr = (
                pl.col(col).diff(window)
                .rolling_mean(window)
                .alias(f'rolling_diff_mean_{col}_{window}')
            )

            rolling_diff_std_expr = (
                pl.col(col).diff(window)
                .rolling_std(window)
                .alias(f'rolling_diff_std_{col}_{window}')
            ) 

            expressions.append(rolling_diff_mean_expr)
            expressions.append(rolling_diff_std_expr)

    # Run the operations using Polars' lazy API
    lazy_df = pl_df.lazy().with_columns(expressions)

    # Execute the lazy expressions and overwrite the pl_df variable
    pl_df = lazy_df.collect()

    # Convert back to pandas if necessary
    df = pl_df.to_pandas()
    return df

POLARS_MAX_THREADS=1
bid_price = [2,3,5,1,2,0,2,3,1,0,3,4,2,1,4,5,2,1,1,2]
ask_price = [3,3,1,4,1,0,1,2,1,3,4,1,2,3,1,2,5,6,7,1]
df = pd.DataFrame({'bid_price':bid_price, 'ask_price':ask_price})
df = calculate_rolling_features(df, ['bid_price', 'ask_price'], [3, 5])
print(df['rolling_diff_mean_bid_price_3'].head(20))
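
Note that the line `POLARS_MAX_THREADS=1` in the script above only creates an ordinary Python variable. Polars takes this setting from the process environment, so it has to be in place before Polars is imported. One way to test the suggestion is to launch the script in fresh processes with the variable set and count the distinct outputs. The following is a minimal sketch: `run_repeatedly` is a hypothetical helper, and the demo command merely echoes the variable so the block runs without Polars installed.

```python
import os
import subprocess
import sys
from collections import Counter


def run_repeatedly(command, runs=20, env_overrides=None):
    """Run `command` in `runs` fresh processes and count each distinct stdout."""
    env = dict(os.environ)
    env.update(env_overrides or {})  # e.g. {"POLARS_MAX_THREADS": "1"}
    outputs = Counter()
    for _ in range(runs):
        result = subprocess.run(
            command, env=env, capture_output=True, text=True, check=True
        )
        outputs[result.stdout] += 1
    return outputs


# Demo command: a child process that prints the variable it received.
# To test the reproduction, replace it with [sys.executable, "<path to the script>"].
demo_command = [
    sys.executable,
    "-c",
    "import os; print(os.environ.get('POLARS_MAX_THREADS'))",
]
counts = run_repeatedly(demo_command, runs=3, env_overrides={"POLARS_MAX_THREADS": "1"})

# More than one key in `counts` means the runs did not all print the same thing.
print(len(counts), "distinct output(s)")
```

With the reproduction script as the command, a single distinct output over many runs would suggest the results are stable under one thread, while several distinct outputs would show the problem persists.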

EternalMoment commented 5 months ago

Yeah, I can reproduce this issue on 0.20.15.