pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
29.83k stars 1.92k forks source link

Slow rolling_skew performance and inconsistent signature vs other rolling methods #17339

Open jackaixin opened 3 months ago

jackaixin commented 3 months ago

Description

I was testing a few rolling group_by functions and it appears that rolling_skew in particular is very slow compared to the implementation in pandas.

Here's the dataframe I used for testing:

holidays = [datetime.date(y, 7, 4) for y in range(2000, 2025)]
dates = pl.select(
    xdt.date_range(datetime.date(2000, 7, 1), datetime.date(2024, 7, 31), interval='1bd', holidays=holidays)
    .alias('date')
)

values = np.random.rand(len(dates) * 500)
values[523], values[1040] = np.nan, np.nan  # just throwing in some nans

df = pl.LazyFrame(
    {
        't': itertools.chain.from_iterable([
            [d] * 500 
             for d in xdt.date_range(
                 datetime.date(2000, 7, 1), datetime.date(2024, 7, 31), 
                 interval='1bd', 
                 holidays=holidays,
                 eager=True
             )]),
        'id': itertools.chain.from_iterable(range(1, 501) for _ in range(len(dates))),
        'value': values
    }
).with_columns(
    pl.col('value').fill_nan(None)
).collect()

As a baseline, I used pandas and checked its performance:

df_pd = df.to_pandas().set_index(['t', 'id'])

# 422 ms ± 12.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
df_pd.groupby('id').apply(
    lambda s: s.rolling(window=5, min_periods=4).skew()
).reset_index(level=0, drop=True).sort_index()

However rolling_skew in polars is 4x slower:

# 1.63 s ± 11.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
q_rolling_skew = (
    df.lazy().with_columns(
        pl.col('value').rolling_skew(window_size=5, bias=False).over('id').alias('skew')  # why no min_periods arg?
    )
)

q_rolling_skew.collect()

Also, I see all the other rolling functions have a min_periods argument in their signatures, which is not the case for rolling_skew. This is a bit inconsistent and I wonder whether there's a particular reason for this.

In addition, this might also be a good opportunity to pick up the stale rolling_kurtosis feature request. https://github.com/pola-rs/polars/issues/4235


As a side note, I see in this issue https://github.com/pola-rs/polars/issues/2974 that @ritchie46 mentioned one can use groupby_rolling (now Dataframe.rolling) in combination with kurtosis expression to achieve rolling_kurtosis. I also gave that a try, but the performance was also slower than pandas and DataFrame.rolling doesn't handle business day (very commonly used in financial datasets) well so I had to manually add an index to the original dataframe, which is a bit inconvenient (and probably not idiomatic either)?

Something like Expr.rolling_kurtosis which returns an Expr is much preferred than the hacky DataFrame.rolling implementation. Expr.rolling_ also supports rolling window that depends only on the number of rows, which is similar to pandas. In fact, as this issue (https://github.com/pola-rs/polars/issues/12051) points out, a generic rolling+over syntax that allows chained expressions can be super handy.

Below are the two implementations I tried (on the same dataset above) which took similar amount of time.

# 2.47 s ± 231 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
q_rolling = (
    df.lazy().join(
        df.lazy().select(pl.col('t').unique()).sort('t').with_row_index('t_idx'), on='t'
    )
    .rolling(index_column='t_idx', period='5i', group_by='id').agg(
        pl.col('t').last(),
        pl.when(pl.col('value').count() >= 4)
        .then(pl.col('value').kurtosis(bias=False))
        .otherwise(None)
        .alias('kurt')
    )
    .select('t', 'id', 'kurt')
    .sort(['t', 'id'])
)

q_rolling.collect()
q_rolling_1 = (
    df.lazy().sort(['id', 't'])
    .rolling(index_column=pl.int_range(pl.len()).alias('t_idx'), period='5i', group_by='id').agg(
        pl.col('t').last(),
        pl.when(pl.col('value').count() >= 4)
        .then(pl.col('value').kurtosis(bias=False))
        .otherwise(None)
        .alias('kurt')
    )
    .select('t', 'id', 'kurt')
    .sort(['t', 'id'])
)

q_rolling_1.collect()
xuJ14 commented 3 months ago

a generic rolling+over syntax that allows chained expressions can be super handy.

DataFrame.rolling doesn't handle business day (very commonly used in financial datasets) well so I had to manually add an index to the original dataframe, which is a bit inconvenient

Can't agree more

firmai commented 2 months ago

Hopeful the speed for kurtosis could be picked up too, not sure what is happening behind the scenes.

ritchie46 commented 2 months ago

DataFrame.rolling doesn't handle business day (very commonly used in financial datasets) well

@MarcoGorelli Can't do this with the businessday expressions?

MarcoGorelli commented 2 months ago

I think we can do this 😎 add_business_days / business_day_count have been in for a few months without complaints (as far I've seen) so I think it's time to bring them to some other methods which accept time offsets

xuJ14 commented 1 month ago

I think, python_holiday is not flexible and reliable enough. As far as I know, they only include NYSE. What if some people want financial holidays from other market, such as London or JPX?

MarcoGorelli commented 1 month ago

it's up to the user to provide the list of holidays, it's up to you if you if you use python_holiday or whatever other calendar 📆