pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
29.94k stars 1.93k forks source link

Allow `min_periods` in `rolling_` functions to take dynamic temporal size #13566

Open dylan-lee94 opened 9 months ago

dylan-lee94 commented 9 months ago

Description

I'm using polars.Expr.rolling_sum for stock return calculations, in particular for computing 1 month trailing stock returns. Here it would be great if min_periods ensures that the 1 month trailing return is Nullwhen there is not enough data available for the entire window size.

Let's create some dummy stock returns

from datetime import date
import polars as pl
import numpy as np

np.random.seed(42)

df = pl.date_range(date(2022, 1, 1), date(2022, 3, 1), interval="1d", eager=True).alias("CALENDAR_DATE").to_frame()
df = df.with_columns(pl.Series(name="VALUE", values=np.random.randint(90, high=110, size=df.height, dtype=int)))
df = df.with_columns(pl.col("VALUE").pct_change().alias("DAILY_RETURN")).drop_nulls()

In the current implementation, min_periods is set to 1, if window_size is a dynamic temporal size.

df1 = df.sort('CALENDAR_DATE')\
        .with_columns(((1+pl.col('DAILY_RETURN')).log().rolling_sum(by='CAL
ENDAR_DATE', window_size='1mo').exp()-1)\
        .alias('1_MONTH_TRAILING_RETURN')
        )
CALENDAR_DATE VALUE DAILY_RETURN 1_MONTH_TRAILING_RETURN
2022-02-02 109 0.135416666666667 0.135416666666667
2022-02-03 104 -0.0458715596330275 0.0833333333333335
2022-02-04 100 -0.0384615384615385 0.0416666666666667
2022-02-05 97 -0.03 0.0104166666666667
2022-02-06 96 -0.0103092783505155 2.22044604925031E-16
2022-02-07 108 0.125 0.125
2022-02-08 100 -0.0740740740740741 0.0416666666666667
2022-02-09 100 0 0.0416666666666667
2022-02-10 93 -0.07 -0.0312499999999999
2022-02-11 97 0.0430107526881721 0.0104166666666667
2022-02-12 92 -0.0515463917525773 -0.0416666666666666
2022-02-13 91 -0.0108695652173913 -0.0520833333333334
2022-02-14 101 0.10989010989011 0.0520833333333333
2022-02-15 95 -0.0594059405940594 -0.0104166666666667
2022-02-16 91 -0.0421052631578947 -0.0520833333333334
2022-02-17 90 -0.010989010989011 -0.0625
2022-02-18 101 0.122222222222222 0.0520833333333333
2022-02-19 101 0 0.0520833333333333
2022-02-20 106 0.0495049504950495 0.104166666666667
2022-02-21 99 -0.0660377358490566 0.03125
2022-02-22 105 0.0606060606060606 0.09375
2022-02-23 104 -0.00952380952380952 0.0833333333333333
2022-02-24 104 0 0.0833333333333333
2022-02-25 108 0.0384615384615385 0.125
2022-02-26 101 -0.0648148148148148 0.0520833333333335
2022-02-27 109 0.0792079207920792 0.135416666666667
2022-02-28 92 -0.155963302752294 -0.0416666666666664
2022-03-01 94 0.0217391304347826 -0.0208333333333329
2022-03-02 108 0.148936170212766 -0.00917431192660512
2022-03-03 96 -0.111111111111111 -0.0769230769230767
2022-03-04 98 0.0208333333333333 -0.0199999999999998
2022-03-05 96 -0.0204081632653061 -0.0103092783505152
2022-03-06 107 0.114583333333333 0.114583333333333
2022-03-07 93 -0.130841121495327 -0.138888888888889

It would be great to allow min_periods to take a dynamic temporal size, similarly to:

df2 = df.sort('CALENDAR_DATE')\
        .with_columns(((1+pl.col('DAILY_RETURN')).log().rolling_sum(by='CALENDAR_DATE', window_size='1mo', min_periods='1mo').exp()-1)\
        .alias('1_MONTH_TRAILING_RETURN')
        )
CALENDAR_DATE VALUE DAILY_RETURN 1_MONTH_TRAILING_RETURN
2022-02-02 109 0.135416666666667
2022-02-03 104 -0.0458715596330275
2022-02-04 100 -0.0384615384615385
2022-02-05 97 -0.03
2022-02-06 96 -0.0103092783505155
2022-02-07 108 0.125
2022-02-08 100 -0.0740740740740741
2022-02-09 100 0
2022-02-10 93 -0.07
2022-02-11 97 0.0430107526881721
2022-02-12 92 -0.0515463917525773
2022-02-13 91 -0.0108695652173913
2022-02-14 101 0.10989010989011
2022-02-15 95 -0.0594059405940594
2022-02-16 91 -0.0421052631578947
2022-02-17 90 -0.010989010989011
2022-02-18 101 0.122222222222222
2022-02-19 101 0
2022-02-20 106 0.0495049504950495
2022-02-21 99 -0.0660377358490566
2022-02-22 105 0.0606060606060606
2022-02-23 104 -0.00952380952380952
2022-02-24 104 0
2022-02-25 108 0.0384615384615385
2022-02-26 101 -0.0648148148148148
2022-02-27 109 0.0792079207920792
2022-02-28 92 -0.155963302752294
2022-03-01 94 0.0217391304347826
2022-03-02 108 0.148936170212766 -0.00917431192660512
2022-03-03 96 -0.111111111111111 -0.0769230769230767
2022-03-04 98 0.0208333333333333 -0.0199999999999998
2022-03-05 96 -0.0204081632653061 -0.0103092783505152
2022-03-06 107 0.114583333333333 0.114583333333333
2022-03-07 93 -0.130841121495327 -0.138888888888889

I am not entirely sure, but this might already be addressed by #12049.

orlp commented 9 months ago

Do I understand you correctly that you'd want a mininum_span parameter such that when the samples in the window span a period in length less than this quantity null is returned?

If so we could consider adding that after the rework into rolling and rolling_interval, but it would be blocking work right now.

dylan-lee94 commented 9 months ago

Yes exactly. In the above described example, with a dynamic mininum_span argument of 1moI would expect the function to return values once the window spans a period length from 2022-02-02 to 2022-03-02 (depending on closed argument).