pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

[FEAT]: `rolling_map()` should copy `.rolling()` + `.map_elements()` #14694

Open jjfantini opened 7 months ago

jjfantini commented 7 months ago

Description

Improvement

The window_size parameter should accept str and timedelta types.

There are times when a user needs to apply a custom function to a rolling window. The beauty of the rolling_* functions is that the window can be defined dynamically, using the duration string language that the other rolling_* functions accept. This is useful for financial time series with unequal date intervals, where you want to roll over one month of data ("1mo"; in Polars, "1m" means one minute) and an integer window does not suffice because the number of rows per window changes.

In pandas, you can use .rolling("1m").apply(<func>) which will dynamically roll the function over a shifting window. In Polars, you can use:

out = data.set_sorted("date").rolling(index_column="date", period="2d").agg(
    pl.col("log_returns").map_elements(annual_vol)
)
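For comparison, here is a minimal sketch of the pandas pattern mentioned above. The "2D" window, the four sample rows, and the raw=True flag are illustrative assumptions and are not taken from the issue itself.

```python
import numpy as np
import pandas as pd

# Illustrative data: four of the daily observations used later in this thread.
dates = pd.to_datetime(["2021-01-29", "2021-01-30", "2021-01-31", "2021-02-01"])
returns = pd.Series([2.0, 4.0, 6.0, 5.0], index=dates)


def annual_vol(window: np.ndarray, trading_periods: int = 252) -> float:
    # Same formula as the annual_vol function in this issue, applied to one window.
    return float(np.sqrt(trading_periods * window.mean()))


# An offset string makes the window time-based: each window covers the
# two days ending at (and including) the current row's timestamp.
vol = returns.rolling("2D").apply(annual_vol, raw=True)
print(vol)
```

Because the window is defined by time rather than by row count, the number of rows per window can differ from one timestamp to the next.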


MarcoGorelli commented 7 months ago

Thanks for the issue - to expedite resolution could you show an example of what you'd like to do with expected output please?

jjfantini commented 7 months ago

Yes I can do that :)

Personally I am using this to add a column to a pl.DataFrame where I have a custom function _annual_vol that needs to compute the rolling volatility for every month.

So here is a use case for the built-in function rolling_std():

import datetime as dt

import numpy as np
import polars as pl

trading_periods = 252  # trading days per year
_column_name_returns: str = "log_returns"

dates = pl.Series(
    [
        dt.datetime(2021, 1, 29),
        dt.datetime(2021, 1, 30),
        dt.datetime(2021, 1, 31),
        dt.datetime(2021, 2, 1),
        dt.datetime(2021, 2, 2),
        dt.datetime(2021, 2, 3),
        dt.datetime(2021, 2, 4),
        dt.datetime(2021, 2, 5),
        dt.datetime(2021, 2, 8),
        dt.datetime(2021, 2, 9),
    ]
)

data = pl.DataFrame(
    {
        "log_returns": [2, 4, 6, 5, 3, 7, 2, 8, 4, 5],
        "date": dates
    }
)
vol = data.set_sorted("date").select(
    pl.col(_column_name_returns).rolling_std(
        window_size="2d", min_periods=1, by="date"
    )
    * np.sqrt(trading_periods)
)

Here is a similar call using rolling_map(), but there I cannot pass window_size="2d" to specify a width of 2 days; I have to use an integer. When the dataset becomes larger and I would like to roll over one month ("1mo"), I cannot just set the window to 21, because the number of trading days changes from month to month.

import datetime as dt

import numpy as np
import polars as pl

def annual_vol(data: pl.Series, trading_periods: int = 252) -> float:
    # Custom function applied to each window; returns one value per window.
    return (trading_periods * data.mean()) ** 0.5

trading_periods = 252  # trading days per year
_column_name_returns: str = "log_returns"

dates = pl.Series(
    [
        dt.datetime(2021, 1, 29),
        dt.datetime(2021, 1, 30),
        dt.datetime(2021, 1, 31),
        dt.datetime(2021, 2, 1),
        dt.datetime(2021, 2, 2),
        dt.datetime(2021, 2, 3),
        dt.datetime(2021, 2, 4),
        dt.datetime(2021, 2, 5),
        dt.datetime(2021, 2, 8),
        dt.datetime(2021, 2, 9),
    ]
)

data = pl.DataFrame(
    {"log_returns": [2, 4, 6, 5, 3, 7, 2, 8, 4, 5], "date": dates}
)
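# Desired usage (not supported at the time of writing):
# rolling_map() only accepts an integer window_size.
vol = data.set_sorted("date").select(
    pl.col(_column_name_returns).rolling_map(
        annual_vol, window_size="2d", min_periods=1
    )
    * np.sqrt(trading_periods)
)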

You can get similar functionality in Polars using the .rolling() and .map_elements() functions:

vol = data.set_sorted("date").rolling(index_column="date", period="2d").agg(
    pl.col("log_returns").map_elements(annual_vol)
)

BUT, I think that this should be integrated into the .rolling_map function, since it seems redundant to have both available while one lacks a feature of the other?

There should be a clarification on using .rolling() that a timedelta parameter for period only aggregates over dates that actually fall inside the time window. If a weekend is skipped and those dates are not available in the data, then with the rolling().agg() logic the row before the gap is not included in the calculation. It should be included, or the user should be allowed to decide.
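To make the weekend case concrete, here is a dependency-free sketch that models a time-based window next to a row-count window. It assumes a right-closed window (t - period, t], which I understand to be the default for .rolling(); the function names are made up for illustration.

```python
import datetime as dt

# Last four rows of the sample data: Thu, Fri, then Mon, Tue (weekend missing).
dates = [
    dt.datetime(2021, 2, 4),
    dt.datetime(2021, 2, 5),
    dt.datetime(2021, 2, 8),
    dt.datetime(2021, 2, 9),
]
values = [2, 8, 4, 5]
period = dt.timedelta(days=2)


def time_window(i: int) -> list[int]:
    # Values whose date lies in (dates[i] - period, dates[i]].
    end = dates[i]
    return [v for d, v in zip(dates, values) if end - period < d <= end]


def row_window(i: int, size: int = 2) -> list[int]:
    # The last `size` rows ending at row i, regardless of their dates.
    return values[max(0, i - size + 1) : i + 1]


time_windows = [time_window(i) for i in range(len(dates))]
row_windows = [row_window(i) for i in range(len(dates))]

print(time_windows)  # [[2], [2, 8], [4], [4, 5]]
print(row_windows)   # [[2], [2, 8], [8, 4], [4, 5]]
```

For Monday 2021-02-08 the time-based window holds only that day's value, while the row-based window still includes the preceding Friday. That difference is what I think should be documented or made configurable.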

Basically, rolling_map() should copy the functionality of rolling_* polars functions and allow window_size to be timedelta or str. :)