pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

feature: .rolling_slope() #8861

Open supermarin opened 1 year ago

supermarin commented 1 year ago

Problem description

Following up on https://github.com/pola-rs/polars/issues/5493: .rolling_cov() and .rolling_corr() were added (thank you!), but .rolling_slope() is still missing. I would highly appreciate it if you could add that one as well.

To remove ambiguity, I'm referring to a rolling slope of a regression line: https://support.microsoft.com/en-us/office/slope-function-11fb8f97-3117-4813-98aa-61d7e01276b9
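
For reference, here is a minimal pure-Python sketch of what is being requested, assuming a trailing window of a fixed number of points. The function names `regression_slope` and `rolling_slope` are made up for illustration and are not Polars APIs; the formula is the least-squares slope that the linked SLOPE page describes.

```python
def regression_slope(xs, ys):
    """Least-squares slope of y on x: sum((x - mx) * (y - my)) / sum((x - mx)^2)."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("xs and ys must be non-empty and of equal length")
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    denominator = sum((x - x_mean) ** 2 for x in xs)
    # A window whose x values are all identical has no defined slope.
    if denominator == 0:
        return float("nan")
    return numerator / denominator


def rolling_slope(xs, ys, window):
    """Slope over each trailing window of `window` points (None until the window is full)."""
    out = []
    for end in range(1, len(xs) + 1):
        if end < window:
            out.append(None)
        else:
            start = end - window
            out.append(regression_slope(xs[start:end], ys[start:end]))
    return out
```

This recomputes every window from scratch, so it only pins down the expected values; it says nothing about how a fast implementation should work.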

supermarin commented 1 year ago

@ritchie46 ping in case this got lost in the issues. Let me know if you're too busy to add it and I might try to take a stab at it.

MarcoGorelli commented 1 year ago

here's one way you could do it manually for now, if your points are evenly spaced:

In [63]: def slope(x, y):
    ...:     numerator = ((x - x.mean())*(y - y.mean())).sum()
    ...:     denominator = ((x - x.mean())**2).sum()
    ...:     return (numerator/denominator).alias('slope')
    ...:

In [64]: df.group_by_rolling(pl.col('ts'), period='5d').agg(slope(pl.col('ts').cast(pl.Int64), pl.col('n')))
Out[64]:
shape: (367, 2)
┌────────────┬───────────┐
│ ts         ┆ slope     │
│ ---        ┆ ---       │
│ date       ┆ f64       │
╞════════════╪═══════════╡
│ 2020-01-01 ┆ NaN       │
│ 2020-01-02 ┆ -0.502773 │
│ 2020-01-03 ┆ -0.530439 │
│ 2020-01-04 ┆ -0.248335 │
│ …          ┆ …         │
│ 2020-12-29 ┆ 0.231307  │
│ 2020-12-30 ┆ 0.544151  │
│ 2020-12-31 ┆ 0.33194   │
│ 2021-01-01 ┆ -0.007023 │
└────────────┴───────────┘

Eyeballing the last point, it seems to match scipy.stats:

In [54]: stats.linregress(np.arange(5), df['n'].tail(5)).slope
Out[54]: -0.007022641517739636

supermarin commented 1 year ago

@MarcoGorelli thanks! I still need to measure the performance of agg(), but using .apply() with a Python function performed significantly slower than converting the whole DF to Pandas and doing vectorized computation with NumPy.

MarcoGorelli commented 1 year ago

Yeah, apply is significantly slower and should only be a last resort.

Could you share your code of how you achieved this though please?

supermarin commented 1 year ago

@MarcoGorelli hmm. I hadn't looked at this code in a while (since before I filed this issue), and upon returning to it, it looks like I was wrong: I'm still using .rolling().apply(), but in Pandas. In any case, this was measurably faster than the hand-rolled Polars-only solution when I wrote it. It's the bottleneck of the algo it runs inside, and a vectorized solution like .rolling_corr would speed it up a bunch.

Speaking of which, since these are very similar functions, I think having something like rolling_linregress that exposes the same params as scipy's linregress would be even better, since I'm doing 2-3 of these computations one after the other.

This is a snippet from what I'm running ATM:

    a = pl.DataFrame(...)
    # TODO: figure out how to do this fast without Pandas.

    b = a.to_pandas()
    b["slope"] = (
        b.groupby("ticker")["log"]
        .rolling(90)
        .apply(_compute_slope)
        .reset_index(0, drop=True)
    )

    c = (
        pl.from_pandas(b)
        .with_columns(...)
    )

# Linear regression slope over one rolling window.
# Param data: pd.Series (rolling().apply() passes each window as a Series)
def _compute_slope(data):
    return np.polyfit(data.index.values, data.values, 1)[0]

MarcoGorelli commented 1 year ago

thanks - how does the polars agg solution I posted above compare with pandas rolling apply?

orlp commented 1 year ago

Instead of adding yet another specific function, I would rather see future in-engine optimizations that compute rolling sums, means, and the like efficiently and automatically. Ideally, @MarcoGorelli's solution should be translated automatically into a fast implementation.
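
To illustrate why a slope reduces to rolling sums, here is a rough pure-Python sketch that keeps four running sums and updates them as points enter and leave the window. It is only an illustration of the idea, not how the Polars engine works, and `rolling_slope_running_sums` is a hypothetical name.

```python
from collections import deque


def rolling_slope_running_sums(xs, ys, window):
    """Trailing-window slope in O(n) total, using running sums of x, y, x*x and x*y."""
    sum_x = sum_y = sum_xx = sum_xy = 0.0
    points = deque()  # the (x, y) pairs currently inside the window
    out = []
    for x, y in zip(xs, ys):
        # Add the newest point to the running sums.
        points.append((x, y))
        sum_x += x
        sum_y += y
        sum_xx += x * x
        sum_xy += x * y
        # Drop the oldest point once the window is over-full.
        if len(points) > window:
            old_x, old_y = points.popleft()
            sum_x -= old_x
            sum_y -= old_y
            sum_xx -= old_x * old_x
            sum_xy -= old_x * old_y
        n = len(points)
        if n < window:
            out.append(None)
            continue
        # slope = (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2), algebraically equal to the centered formula.
        denominator = n * sum_xx - sum_x * sum_x
        if denominator == 0:
            out.append(float("nan"))
        else:
            out.append((n * sum_xy - sum_x * sum_y) / denominator)
    return out
```

One caveat with this formulation: uncentered running sums can lose precision when the values are large relative to their spread, so a real implementation would probably need a more careful update rule.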

supermarin commented 1 year ago

> thanks - how does the polars agg solution I posted above compare with pandas rolling apply?

@MarcoGorelli not sure I'm able to get it working because my X axis contains a series of numbers (1,2,3...n) and not dates. As far as I understand, group_by_rolling and group_by_dynamic only work with datetime.

MarcoGorelli commented 1 year ago

It doesn't have to be a datetime - check the docs:

In case of a group_by_rolling on an integer column, the windows are defined by:

"1i" # length 1

"10i" # length 10

(having said that, the docs could probably do with an example of this...)

supermarin commented 10 months ago

@MarcoGorelli sorry for getting back to you so late on this. I achieved a 4x speedup using .rolling().agg(slope) compared to converting to Pandas and using NumPy, which is amazing! I'll try to find time to benchmark this workaround against .rolling_corr (it's almost the same function, so I expect it to perform similarly) and will report back here.

This was not trivial, since I had one more dimension to group over, and using .over() in aggregations is not supported in Polars yet. Expr.rolling also lacks the by= parameter that DataFrame.rolling has, so I couldn't use that either. After much trial and error, I'm not happy with how I made it work: I had to store the first dataframe in a variable, compute the slope and store it in a second df, then join them.

Agree with @orlp, and I also had one more thought: since these are quite common operations, do you think it would be viable to have a method like linregress from scipy?
For example, to compute the r-value (rolling_corr), you need to compute all the ingredients of the slope anyway. Computing them separately throws away work that was already done and repeats it from scratch.
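
As a rough illustration of the shared work, the sketch below derives slope, intercept and r-value from the same three centered sums. `LinearFit` and `linear_fit` are hypothetical names, and this covers only three of the fields that scipy's linregress returns.

```python
import math
from typing import NamedTuple


class LinearFit(NamedTuple):
    slope: float
    intercept: float
    rvalue: float


def linear_fit(xs, ys) -> LinearFit:
    """Slope, intercept and correlation from one pass over the centered sums."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("xs and ys must be non-empty and of equal length")
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # The three sums that every output below is built from.
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    nan = float("nan")
    slope = sxy / sxx if sxx != 0 else nan
    intercept = y_mean - slope * x_mean
    rvalue = sxy / math.sqrt(sxx * syy) if sxx != 0 and syy != 0 else nan
    return LinearFit(slope, intercept, rvalue)
```

For example, `linear_fit([0, 1, 2, 3], [1, 3, 5, 7])` gives a slope of 2, an intercept of 1 and an r-value of 1, since the points lie exactly on y = 2x + 1.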

graceyangfan commented 10 months ago

In fact, the slope can be computed like this:

def slope(x: pl.Expr, y: pl.Expr) -> pl.Expr:
    """
    Calculate the slope of a linear regression line between x and y.

    Parameters
    ----------
    x : pl.Expr
        The x values of the linear regression line.
    y : pl.Expr
        The y values of the linear regression line.

    Returns
    -------
    pl.Expr
        The slope of the linear regression line.
    """
    # pl.std expects a column name, so call .std() on the expressions instead
    return pl.corr(x, y) * y.std() / x.std()
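
The one-liner above rests on a standard identity between the least-squares slope, the Pearson correlation and the sample standard deviations. Here $r_{xy}$, $s_x$ and $s_y$ are notation introduced for this note, and the identity assumes the covariance and both standard deviations use the same degrees-of-freedom correction:

```latex
\hat{\beta}
  = \frac{\operatorname{cov}(x, y)}{\operatorname{var}(x)}
  = \frac{r_{xy}\, s_x\, s_y}{s_x^{2}}
  = r_{xy}\, \frac{s_y}{s_x}
```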

I hope one day we can compute a rolling slope with something like this, instead of adding too many rolling_* functions:

df.with_columns([
    pl.rolling(period).apply(
    slope(pl.col(col_A),pl.col(col_B))
    )
])

FangyangJz commented 8 months ago

In fact, the slope can be computed like this:

def slope(x: pl.Expr, y: pl.Expr) -> pl.Expr:
    """
    Calculate the slope of a linear regression line between x and y.

    Parameters
    ----------
    x : pl.Expr
        The x values of the linear regression line.
    y : pl.Expr
        The y values of the linear regression line.

    Returns
    -------
    pl.Expr
        The slope of the linear regression line.
    """
    # pl.std expects a column name, so call .std() on the expressions instead
    return pl.corr(x, y) * y.std() / x.std()

I hope one day we can compute a rolling slope with something like this, instead of adding too many rolling_* functions:

df.with_columns([
    pl.rolling(period).apply(
    slope(pl.col(col_A),pl.col(col_B))
    )
])

AttributeError: module 'polars' has no attribute 'rolling'

MarcoGorelli commented 8 months ago

I think that was just an example of desired syntax

char101 commented 1 month ago

To add another alternative using https://github.com/azmyrajab/polars_ols

df.with_row_index('x').with_columns(
    slope=pl.col.y.least_squares.rolling_ols('x', window_size=window, mode='coefficients').struct[0],
)

Even simpler with https://github.com/Yvictor/polars_ta_extension

df.with_columns(
    slope=pl.col.y.ta.linearreg_slope(window),
)