Open supermarin opened 1 year ago
@ritchie46 ping in case this got lost in the issues. Let me know if you're too busy to add it and I might try to take a stab
here's one way you could do it manually for now, if your points are evenly spaced:
In [63]: def slope(x, y):
    ...:     numerator = ((x - x.mean())*(y - y.mean())).sum()
    ...:     denominator = ((x - x.mean())**2).sum()
    ...:     return (numerator/denominator).alias('slope')
    ...:
In [64]: df.group_by_rolling(pl.col('ts'), period='5d').agg(slope(pl.col('ts').cast(pl.Int64), pl.col('n')))
Out[64]:
shape: (367, 2)
┌────────────┬───────────┐
│ ts         ┆ slope     │
│ ---        ┆ ---       │
│ date       ┆ f64       │
╞════════════╪═══════════╡
│ 2020-01-01 ┆ NaN       │
│ 2020-01-02 ┆ -0.502773 │
│ 2020-01-03 ┆ -0.530439 │
│ 2020-01-04 ┆ -0.248335 │
│ …          ┆ …         │
│ 2020-12-29 ┆ 0.231307  │
│ 2020-12-30 ┆ 0.544151  │
│ 2020-12-31 ┆ 0.33194   │
│ 2021-01-01 ┆ -0.007023 │
└────────────┴───────────┘
Eyeballing the last point, it seems to match scipy.stats:
In [54]: stats.linregress(np.arange(5), df['n'].tail(5)).slope
Out[54]: -0.007022641517739636
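As a dependency-free cross-check of the formula above, here is a minimal pure-Python sketch. The helper names `ols_slope` and `rolling_slope` are hypothetical (they are not Polars or SciPy APIs), and the sketch assumes evenly spaced points, so each window uses x = 0, 1, ..., k-1, just like the `np.arange(5)` check.

```python
import math


def ols_slope(xs, ys):
    """Least-squares slope: sum((x - mx) * (y - my)) / sum((x - mx) ** 2)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    numerator = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    denominator = sum((x - mx) ** 2 for x in xs)
    # A single point (or constant x) gives 0/0; return NaN, as in the first output row.
    return numerator / denominator if denominator else math.nan


def rolling_slope(ys, window):
    """Slope over a trailing window of up to `window` evenly spaced points."""
    out = []
    for end in range(1, len(ys) + 1):
        chunk = ys[max(0, end - window):end]
        out.append(ols_slope(list(range(len(chunk))), chunk))
    return out


# Points on the line y = 2x + 1: every window with at least two points has slope 2.
slopes = rolling_slope([1.0, 3.0, 5.0, 7.0, 9.0, 11.0], window=5)
```

The first window holds a single point, so its slope is undefined, which is consistent with the NaN in the first row of the Polars output.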
@MarcoGorelli thanks! I still need to measure the performance of agg(), but using .apply() with a Python function performed significantly slower than converting the whole DF to Pandas and doing vectorized computation with numpy.
Yeah apply is significantly slower and should only be a last resort
Could you share your code of how you achieved this though please?
@MarcoGorelli hmm. I hadn't looked at this code in a while (since before I filed this issue), and upon returning to it, it looks like I was wrong: I'm still using .rolling().apply(), but in Pandas.
In any case, this was measurably faster than a hand-rolled Polars-only solution when I wrote it.
It's the bottleneck of the algorithm it runs inside, and a vectorized solution like .rolling_corr would speed it up considerably.
Speaking of which, since these are very similar functions, I think having something like rolling_linregress that exposes the same params as scipy's linregress would be even better, since I'm doing 2-3 of these computations one after the other.
This is a snippet from what I'm running ATM:
a = pl.DataFrame(...)
# TODO: figure out how to do this fast without Pandas.
b = a.to_pandas()
b["slope"] = (
    b.groupby("ticker")["log"]
    .rolling(90)
    .apply(_compute_slope)
    .reset_index(0, drop=True)
)
c = (
    pl.from_pandas(b)
    .with_columns(...)
)
# Linear regression
# Param data: pd.Series (one rolling window; .apply() passes a Series by default)
def _compute_slope(data):
    return np.polyfit(data.index.values, data.values, 1)[0]
thanks - how does the polars agg solution I posted above compare with pandas rolling apply?
Instead of adding yet another specific function I would rather see future automatic in-engine optimizations to compute rolling sums/means and such efficiently automatically. Ideally @MarcoGorelli's solution should be translated automatically to a fast implementation.
@MarcoGorelli not sure I'm able to get it working because my X axis contains a series of numbers (1,2,3...n) and not dates. As far as I understand, group_by_rolling and group_by_dynamic only work with datetime.
It doesn't have to be a datetime - check the docs:
In case of a group_by_rolling on an integer column, the windows are defined by:
"1i" # length 1
"10i" # length 10
(having said that, the docs could probably do with an example of this...)
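As a rough illustration of what an integer period such as "3i" means, the pure-Python sketch below lists which index values fall into each window. It assumes the default right-closed window (t - period, t]; `integer_windows` is a hypothetical helper written for this comment, not a Polars function.

```python
def integer_windows(index, period):
    """For each index value t, collect the index values in the window (t - period, t]."""
    return [[i for i in index if t - period < i <= t] for t in index]


# The index need not be contiguous: gaps simply leave fewer rows in a window.
windows = integer_windows([1, 2, 3, 5, 8], period=3)
```

With a contiguous index like 1, 2, 3, ..., n, every window past the first few rows contains exactly `period` rows, which is the "length" the docs refer to.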
@MarcoGorelli sorry for getting back to you so late on this. I achieved a 4x speedup using .rolling().agg(slope) compared to converting to Pandas and using NumPy, which is amazing! I will try to find time to benchmark this workaround against .rolling_corr (it's almost the same function, so I expect it to perform similarly) and will report back here.
This was not entirely trivial, since I had one more dimension to group over, and using .over() in aggregations is not supported in Polars yet. Expr.rolling also lacks the by= parameter that DataFrame.rolling has, so I couldn't use that either.
After much trial and error, I'm not very happy with how I made it work: I needed to store the first dataframe in a variable, compute the slope and store it in a second dataframe, then join them.
Agree with @orlp, and also had one more thought: since these are quite common operations, do you think it would be viable to have a method like linregress from scipy?
For example, in order to compute the r-value (rolling_corr), you already need to compute everything that goes into the slope.
Computing them separately throws away work that was already done and repeats it from scratch.
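To make the shared work concrete, here is a minimal pure-Python sketch that derives slope, intercept, and r-value from one set of sums. The function name `linregress_stats` is hypothetical and only loosely modeled on what scipy's linregress returns; it is not an existing Polars or SciPy API.

```python
import math


def linregress_stats(xs, ys):
    """Return (slope, intercept, r) computed from one set of shared sums."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx                # needs sxy and sxx
    intercept = my - slope * mx      # reuses the means
    r = sxy / math.sqrt(sxx * syy)   # reuses sxy and sxx, adds only syy
    return slope, intercept, r


slope, intercept, r = linregress_stats([0, 1, 2, 3], [1.0, 2.0, 2.0, 4.0])
```

The only quantity the r-value needs beyond the slope is the sum of squared deviations of y, which is why computing the two in separate rolling passes duplicates most of the work.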
In fact, the slope can be computed like this:
def slope(x: pl.Expr, y: pl.Expr) -> pl.Expr:
    """
    Calculate the slope of a linear regression line between x and y.

    Parameters
    ----------
    x : pl.Expr
        The x values of the linear regression line.
    y : pl.Expr
        The y values of the linear regression line.

    Returns
    -------
    pl.Expr
        The slope of the linear regression line.
    """
    # Expr.std() is used because pl.std() expects a column name, not an expression.
    return (pl.corr(x, y) * y.std()) / x.std()
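The identity behind the corr-and-std expression above follows from the textbook definitions of sample covariance and Pearson correlation (stated here as general facts, not taken from the Polars docs), provided both standard deviations use the same ddof:

```latex
\text{slope}
  = \frac{\operatorname{cov}(x, y)}{\operatorname{var}(x)}
  = \frac{r_{xy}\, s_x\, s_y}{s_x^{2}}
  = r_{xy}\, \frac{s_y}{s_x}
```

Here $r_{xy}$ is the correlation and $s_x$, $s_y$ are the standard deviations of x and y.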
I hope that one day we can compute a rolling slope with something like this, instead of adding too many rolling_* functions:
df.with_columns([
    pl.rolling(period).apply(
        slope(pl.col(col_A), pl.col(col_B))
    )
])
AttributeError: module 'polars' has no attribute 'rolling'
I think that was just an example of desired syntax
To add another alternative, using https://github.com/azmyrajab/polars_ols:
df.with_row_index('x').with_columns(
    slope=pl.col.y.least_squares.rolling_ols('x', window_size=window, mode='coefficients').struct[0],
)
Even simpler with https://github.com/Yvictor/polars_ta_extension:
df.with_columns(
    slope=pl.col.y.ta.linearreg_slope(window),
)
Problem description
Following up on https://github.com/pola-rs/polars/issues/5493, .rolling_cov() and .rolling_corr() were added (thank you!); what's still missing is .rolling_slope(). I would highly appreciate it if you could add that one as well.
To remove ambiguity, I'm referring to a rolling slope of a regression line: https://support.microsoft.com/en-us/office/slope-function-11fb8f97-3117-4813-98aa-61d7e01276b9