Open hleumas opened 1 year ago
Honestly I feel the entire rolling_
API could use a redesign:
We really shouldn't have 20 different rolling_
methods. My proposal would be to treat interval-based rolling as well as index-based rolling as a kind of Series
window function (because that's what they really are - window functions).
I would propose two new methods, rolling
for index-based rolling windows, and rolling_interval
for interval-based rolling windows. They would support the following parameters:
Expr.rolling(
window_size: int | (int, int),
weights: List[float] | None,
min_periods: int | None = None,
*,
by: str | None = None,
closed: ClosedInterval = 'left',
) → Self
Expr.rolling_interval(
window_size: int | timedelta | str | (int, int) | (timedelta, timedelta) | (str, str),
min_periods: int | None = None,
*,
by: str | None = None,
closed: ClosedInterval = 'both',
) → Self
I removed the center
option because I think it should be the default for the rolling_interval
, and off by default for rolling
.
If the user wants different behavior they can specify an arbitrary double-ended window size.
I would also perhaps include a new pad
parameter to rolling
(but not rolling_interval
) that can be either None
(the current default - out of bounds = null), "repeat"
(repeat the first/last value), or a literal value to pad with.
So for your example instead of
pl.col('y').rolling_mean(window_size=2.0, by='x')
we would do
pl.col('y').mean().rolling_interval(2.0, by='x')
If there is a good way to abstract this on the physical side without recomputing the mean for every window, I am all for this.
Or were you thinking of having a naive physical implementation (calling the expression function for every window) and then let the optimizer choose the physical fast paths when available (e.g. mean, min, max). etc. This is what we currently do with groupby_rolling
.
Or were you thinking of having a naive physical implementation (calling the expression function for every window) and then let the optimizer choose the physical fast paths when available (e.g. mean, min, max). etc.
Yes, this is what I was thinking.
Would it be possible to size the window based on another column or expr, e.g.:
Expr.rolling(
window_size: int | (int, int) | Expr, # see here, size determined by another column
weights: List[float] | None,
min_periods: int | None = None,
*,
by: str | None = None,
closed: ClosedInterval = 'left',
) → Self
@mcrumiller I would not want to add that in the initial version, because it would require a completely dedicated algorithm (using a Fenwick tree or similar cumulative data structure) to not get O(n^2) performance.
@hleumas @orlp thanks a lot,this is what I want,I totally agree that we don't need so many rolling*,except there has special algorithm for performance improvement,but '''pl.col('y').mean().rolling_interval(2.0, by='x')''' this api looks a little strange,by the way if possible to design by's parameter as list?
Problem description
I am calculating rolling mean of a function y(x). I want to specify the
window
in terms ofΔx
, which works fine as long as x isdatetime
However, my usecase is for
x
being generic float. I don't really see a practical reason for limiting window sizes to timedeltas.Surely, I can simulate the desired behaviour by artificially creating some fake datetime based on
x
, but, that seems bit silly.What do you think?
Example