pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
30.38k stars 1.97k forks source link

Allow for non-temporal columns in `by` for `rolling_mean` #10477

Open hleumas opened 1 year ago

hleumas commented 1 year ago

Problem description

I am calculating rolling mean of a function y(x). I want to specify the window in terms of Δx, which works fine as long as x is datetime

However, my usecase is for x being generic float. I don't really see a practical reason for limiting window sizes to timedeltas.

Surely, I can simulate the desired behaviour by artificially creating some fake datetime based on x, but, that seems bit silly.

What do you think?

Example

>>> df = pl.DataFrame({
...:   'x': [0.0, 1.0, 2.0, 4.0, 5.8],
...:   'y': [0, 1, 2, 3, 4],
...:   })

>>> df.select(mean = pl.col('y').rolling_mean(window_size=2.0, by='x'))
PanicException: called `Result::unwrap()` on an `Err` value: SchemaMismatch(ErrString("invalid series dtype: expected `Datetime`, got `i64`"))
orlp commented 1 year ago

Honestly I feel the entire rolling_ API could use a redesign:

We really shouldn't have 20 different rolling_ methods. My proposal would be to treat interval-based rolling as well as index-based rolling as a kind of Series window function (because that's what they really are - window functions).

I would propose two new methods, rolling for index-based rolling windows, and rolling_interval for interval-based rolling windows. They would support the following parameters:

Expr.rolling(
    window_size: int | (int, int),
    weights: List[float] | None,
    min_periods: int | None = None,
    *,
    by: str | None = None,
    closed: ClosedInterval = 'left',
) → Self

Expr.rolling_interval(
    window_size: int | timedelta | str | (int, int) | (timedelta, timedelta) | (str, str),
    min_periods: int | None = None,
    *,
    by: str | None = None,
    closed: ClosedInterval = 'both',
) → Self

I removed the center option because I think it should be the default for the rolling_interval, and off by default for rolling. If the user wants different behavior they can specify an arbitrary double-ended window size.

I would also perhaps include a new pad parameter to rolling (but not rolling_interval) that can be either None (the current default - out of bounds = null), "repeat" (repeat the first/last value), or a literal value to pad with.

So for your example instead of

pl.col('y').rolling_mean(window_size=2.0, by='x')

we would do

pl.col('y').mean().rolling_interval(2.0, by='x')
ritchie46 commented 1 year ago

If there is a good way to abstract this on the physical side without recomputing the mean for every window, I am all for this.

Or were you thinking of having a naive physical implementation (calling the expression function for every window) and then let the optimizer choose the physical fast paths when available (e.g. mean, min, max). etc. This is what we currently do with groupby_rolling.

orlp commented 1 year ago

Or were you thinking of having a naive physical implementation (calling the expression function for every window) and then let the optimizer choose the physical fast paths when available (e.g. mean, min, max). etc.

Yes, this is what I was thinking.

mcrumiller commented 1 year ago

Would it be possible to size the window based on another column or expr, e.g.:

Expr.rolling(
    window_size: int | (int, int) | Expr, # see here, size determined by another column
    weights: List[float] | None,
    min_periods: int | None = None,
    *,
    by: str | None = None,
    closed: ClosedInterval = 'left',
) → Self
orlp commented 1 year ago

@mcrumiller I would not want to add that in the initial version, because it would require a completely dedicated algorithm (using a Fenwick tree or similar cumulative data structure) to not get O(n^2) performance.

yfclark commented 1 year ago

@hleumas @orlp thanks a lot,this is what I want,I totally agree that we don't need so many rolling*,except there has special algorithm for performance improvement,but '''pl.col('y').mean().rolling_interval(2.0, by='x')''' this api looks a little strange,by the way if possible to design by's parameter as list?