pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

`rolling`: add grouping by rows (no index) #12014

Open Julian-J-S opened 1 year ago

Julian-J-S commented 1 year ago

Description

The current rolling function allows for lots of cool analysis, but as far as I can tell, the basic use case of rolling over rows without any index column is not supported.

Example:

DATA = {
    "a": [1, 2, 6, 7],
}

Pandas

(
    pd.DataFrame(DATA)
    .rolling(2)
    .agg(['sum', 'max'])
)

Polars

(
    pl.DataFrame(DATA)
    .with_row_count()                    # an index column is currently required and must be sorted
    .cast({"row_nr": pl.Int64})          # the default u32 type is not supported, hence the cast
    .rolling(
        index_column="row_nr",           # I don't want an index column, but one is required; ideally None (default)
        period="2i",                     # no way to window by rows directly, for example: 2 or "2r" for 2 rows
    )
    .agg(
        sum=pl.sum("a"),
        max=pl.max("a"),
    )
)

As you can see, achieving this basic functionality is very verbose. I also saw that there are functions like rolling_sum/rolling_<...> that support fixed-row windows as well as temporal ones. It would be great to bring the same functionality to the more generic rolling function.

Desired solution

(
    pl.DataFrame(DATA)
    .rolling(
        period=2, # or "2r" for 2 rows? (also in `rolling_xxx` the same parameter is called `window_size`)
    )
    .agg(
        sum=pl.sum("a"),
    )
)
orlp commented 1 year ago

Yes, I think this should definitely be possible.

@ritchie46 Why does .rolling() not use the same argument structure as the other rolling_ functions? They have the window size / period as the first argument, with by as an optional keyword argument.

In fact the documentation for .rolling() still references the (non-existing) by argument.

xyk2000 commented 1 year ago

I also agree that the rolling method should support specifying the window size directly. That would make it more versatile; for example, it could be used in an eval context:

pl.col.xxx.list.eval(pl.element().mean().rolling(10))
ritchie46 commented 1 year ago

Because the rolling_ functions are specialized implementations that support this.

The rolling postfix can be applied to any expression that precedes it, so we cannot guarantee that the specialized implementation can be used. It is therefore restricted to what we can support at the moment.

And @orlp, I think we are now conflating the Expr rolling and the LazyFrame rolling.

ritchie46 commented 1 year ago

In any case, we should accept an expression as index_column. Then you can pass pl.arange(0, pl.count()).

Julian-J-S commented 1 year ago

I have been thinking about this a lot, and I think this functionality does NOT fit the current rolling concept; it is closer to the group_by_dynamic function. (Maybe we should rethink the naming here.)

Let me try to give an overview and comparison of rolling (polars), group_by_dynamic (polars) and rolling (pandas).

rolling (polars)

group_by_dynamic (polars)

rolling (pandas, only looking at functionality for integer-windows, supports more!)

I think the integer-window functionality of pandas is very useful and should be implemented in polars. It could get its own function or be added to group_by_dynamic.

Sidenote: imo it is currently a bit confusing that rolling and group_by_dynamic are related but have very different names, and that pandas rolling might be closer to group_by_dynamic than to rolling in polars. I also don't understand why group_by_dynamic is called dynamic, because it is not dynamic at all: it is just a fixed-size window with equidistant time steps.