Open MariusMerkleQC opened 1 month ago
There is Expr.rolling
df_fake.with_columns(
    pl.col("value").mean().rolling(index_column="timestamp", period="5y")
    .alias("cumulative_mean")
)
# shape: (3, 4)
# ┌─────────────────────┬───────┬─────┬─────────────────┐
# │ timestamp           ┆ value ┆ key ┆ cumulative_mean │
# │ ---                 ┆ ---   ┆ --- ┆ ---             │
# │ datetime[μs]        ┆ i32   ┆ str ┆ f64             │
# ╞═════════════════════╪═══════╪═════╪═════════════════╡
# │ 2023-01-01 00:00:00 ┆ 1     ┆ a   ┆ 1.0             │
# │ 2024-01-01 00:00:00 ┆ 2     ┆ b   ┆ 2.0             │
# │ 2024-01-01 00:00:00 ┆ 3     ┆ c   ┆ 2.0             │
# └─────────────────────┴───────┴─────┴─────────────────┘
And dedicated rolling aggs: Expr.rolling_mean_by
df_fake.with_columns(
    pl.col("value").rolling_mean_by("timestamp", window_size="5y")
    .alias("cumulative_mean")
)
# shape: (3, 4)
# ┌─────────────────────┬───────┬─────┬─────────────────┐
# │ timestamp           ┆ value ┆ key ┆ cumulative_mean │
# │ ---                 ┆ ---   ┆ --- ┆ ---             │
# │ datetime[μs]        ┆ i32   ┆ str ┆ f64             │
# ╞═════════════════════╪═══════╪═════╪═════════════════╡
# │ 2023-01-01 00:00:00 ┆ 1     ┆ a   ┆ 1.0             │
# │ 2024-01-01 00:00:00 ┆ 2     ┆ b   ┆ 2.0             │
# │ 2024-01-01 00:00:00 ┆ 3     ┆ c   ┆ 2.0             │
# └─────────────────────┴───────┴─────┴─────────────────┘
Yes, but I'm shying away from using these as they are considered unstable...
Description
Problem
When using pl.DataFrame.rolling(), it is only possible to compute aggregated values, but sometimes I would just like to keep a certain column as it is.
Example
Imagine that I have some fake data in df_fake. I would like to compute a pl.DataFrame that keeps the columns "timestamp" and "key" but adds the cumulative mean up to that point in time. This is not possible using .rolling() because there is no operation that just keeps the element. Using .last(), as shown below, fails if there are equal values in the index_column. The only workaround I have found is to horizontally concatenate part of the original df_fake to the rolled data frame, which doesn't look nice at all.
Suggestion
What about introducing an optional argument keep_cols: list[str] that just keeps the listed columns as they are in the original df_fake, so that they don't get lost in the .rolling() operation?