Open mmcdermott opened 5 months ago
hi @mmcdermott - to expedite resolution, could you please include a simple snippet which shows something like:

I'm starting with medical data, where I have columns `id`, `timestamp`, ...data_columns..., and `boundary_column`, where `boundary_column` is a boolean. Grouped by `id`, I would like to aggregate over the data columns between each row of my dataframe and either the immediately subsequent or prior row where `boundary_column` is True, subject to some potential temporal offset evaluated over the `timestamp` column.
Here is roughly the API I would like, realized via a function that implements this capability for a limited set of aggregations in pure Python, with doctest examples in the docstring. The function can be found here: https://github.com/justin13601/ACES/blob/174d6b6522e7220b1a65f2279f9411a089873ede/src/aces/aggregate.py#L370
def boolean_expr_bound_sum(
df: pl.DataFrame,
boundary_expr: pl.Expr,
mode: str,
closed: str,
offset: timedelta = timedelta(0),
) -> pl.DataFrame:
"""Sums all columns of ``df`` between each row plus an offset and the next per-subject satisfying event.
TODO: Use https://hypothesis.readthedocs.io/en/latest/quickstart.html to test this function.
Performs a boolean-expression-bounded summation over the columns of ``df``. The logic of this is as
follows.
- If ``mode`` is ``bound_to_row``, then that means that the _left_ endpoint of the
window must correspond to an instance where ``boundary_expr`` is `True` and the _right_ endpoint will
correspond to a row in the dataframe.
- If ``mode`` is ``row_to_bound``, then that means that the _right_ endpoint of the
window must correspond to an instance where ``boundary_expr`` is `True` and the _left_ endpoint will
correspond to a row in the dataframe.
- If ``closed`` is ``'none'``, then neither the associated left row endpoint nor the associated right
row endpoint will be included in the summation.
- If ``closed`` is ``'both'``, then both the associated left row endpoint and the associated right
row endpoint will be included in the summation.
- If ``closed`` is ``'left'``, then only the associated left row endpoint will be included in the
summation.
- If ``closed`` is ``'right'``, then only the associated right row endpoint will be included in the
summation.
- The output dataframe will have the same number of rows and order as the input dataframe. Each row will
correspond to the aggregation over the matching window that uses that row as the "row" side of the
terminating aggregation (regardless of whether that corresponds to the left or the right endpoint, and
regardless of whether that endpoint would actually be included in the calculation based on
``closed``). The associated ``boundary_expr`` side of the endpoint will be the nearest possible
``boundary_expr`` endpoint that produces a non-empty set of rows over which to aggregate. Note that
this may depend on the value of ``closed`` -- e.g., a row with ``boundary_expr`` being `True` can
produce a one-element set of rows to aggregate if ``closed`` is ``'both'``, but this is not possible
if ``closed`` is something else, and in that case that same row may instead rely on a different row to
fill its ``boundary_expr`` endpoint when evaluated as a "row endpoint" during calculation.
- Offset further modifies this logic by applying a temporal offset of fixed size to the "row" endpoint.
All other logic stays the same.
In particular, suppose that we have the following rows and boolean boundary expression evaluations (for a
single subject):
```
Rows:                [0,     1,    2,     3,    4,    5,     6]
Boundary Expression: [False, True, False, True, True, False, False]
```
Then, we would aggregate the following rows under the following specified conditions:
```
mode | closed | aggregate_groups
bound_to_row | both | [], [1], [1, 2], [3], [4], [4, 5], [4, 5, 6]
bound_to_row | left | [], [], [1], [1, 2], [3], [4], [4, 5]
bound_to_row | right | [], [], [2], [2, 3], [4], [5], [5, 6]
bound_to_row | none | [], [], [], [2], [], [], [5]
-------------|--------|------------------------------------------------------
row_to_bound | both | [0, 1], [1], [2, 3], [3], [4], [], []
row_to_bound | left | [0], [], [2], [], [], [], []
row_to_bound | right | [1], [2, 3], [3], [4], [], [], []
row_to_bound | none | [], [2], [], [], [], [], []
```
How to think about this? The ``closed`` parameter controls where we put the endpoints that bound the
window. Consider the case where the row is row 1 (indexing from 0), we are in a ``row_to_bound`` setting,
and we have ``closed = "none"``. Here, the left endpoint (the row, as we are in ``row_to_bound``) is not
included, so our boundary on the left falls between rows 1 and 2. Our boundary on the right is just to the
left of the next row beyond that left boundary which has a True boolean value; in this case, that means
between rows 2 and 3 (as row 1 is to the left of our left endpoint), so only row 2 is aggregated. In
contrast, if ``closed`` were ``"left"``, then row 1 itself would be a possible bound row to use on the
right, and thus, by virtue of our right endpoint being open, we would include no rows.
Args:
df: The dataframe to be aggregated. The input must be sorted in ascending order by
timestamp within each subject group. It must contain the following columns:
- A column ``subject_id`` which contains the subject ID.
- A column ``timestamp`` which contains the timestamp at which the event contained in any given
row occurred.
- A set of other columns that will be summed over the specified windows.
boundary_expr: A boolean expression which can be evaluated over the passed dataframe, and defines rows
that can serve as valid left or right boundaries for the aggregation step. The precise window over
which the dataframe will be aggregated will depend on ``mode`` and ``closed``. See above for an
explanation of the various configurations possible.
mode: The mode of the aggregation; either ``"bound_to_row"`` (windows run from a boundary event to
each row) or ``"row_to_bound"`` (windows run from each row to a boundary event). See above for
details.
closed: Which window endpoints are included in the aggregation; one of ``'both'``, ``'none'``,
``'left'``, or ``'right'``. See above for details.
offset: A temporal offset of fixed size applied to the "row" endpoint of each window. Defaults to
no offset.
Returns:
The dataframe that has been aggregated in an event bound manner according to the specified
``endpoint_expr``. This aggregation means the following:
- The output dataframe will contain the same number of rows and be in the same order (in terms of
``subject_id`` and ``timestamp``) as the input dataframe.
- The column names will also be the same as the input dataframe, plus there will be two additional
columns: ``timestamp_at_start``, containing the timestamp of the start of the aggregation window
corresponding to each row, and ``timestamp_at_end``, containing the timestamp of the end of that
window (both null if no such aggregation window exists for the row).
- The values in the predicate columns of the output dataframe will contain the sum of the values
over the permissible row ranges given the input parameters. See above for an explanation.
Examples:
>>> import polars as pl
>>> _ = pl.Config.set_tbl_width_chars(150)
>>> from datetime import datetime, timedelta
>>> df = pl.DataFrame({
... "subject_id": [1, 1, 1, 2, 2, 2, 2, 2],
... "timestamp": [
... # Subject 1
... datetime(year=1989, month=12, day=1, hour=12, minute=3),
... datetime(year=1989, month=12, day=3, hour=13, minute=14), # HAS EVENT BOUND
... datetime(year=1989, month=12, day=5, hour=15, minute=17),
... # Subject 2
... datetime(year=1989, month=12, day=2, hour=12, minute=3),
... datetime(year=1989, month=12, day=4, hour=13, minute=14),
... datetime(year=1989, month=12, day=6, hour=15, minute=17), # HAS EVENT BOUND
... datetime(year=1989, month=12, day=8, hour=16, minute=22),
... datetime(year=1989, month=12, day=10, hour=3, minute=7), # HAS EVENT BOUND
... ],
... "idx": [0, 1, 2, 3, 4, 5, 6, 7],
... "is_A": [1, 0, 1, 1, 1, 1, 0, 0],
... "is_B": [0, 1, 0, 1, 0, 1, 1, 1],
... "is_C": [0, 1, 0, 0, 0, 1, 0, 1],
... })
>>> boolean_expr_bound_sum(
... df,
... pl.col("idx").is_in([1, 4, 7]),
... "bound_to_row",
... "both",
... ).drop("idx")
shape: (8, 7)
┌────────────┬─────────────────────┬─────────────────────┬─────────────────────┬──────┬──────┬──────┐
│ subject_id ┆ timestamp ┆ timestamp_at_start ┆ timestamp_at_end ┆ is_A ┆ is_B ┆ is_C │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[μs] ┆ datetime[μs] ┆ datetime[μs] ┆ i64 ┆ i64 ┆ i64 │
╞════════════╪═════════════════════╪═════════════════════╪═════════════════════╪══════╪══════╪══════╡
│ 1 ┆ 1989-12-01 12:03:00 ┆ null ┆ null ┆ 0 ┆ 0 ┆ 0 │
│ 1 ┆ 1989-12-03 13:14:00 ┆ 1989-12-03 13:14:00 ┆ 1989-12-03 13:14:00 ┆ 0 ┆ 1 ┆ 1 │
│ 1 ┆ 1989-12-05 15:17:00 ┆ 1989-12-03 13:14:00 ┆ 1989-12-05 15:17:00 ┆ 1 ┆ 1 ┆ 1 │
│ 2 ┆ 1989-12-02 12:03:00 ┆ null ┆ null ┆ 0 ┆ 0 ┆ 0 │
│ 2 ┆ 1989-12-04 13:14:00 ┆ 1989-12-04 13:14:00 ┆ 1989-12-04 13:14:00 ┆ 1 ┆ 0 ┆ 0 │
│ 2 ┆ 1989-12-06 15:17:00 ┆ 1989-12-04 13:14:00 ┆ 1989-12-06 15:17:00 ┆ 2 ┆ 1 ┆ 1 │
│ 2 ┆ 1989-12-08 16:22:00 ┆ 1989-12-04 13:14:00 ┆ 1989-12-08 16:22:00 ┆ 2 ┆ 2 ┆ 1 │
│ 2 ┆ 1989-12-10 03:07:00 ┆ 1989-12-10 03:07:00 ┆ 1989-12-10 03:07:00 ┆ 0 ┆ 1 ┆ 1 │
└────────────┴─────────────────────┴─────────────────────┴─────────────────────┴──────┴──────┴──────┘
>>> boolean_expr_bound_sum(
... df,
... pl.col("idx").is_in([1, 4, 7]),
... "bound_to_row",
... "none",
... ).drop("idx")
shape: (8, 7)
┌────────────┬─────────────────────┬─────────────────────┬─────────────────────┬──────┬──────┬──────┐
│ subject_id ┆ timestamp ┆ timestamp_at_start ┆ timestamp_at_end ┆ is_A ┆ is_B ┆ is_C │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[μs] ┆ datetime[μs] ┆ datetime[μs] ┆ i64 ┆ i64 ┆ i64 │
╞════════════╪═════════════════════╪═════════════════════╪═════════════════════╪══════╪══════╪══════╡
│ 1 ┆ 1989-12-01 12:03:00 ┆ null ┆ null ┆ 0 ┆ 0 ┆ 0 │
│ 1 ┆ 1989-12-03 13:14:00 ┆ null ┆ null ┆ 0 ┆ 0 ┆ 0 │
│ 1 ┆ 1989-12-05 15:17:00 ┆ 1989-12-03 13:14:00 ┆ 1989-12-05 15:17:00 ┆ 0 ┆ 0 ┆ 0 │
│ 2 ┆ 1989-12-02 12:03:00 ┆ null ┆ null ┆ 0 ┆ 0 ┆ 0 │
│ 2 ┆ 1989-12-04 13:14:00 ┆ null ┆ null ┆ 0 ┆ 0 ┆ 0 │
│ 2 ┆ 1989-12-06 15:17:00 ┆ 1989-12-04 13:14:00 ┆ 1989-12-06 15:17:00 ┆ 0 ┆ 0 ┆ 0 │
│ 2 ┆ 1989-12-08 16:22:00 ┆ 1989-12-04 13:14:00 ┆ 1989-12-08 16:22:00 ┆ 1 ┆ 1 ┆ 1 │
│ 2 ┆ 1989-12-10 03:07:00 ┆ 1989-12-04 13:14:00 ┆ 1989-12-10 03:07:00 ┆ 1 ┆ 2 ┆ 1 │
└────────────┴─────────────────────┴─────────────────────┴─────────────────────┴──────┴──────┴──────┘
>>> boolean_expr_bound_sum(
... df,
... pl.col("idx").is_in([1, 4, 7]),
... "bound_to_row",
... "left",
... ).drop("idx")
shape: (8, 7)
┌────────────┬─────────────────────┬─────────────────────┬─────────────────────┬──────┬──────┬──────┐
│ subject_id ┆ timestamp ┆ timestamp_at_start ┆ timestamp_at_end ┆ is_A ┆ is_B ┆ is_C │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[μs] ┆ datetime[μs] ┆ datetime[μs] ┆ i64 ┆ i64 ┆ i64 │
╞════════════╪═════════════════════╪═════════════════════╪═════════════════════╪══════╪══════╪══════╡
│ 1 ┆ 1989-12-01 12:03:00 ┆ null ┆ null ┆ 0 ┆ 0 ┆ 0 │
│ 1 ┆ 1989-12-03 13:14:00 ┆ 1989-12-03 13:14:00 ┆ 1989-12-03 13:14:00 ┆ 0 ┆ 0 ┆ 0 │
│ 1 ┆ 1989-12-05 15:17:00 ┆ 1989-12-03 13:14:00 ┆ 1989-12-05 15:17:00 ┆ 0 ┆ 1 ┆ 1 │
│ 2 ┆ 1989-12-02 12:03:00 ┆ null ┆ null ┆ 0 ┆ 0 ┆ 0 │
│ 2 ┆ 1989-12-04 13:14:00 ┆ 1989-12-04 13:14:00 ┆ 1989-12-04 13:14:00 ┆ 0 ┆ 0 ┆ 0 │
│ 2 ┆ 1989-12-06 15:17:00 ┆ 1989-12-04 13:14:00 ┆ 1989-12-06 15:17:00 ┆ 1 ┆ 0 ┆ 0 │
│ 2 ┆ 1989-12-08 16:22:00 ┆ 1989-12-04 13:14:00 ┆ 1989-12-08 16:22:00 ┆ 2 ┆ 1 ┆ 1 │
│ 2 ┆ 1989-12-10 03:07:00 ┆ 1989-12-10 03:07:00 ┆ 1989-12-10 03:07:00 ┆ 0 ┆ 0 ┆ 0 │
└────────────┴─────────────────────┴─────────────────────┴─────────────────────┴──────┴──────┴──────┘
>>> boolean_expr_bound_sum(
... df,
... pl.col("idx").is_in([1, 4, 7]),
... "bound_to_row",
... "right",
... ).drop("idx")
shape: (8, 7)
┌────────────┬─────────────────────┬─────────────────────┬─────────────────────┬──────┬──────┬──────┐
│ subject_id ┆ timestamp ┆ timestamp_at_start ┆ timestamp_at_end ┆ is_A ┆ is_B ┆ is_C │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[μs] ┆ datetime[μs] ┆ datetime[μs] ┆ i64 ┆ i64 ┆ i64 │
╞════════════╪═════════════════════╪═════════════════════╪═════════════════════╪══════╪══════╪══════╡
│ 1 ┆ 1989-12-01 12:03:00 ┆ null ┆ null ┆ 0 ┆ 0 ┆ 0 │
│ 1 ┆ 1989-12-03 13:14:00 ┆ null ┆ null ┆ 0 ┆ 0 ┆ 0 │
│ 1 ┆ 1989-12-05 15:17:00 ┆ 1989-12-03 13:14:00 ┆ 1989-12-05 15:17:00 ┆ 1 ┆ 0 ┆ 0 │
│ 2 ┆ 1989-12-02 12:03:00 ┆ null ┆ null ┆ 0 ┆ 0 ┆ 0 │
│ 2 ┆ 1989-12-04 13:14:00 ┆ null ┆ null ┆ 0 ┆ 0 ┆ 0 │
│ 2 ┆ 1989-12-06 15:17:00 ┆ 1989-12-04 13:14:00 ┆ 1989-12-06 15:17:00 ┆ 1 ┆ 1 ┆ 1 │
│ 2 ┆ 1989-12-08 16:22:00 ┆ 1989-12-04 13:14:00 ┆ 1989-12-08 16:22:00 ┆ 1 ┆ 2 ┆ 1 │
│ 2 ┆ 1989-12-10 03:07:00 ┆ 1989-12-04 13:14:00 ┆ 1989-12-10 03:07:00 ┆ 1 ┆ 3 ┆ 2 │
└────────────┴─────────────────────┴─────────────────────┴─────────────────────┴──────┴──────┴──────┘
>>> boolean_expr_bound_sum(
... df,
... pl.col("idx").is_in([1, 4, 7]),
... "row_to_bound",
... "both",
... ).drop("idx")
shape: (8, 7)
┌────────────┬─────────────────────┬─────────────────────┬─────────────────────┬──────┬──────┬──────┐
│ subject_id ┆ timestamp ┆ timestamp_at_start ┆ timestamp_at_end ┆ is_A ┆ is_B ┆ is_C │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[μs] ┆ datetime[μs] ┆ datetime[μs] ┆ i64 ┆ i64 ┆ i64 │
╞════════════╪═════════════════════╪═════════════════════╪═════════════════════╪══════╪══════╪══════╡
│ 1 ┆ 1989-12-01 12:03:00 ┆ 1989-12-01 12:03:00 ┆ 1989-12-03 13:14:00 ┆ 1 ┆ 1 ┆ 1 │
│ 1 ┆ 1989-12-03 13:14:00 ┆ 1989-12-03 13:14:00 ┆ 1989-12-03 13:14:00 ┆ 0 ┆ 1 ┆ 1 │
│ 1 ┆ 1989-12-05 15:17:00 ┆ null ┆ null ┆ 0 ┆ 0 ┆ 0 │
│ 2 ┆ 1989-12-02 12:03:00 ┆ 1989-12-02 12:03:00 ┆ 1989-12-04 13:14:00 ┆ 2 ┆ 1 ┆ 0 │
│ 2 ┆ 1989-12-04 13:14:00 ┆ 1989-12-04 13:14:00 ┆ 1989-12-04 13:14:00 ┆ 1 ┆ 0 ┆ 0 │
│ 2 ┆ 1989-12-06 15:17:00 ┆ 1989-12-06 15:17:00 ┆ 1989-12-10 03:07:00 ┆ 1 ┆ 3 ┆ 2 │
│ 2 ┆ 1989-12-08 16:22:00 ┆ 1989-12-08 16:22:00 ┆ 1989-12-10 03:07:00 ┆ 0 ┆ 2 ┆ 1 │
│ 2 ┆ 1989-12-10 03:07:00 ┆ 1989-12-10 03:07:00 ┆ 1989-12-10 03:07:00 ┆ 0 ┆ 1 ┆ 1 │
└────────────┴─────────────────────┴─────────────────────┴─────────────────────┴──────┴──────┴──────┘
>>> boolean_expr_bound_sum(
... df,
... pl.col("idx").is_in([1, 4, 7]),
... "row_to_bound",
... "none",
... ).drop("idx")
shape: (8, 7)
┌────────────┬─────────────────────┬─────────────────────┬─────────────────────┬──────┬──────┬──────┐
│ subject_id ┆ timestamp ┆ timestamp_at_start ┆ timestamp_at_end ┆ is_A ┆ is_B ┆ is_C │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[μs] ┆ datetime[μs] ┆ datetime[μs] ┆ i64 ┆ i64 ┆ i64 │
╞════════════╪═════════════════════╪═════════════════════╪═════════════════════╪══════╪══════╪══════╡
│ 1 ┆ 1989-12-01 12:03:00 ┆ 1989-12-01 12:03:00 ┆ 1989-12-03 13:14:00 ┆ 0 ┆ 0 ┆ 0 │
│ 1 ┆ 1989-12-03 13:14:00 ┆ null ┆ null ┆ 0 ┆ 0 ┆ 0 │
│ 1 ┆ 1989-12-05 15:17:00 ┆ null ┆ null ┆ 0 ┆ 0 ┆ 0 │
│ 2 ┆ 1989-12-02 12:03:00 ┆ 1989-12-02 12:03:00 ┆ 1989-12-04 13:14:00 ┆ 0 ┆ 0 ┆ 0 │
│ 2 ┆ 1989-12-04 13:14:00 ┆ 1989-12-04 13:14:00 ┆ 1989-12-10 03:07:00 ┆ 1 ┆ 2 ┆ 1 │
│ 2 ┆ 1989-12-06 15:17:00 ┆ 1989-12-06 15:17:00 ┆ 1989-12-10 03:07:00 ┆ 0 ┆ 1 ┆ 0 │
│ 2 ┆ 1989-12-08 16:22:00 ┆ 1989-12-08 16:22:00 ┆ 1989-12-10 03:07:00 ┆ 0 ┆ 0 ┆ 0 │
│ 2 ┆ 1989-12-10 03:07:00 ┆ null ┆ null ┆ 0 ┆ 0 ┆ 0 │
└────────────┴─────────────────────┴─────────────────────┴─────────────────────┴──────┴──────┴──────┘
>>> boolean_expr_bound_sum(
... df,
... pl.col("idx").is_in([1, 4, 7]),
... "row_to_bound",
... "left",
... ).drop("idx")
shape: (8, 7)
┌────────────┬─────────────────────┬─────────────────────┬─────────────────────┬──────┬──────┬──────┐
│ subject_id ┆ timestamp ┆ timestamp_at_start ┆ timestamp_at_end ┆ is_A ┆ is_B ┆ is_C │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[μs] ┆ datetime[μs] ┆ datetime[μs] ┆ i64 ┆ i64 ┆ i64 │
╞════════════╪═════════════════════╪═════════════════════╪═════════════════════╪══════╪══════╪══════╡
│ 1 ┆ 1989-12-01 12:03:00 ┆ 1989-12-01 12:03:00 ┆ 1989-12-03 13:14:00 ┆ 1 ┆ 0 ┆ 0 │
│ 1 ┆ 1989-12-03 13:14:00 ┆ null ┆ null ┆ 0 ┆ 0 ┆ 0 │
│ 1 ┆ 1989-12-05 15:17:00 ┆ null ┆ null ┆ 0 ┆ 0 ┆ 0 │
│ 2 ┆ 1989-12-02 12:03:00 ┆ 1989-12-02 12:03:00 ┆ 1989-12-04 13:14:00 ┆ 1 ┆ 1 ┆ 0 │
│ 2 ┆ 1989-12-04 13:14:00 ┆ 1989-12-04 13:14:00 ┆ 1989-12-10 03:07:00 ┆ 2 ┆ 2 ┆ 1 │
│ 2 ┆ 1989-12-06 15:17:00 ┆ 1989-12-06 15:17:00 ┆ 1989-12-10 03:07:00 ┆ 1 ┆ 2 ┆ 1 │
│ 2 ┆ 1989-12-08 16:22:00 ┆ 1989-12-08 16:22:00 ┆ 1989-12-10 03:07:00 ┆ 0 ┆ 1 ┆ 0 │
│ 2 ┆ 1989-12-10 03:07:00 ┆ null ┆ null ┆ 0 ┆ 0 ┆ 0 │
└────────────┴─────────────────────┴─────────────────────┴─────────────────────┴──────┴──────┴──────┘
>>> boolean_expr_bound_sum(
... df,
... pl.col("idx").is_in([1, 4, 7]),
... "row_to_bound",
... "right",
... ).drop("idx")
shape: (8, 7)
┌────────────┬─────────────────────┬─────────────────────┬─────────────────────┬──────┬──────┬──────┐
│ subject_id ┆ timestamp ┆ timestamp_at_start ┆ timestamp_at_end ┆ is_A ┆ is_B ┆ is_C │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[μs] ┆ datetime[μs] ┆ datetime[μs] ┆ i64 ┆ i64 ┆ i64 │
╞════════════╪═════════════════════╪═════════════════════╪═════════════════════╪══════╪══════╪══════╡
│ 1 ┆ 1989-12-01 12:03:00 ┆ 1989-12-01 12:03:00 ┆ 1989-12-03 13:14:00 ┆ 0 ┆ 1 ┆ 1 │
│ 1 ┆ 1989-12-03 13:14:00 ┆ 1989-12-03 13:14:00 ┆ 1989-12-03 13:14:00 ┆ 0 ┆ 0 ┆ 0 │
│ 1 ┆ 1989-12-05 15:17:00 ┆ null ┆ null ┆ 0 ┆ 0 ┆ 0 │
│ 2 ┆ 1989-12-02 12:03:00 ┆ 1989-12-02 12:03:00 ┆ 1989-12-04 13:14:00 ┆ 1 ┆ 0 ┆ 0 │
│ 2 ┆ 1989-12-04 13:14:00 ┆ 1989-12-04 13:14:00 ┆ 1989-12-04 13:14:00 ┆ 0 ┆ 0 ┆ 0 │
│ 2 ┆ 1989-12-06 15:17:00 ┆ 1989-12-06 15:17:00 ┆ 1989-12-10 03:07:00 ┆ 0 ┆ 2 ┆ 1 │
│ 2 ┆ 1989-12-08 16:22:00 ┆ 1989-12-08 16:22:00 ┆ 1989-12-10 03:07:00 ┆ 0 ┆ 1 ┆ 1 │
│ 2 ┆ 1989-12-10 03:07:00 ┆ 1989-12-10 03:07:00 ┆ 1989-12-10 03:07:00 ┆ 0 ┆ 0 ┆ 0 │
└────────────┴─────────────────────┴─────────────────────┴─────────────────────┴──────┴──────┴──────┘
>>> #### WITH OFFSET ####
>>> boolean_expr_bound_sum(
... df,
... pl.col("idx").is_in([1, 4, 7]),
... "bound_to_row",
... "both",
... offset = timedelta(days=3),
... ).drop("idx")
shape: (8, 7)
┌────────────┬─────────────────────┬─────────────────────┬─────────────────────┬──────┬──────┬──────┐
│ subject_id ┆ timestamp ┆ timestamp_at_start ┆ timestamp_at_end ┆ is_A ┆ is_B ┆ is_C │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[μs] ┆ datetime[μs] ┆ datetime[μs] ┆ i64 ┆ i64 ┆ i64 │
╞════════════╪═════════════════════╪═════════════════════╪═════════════════════╪══════╪══════╪══════╡
│ 1 ┆ 1989-12-01 12:03:00 ┆ 1989-12-03 13:14:00 ┆ 1989-12-04 12:03:00 ┆ 0 ┆ 1 ┆ 1 │
│ 1 ┆ 1989-12-03 13:14:00 ┆ 1989-12-03 13:14:00 ┆ 1989-12-06 13:14:00 ┆ 1 ┆ 1 ┆ 1 │
│ 1 ┆ 1989-12-05 15:17:00 ┆ 1989-12-03 13:14:00 ┆ 1989-12-08 15:17:00 ┆ 1 ┆ 1 ┆ 1 │
│ 2 ┆ 1989-12-02 12:03:00 ┆ 1989-12-04 13:14:00 ┆ 1989-12-05 12:03:00 ┆ 1 ┆ 0 ┆ 0 │
│ 2 ┆ 1989-12-04 13:14:00 ┆ 1989-12-04 13:14:00 ┆ 1989-12-07 13:14:00 ┆ 2 ┆ 1 ┆ 1 │
│ 2 ┆ 1989-12-06 15:17:00 ┆ 1989-12-04 13:14:00 ┆ 1989-12-09 15:17:00 ┆ 2 ┆ 2 ┆ 1 │
│ 2 ┆ 1989-12-08 16:22:00 ┆ 1989-12-10 03:07:00 ┆ 1989-12-11 16:22:00 ┆ 0 ┆ 1 ┆ 1 │
│ 2 ┆ 1989-12-10 03:07:00 ┆ 1989-12-10 03:07:00 ┆ 1989-12-13 03:07:00 ┆ 0 ┆ 1 ┆ 1 │
└────────────┴─────────────────────┴─────────────────────┴─────────────────────┴──────┴──────┴──────┘
>>> boolean_expr_bound_sum(
... df,
... pl.col("idx").is_in([1, 4, 7]),
... "bound_to_row",
... "left",
... offset = timedelta(days=3),
... ).drop("idx")
shape: (8, 7)
┌────────────┬─────────────────────┬─────────────────────┬─────────────────────┬──────┬──────┬──────┐
│ subject_id ┆ timestamp ┆ timestamp_at_start ┆ timestamp_at_end ┆ is_A ┆ is_B ┆ is_C │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[μs] ┆ datetime[μs] ┆ datetime[μs] ┆ i64 ┆ i64 ┆ i64 │
╞════════════╪═════════════════════╪═════════════════════╪═════════════════════╪══════╪══════╪══════╡
│ 1 ┆ 1989-12-01 12:03:00 ┆ 1989-12-03 13:14:00 ┆ 1989-12-04 12:03:00 ┆ 0 ┆ 1 ┆ 1 │
│ 1 ┆ 1989-12-03 13:14:00 ┆ 1989-12-03 13:14:00 ┆ 1989-12-06 13:14:00 ┆ 1 ┆ 1 ┆ 1 │
│ 1 ┆ 1989-12-05 15:17:00 ┆ 1989-12-03 13:14:00 ┆ 1989-12-08 15:17:00 ┆ 1 ┆ 1 ┆ 1 │
│ 2 ┆ 1989-12-02 12:03:00 ┆ 1989-12-04 13:14:00 ┆ 1989-12-05 12:03:00 ┆ 1 ┆ 0 ┆ 0 │
│ 2 ┆ 1989-12-04 13:14:00 ┆ 1989-12-04 13:14:00 ┆ 1989-12-07 13:14:00 ┆ 2 ┆ 1 ┆ 1 │
│ 2 ┆ 1989-12-06 15:17:00 ┆ 1989-12-04 13:14:00 ┆ 1989-12-09 15:17:00 ┆ 2 ┆ 2 ┆ 1 │
│ 2 ┆ 1989-12-08 16:22:00 ┆ 1989-12-10 03:07:00 ┆ 1989-12-11 16:22:00 ┆ 0 ┆ 1 ┆ 1 │
│ 2 ┆ 1989-12-10 03:07:00 ┆ 1989-12-10 03:07:00 ┆ 1989-12-13 03:07:00 ┆ 0 ┆ 1 ┆ 1 │
└────────────┴─────────────────────┴─────────────────────┴─────────────────────┴──────┴──────┴──────┘
>>> boolean_expr_bound_sum(
... df,
... pl.col("idx").is_in([1, 4, 7]),
... "bound_to_row",
... "none",
... offset = timedelta(days=-3),
... ).drop("idx")
shape: (8, 7)
┌────────────┬─────────────────────┬─────────────────────┬─────────────────────┬──────┬──────┬──────┐
│ subject_id ┆ timestamp ┆ timestamp_at_start ┆ timestamp_at_end ┆ is_A ┆ is_B ┆ is_C │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[μs] ┆ datetime[μs] ┆ datetime[μs] ┆ i64 ┆ i64 ┆ i64 │
╞════════════╪═════════════════════╪═════════════════════╪═════════════════════╪══════╪══════╪══════╡
│ 1 ┆ 1989-12-01 12:03:00 ┆ null ┆ null ┆ 0 ┆ 0 ┆ 0 │
│ 1 ┆ 1989-12-03 13:14:00 ┆ null ┆ null ┆ 0 ┆ 0 ┆ 0 │
│ 1 ┆ 1989-12-05 15:17:00 ┆ null ┆ null ┆ 0 ┆ 0 ┆ 0 │
│ 2 ┆ 1989-12-02 12:03:00 ┆ null ┆ null ┆ 0 ┆ 0 ┆ 0 │
│ 2 ┆ 1989-12-04 13:14:00 ┆ null ┆ null ┆ 0 ┆ 0 ┆ 0 │
│ 2 ┆ 1989-12-06 15:17:00 ┆ null ┆ null ┆ 0 ┆ 0 ┆ 0 │
│ 2 ┆ 1989-12-08 16:22:00 ┆ 1989-12-04 13:14:00 ┆ 1989-12-05 16:22:00 ┆ 0 ┆ 0 ┆ 0 │
│ 2 ┆ 1989-12-10 03:07:00 ┆ 1989-12-04 13:14:00 ┆ 1989-12-07 03:07:00 ┆ 1 ┆ 1 ┆ 1 │
└────────────┴─────────────────────┴─────────────────────┴─────────────────────┴──────┴──────┴──────┘
>>> boolean_expr_bound_sum(
... df,
... pl.col("idx").is_in([1, 4, 7]),
... "bound_to_row",
... "right",
... offset = timedelta(days=-3),
... ).drop("idx")
shape: (8, 7)
┌────────────┬─────────────────────┬─────────────────────┬─────────────────────┬──────┬──────┬──────┐
│ subject_id ┆ timestamp ┆ timestamp_at_start ┆ timestamp_at_end ┆ is_A ┆ is_B ┆ is_C │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[μs] ┆ datetime[μs] ┆ datetime[μs] ┆ i64 ┆ i64 ┆ i64 │
╞════════════╪═════════════════════╪═════════════════════╪═════════════════════╪══════╪══════╪══════╡
│ 1 ┆ 1989-12-01 12:03:00 ┆ null ┆ null ┆ 0 ┆ 0 ┆ 0 │
│ 1 ┆ 1989-12-03 13:14:00 ┆ null ┆ null ┆ 0 ┆ 0 ┆ 0 │
│ 1 ┆ 1989-12-05 15:17:00 ┆ null ┆ null ┆ 0 ┆ 0 ┆ 0 │
│ 2 ┆ 1989-12-02 12:03:00 ┆ null ┆ null ┆ 0 ┆ 0 ┆ 0 │
│ 2 ┆ 1989-12-04 13:14:00 ┆ null ┆ null ┆ 0 ┆ 0 ┆ 0 │
│ 2 ┆ 1989-12-06 15:17:00 ┆ null ┆ null ┆ 0 ┆ 0 ┆ 0 │
│ 2 ┆ 1989-12-08 16:22:00 ┆ 1989-12-04 13:14:00 ┆ 1989-12-05 16:22:00 ┆ 0 ┆ 0 ┆ 0 │
│ 2 ┆ 1989-12-10 03:07:00 ┆ 1989-12-04 13:14:00 ┆ 1989-12-07 03:07:00 ┆ 1 ┆ 1 ┆ 1 │
└────────────┴─────────────────────┴─────────────────────┴─────────────────────┴──────┴──────┴──────┘
>>> boolean_expr_bound_sum(
... df,
... pl.col("idx").is_in([1, 4, 7]),
... "row_to_bound",
... "both",
... offset = timedelta(days=3),
... ).drop("idx")
shape: (8, 7)
┌────────────┬─────────────────────┬─────────────────────┬─────────────────────┬──────┬──────┬──────┐
│ subject_id ┆ timestamp ┆ timestamp_at_start ┆ timestamp_at_end ┆ is_A ┆ is_B ┆ is_C │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[μs] ┆ datetime[μs] ┆ datetime[μs] ┆ i64 ┆ i64 ┆ i64 │
╞════════════╪═════════════════════╪═════════════════════╪═════════════════════╪══════╪══════╪══════╡
│ 1 ┆ 1989-12-01 12:03:00 ┆ null ┆ null ┆ 0 ┆ 0 ┆ 0 │
│ 1 ┆ 1989-12-03 13:14:00 ┆ null ┆ null ┆ 0 ┆ 0 ┆ 0 │
│ 1 ┆ 1989-12-05 15:17:00 ┆ null ┆ null ┆ 0 ┆ 0 ┆ 0 │
│ 2 ┆ 1989-12-02 12:03:00 ┆ 1989-12-05 12:03:00 ┆ 1989-12-10 03:07:00 ┆ 1 ┆ 3 ┆ 2 │
│ 2 ┆ 1989-12-04 13:14:00 ┆ 1989-12-07 13:14:00 ┆ 1989-12-10 03:07:00 ┆ 0 ┆ 2 ┆ 1 │
│ 2 ┆ 1989-12-06 15:17:00 ┆ 1989-12-09 15:17:00 ┆ 1989-12-10 03:07:00 ┆ 0 ┆ 1 ┆ 1 │
│ 2 ┆ 1989-12-08 16:22:00 ┆ null ┆ null ┆ 0 ┆ 0 ┆ 0 │
│ 2 ┆ 1989-12-10 03:07:00 ┆ null ┆ null ┆ 0 ┆ 0 ┆ 0 │
└────────────┴─────────────────────┴─────────────────────┴─────────────────────┴──────┴──────┴──────┘
>>> boolean_expr_bound_sum(
... df,
... pl.col("idx").is_in([1, 4, 7]),
... "row_to_bound",
... "left",
... offset = timedelta(days=3),
... ).drop("idx")
shape: (8, 7)
┌────────────┬─────────────────────┬─────────────────────┬─────────────────────┬──────┬──────┬──────┐
│ subject_id ┆ timestamp ┆ timestamp_at_start ┆ timestamp_at_end ┆ is_A ┆ is_B ┆ is_C │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[μs] ┆ datetime[μs] ┆ datetime[μs] ┆ i64 ┆ i64 ┆ i64 │
╞════════════╪═════════════════════╪═════════════════════╪═════════════════════╪══════╪══════╪══════╡
│ 1 ┆ 1989-12-01 12:03:00 ┆ null ┆ null ┆ 0 ┆ 0 ┆ 0 │
│ 1 ┆ 1989-12-03 13:14:00 ┆ null ┆ null ┆ 0 ┆ 0 ┆ 0 │
│ 1 ┆ 1989-12-05 15:17:00 ┆ null ┆ null ┆ 0 ┆ 0 ┆ 0 │
│ 2 ┆ 1989-12-02 12:03:00 ┆ 1989-12-05 12:03:00 ┆ 1989-12-10 03:07:00 ┆ 1 ┆ 2 ┆ 1 │
│ 2 ┆ 1989-12-04 13:14:00 ┆ 1989-12-07 13:14:00 ┆ 1989-12-10 03:07:00 ┆ 0 ┆ 1 ┆ 0 │
│ 2 ┆ 1989-12-06 15:17:00 ┆ 1989-12-09 15:17:00 ┆ 1989-12-10 03:07:00 ┆ 0 ┆ 0 ┆ 0 │
│ 2 ┆ 1989-12-08 16:22:00 ┆ null ┆ null ┆ 0 ┆ 0 ┆ 0 │
│ 2 ┆ 1989-12-10 03:07:00 ┆ null ┆ null ┆ 0 ┆ 0 ┆ 0 │
└────────────┴─────────────────────┴─────────────────────┴─────────────────────┴──────┴──────┴──────┘
>>> boolean_expr_bound_sum(
... df,
... pl.col("idx").is_in([1, 4, 7]),
... "row_to_bound",
... "none",
... offset = timedelta(days=-3),
... ).drop("idx")
shape: (8, 7)
┌────────────┬─────────────────────┬─────────────────────┬─────────────────────┬──────┬──────┬──────┐
│ subject_id ┆ timestamp ┆ timestamp_at_start ┆ timestamp_at_end ┆ is_A ┆ is_B ┆ is_C │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[μs] ┆ datetime[μs] ┆ datetime[μs] ┆ i64 ┆ i64 ┆ i64 │
╞════════════╪═════════════════════╪═════════════════════╪═════════════════════╪══════╪══════╪══════╡
│ 1 ┆ 1989-12-01 12:03:00 ┆ 1989-11-28 12:03:00 ┆ 1989-12-03 13:14:00 ┆ 1 ┆ 0 ┆ 0 │
│ 1 ┆ 1989-12-03 13:14:00 ┆ 1989-11-30 13:14:00 ┆ 1989-12-03 13:14:00 ┆ 1 ┆ 0 ┆ 0 │
│ 1 ┆ 1989-12-05 15:17:00 ┆ 1989-12-02 15:17:00 ┆ 1989-12-03 13:14:00 ┆ 0 ┆ 0 ┆ 0 │
│ 2 ┆ 1989-12-02 12:03:00 ┆ 1989-11-29 12:03:00 ┆ 1989-12-04 13:14:00 ┆ 1 ┆ 1 ┆ 0 │
│ 2 ┆ 1989-12-04 13:14:00 ┆ 1989-12-01 13:14:00 ┆ 1989-12-04 13:14:00 ┆ 1 ┆ 1 ┆ 0 │
│ 2 ┆ 1989-12-06 15:17:00 ┆ 1989-12-03 15:17:00 ┆ 1989-12-04 13:14:00 ┆ 0 ┆ 0 ┆ 0 │
│ 2 ┆ 1989-12-08 16:22:00 ┆ 1989-12-05 16:22:00 ┆ 1989-12-10 03:07:00 ┆ 1 ┆ 2 ┆ 1 │
│ 2 ┆ 1989-12-10 03:07:00 ┆ 1989-12-07 03:07:00 ┆ 1989-12-10 03:07:00 ┆ 0 ┆ 1 ┆ 0 │
└────────────┴─────────────────────┴─────────────────────┴─────────────────────┴──────┴──────┴──────┘
>>> boolean_expr_bound_sum(
... df,
... pl.col("idx").is_in([1, 4, 7]),
... "row_to_bound",
... "right",
... offset = timedelta(days=-3),
... ).drop("idx")
shape: (8, 7)
┌────────────┬─────────────────────┬─────────────────────┬─────────────────────┬──────┬──────┬──────┐
│ subject_id ┆ timestamp ┆ timestamp_at_start ┆ timestamp_at_end ┆ is_A ┆ is_B ┆ is_C │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[μs] ┆ datetime[μs] ┆ datetime[μs] ┆ i64 ┆ i64 ┆ i64 │
╞════════════╪═════════════════════╪═════════════════════╪═════════════════════╪══════╪══════╪══════╡
│ 1 ┆ 1989-12-01 12:03:00 ┆ 1989-11-28 12:03:00 ┆ 1989-12-03 13:14:00 ┆ 1 ┆ 1 ┆ 1 │
│ 1 ┆ 1989-12-03 13:14:00 ┆ 1989-11-30 13:14:00 ┆ 1989-12-03 13:14:00 ┆ 1 ┆ 1 ┆ 1 │
│ 1 ┆ 1989-12-05 15:17:00 ┆ 1989-12-02 15:17:00 ┆ 1989-12-03 13:14:00 ┆ 0 ┆ 1 ┆ 1 │
│ 2 ┆ 1989-12-02 12:03:00 ┆ 1989-11-29 12:03:00 ┆ 1989-12-04 13:14:00 ┆ 2 ┆ 1 ┆ 0 │
│ 2 ┆ 1989-12-04 13:14:00 ┆ 1989-12-01 13:14:00 ┆ 1989-12-04 13:14:00 ┆ 2 ┆ 1 ┆ 0 │
│ 2 ┆ 1989-12-06 15:17:00 ┆ 1989-12-03 15:17:00 ┆ 1989-12-04 13:14:00 ┆ 1 ┆ 0 ┆ 0 │
│ 2 ┆ 1989-12-08 16:22:00 ┆ 1989-12-05 16:22:00 ┆ 1989-12-10 03:07:00 ┆ 1 ┆ 3 ┆ 2 │
│ 2 ┆ 1989-12-10 03:07:00 ┆ 1989-12-07 03:07:00 ┆ 1989-12-10 03:07:00 ┆ 0 ┆ 2 ┆ 1 │
└────────────┴─────────────────────┴─────────────────────┴─────────────────────┴──────┴──────┴──────┘
"""
if mode not in ("bound_to_row", "row_to_bound"):
raise ValueError(f"mode {mode} invalid!")
if closed not in ("both", "none", "left", "right"):
raise ValueError(f"closed {closed} invalid!")
if offset != timedelta(0):
if offset > timedelta(0):
left_inclusive = False
if mode == "row_to_bound":
# Here, we'll be taking cumsum_at_bound - cumsum_at_row - aggd_over_offset
if closed in ("left", "both"):
right_inclusive = False
else:
right_inclusive = True
else:
# Here, we'll be taking cumsum_at_row - cumsum_at_bound + aggd_over_offset
if closed in ("right", "both"):
right_inclusive = True
else:
right_inclusive = False
else:
right_inclusive = False
if mode == "row_to_bound":
# Here, we'll be taking cumsum_at_bound - cumsum_at_row + aggd_over_offset
if closed in ("left", "both"):
left_inclusive = True
else:
left_inclusive = False
else:
# Here, we'll be taking cumsum_at_row - cumsum_at_bound - aggd_over_offset
if closed in ("right", "both"):
left_inclusive = False
else:
left_inclusive = True
aggd_over_offset = aggregate_temporal_window(
df,
TemporalWindowBounds(
left_inclusive=left_inclusive,
window_size=offset,
right_inclusive=right_inclusive,
offset=None,
),
)
cols = [c for c in df.columns if c not in {"subject_id", "timestamp"}]
cumsum_cols = {c: pl.col(c).cum_sum().over("subject_id").alias(f"{c}_cumsum_at_row") for c in cols}
df = df.with_columns(*cumsum_cols.values())
cumsum_at_boundary = {c: pl.col(f"{c}_cumsum_at_row").alias(f"{c}_cumsum_at_boundary") for c in cols}
# We need to adjust `cumsum_at_boundary` to appropriately include or exclude the boundary event.
if mode == "bound_to_row" and closed in ("left", "both"):
cumsum_at_boundary = {
c: (expr - pl.col(c)).alias(f"{c}_cumsum_at_boundary") for c, expr in cumsum_at_boundary.items()
}
elif mode == "row_to_bound" and closed not in ("right", "both"):
cumsum_at_boundary = {
c: (expr - pl.col(c)).alias(f"{c}_cumsum_at_boundary") for c, expr in cumsum_at_boundary.items()
}
timestamp_offset = pl.col("timestamp") - offset
if mode == "bound_to_row":
if closed in ("left", "both"):
timestamp_offset -= timedelta(seconds=1e-6)
else:
timestamp_offset += timedelta(seconds=1e-6)
fill_strategy = "forward"
sum_exprs = {
c: (
pl.col(f"{c}_cumsum_at_row")
- pl.col(f"{c}_cumsum_at_boundary").fill_null(strategy=fill_strategy).over("subject_id")
).alias(c)
for c in cols
}
if (closed in ("left", "none") and offset <= timedelta(0)) or offset < timedelta(0):
# If we either don't include the right endpoint due to the closed value and the lack of a positive
# offset or we have a negative offset, we want to remove the right endpoint's counts from the
# cumsum difference.
sum_exprs = {c: expr - pl.col(c) for c, expr in sum_exprs.items()}
else:
if closed in ("right", "both"):
timestamp_offset += timedelta(seconds=1e-6)
else:
timestamp_offset -= timedelta(seconds=1e-6)
fill_strategy = "backward"
sum_exprs = {
c: (
pl.col(f"{c}_cumsum_at_boundary").fill_null(strategy=fill_strategy).over("subject_id")
- pl.col(f"{c}_cumsum_at_row")
).alias(c)
for c in cols
}
if (closed in ("left", "both") and offset <= timedelta(0)) or offset < timedelta(0):
# If we either do include the left endpoint due to the closed value and the lack of a positive
# offset or we have a negative offset, we want to add in the left endpoint's (the row's) counts
# As they will always be included.
sum_exprs = {c: expr + pl.col(c) for c, expr in sum_exprs.items()}
at_boundary_df = df.filter(boundary_expr).select(
"subject_id",
pl.col("timestamp").alias("timestamp_at_boundary"),
timestamp_offset.alias("timestamp"),
*cumsum_at_boundary.values(),
pl.lit(False).alias("is_real"),
)
with_at_boundary_events = (
pl.concat([df.with_columns(pl.lit(True).alias("is_real")), at_boundary_df], how="diagonal")
.sort(by=["subject_id", "timestamp"])
.select(
"subject_id",
"timestamp",
pl.col("timestamp_at_boundary").fill_null(strategy=fill_strategy).over("subject_id"),
*sum_exprs.values(),
"is_real",
)
.filter("is_real")
.drop("is_real")
)
if mode == "bound_to_row":
st_timestamp_expr = pl.col("timestamp_at_boundary")
end_timestamp_expr = pl.when(pl.col("timestamp_at_boundary").is_not_null()).then(
pl.col("timestamp") + offset
)
else:
st_timestamp_expr = pl.when(pl.col("timestamp_at_boundary").is_not_null()).then(
pl.col("timestamp") + offset
)
end_timestamp_expr = pl.col("timestamp_at_boundary")
if offset == timedelta(0):
return with_at_boundary_events.select(
"subject_id",
"timestamp",
st_timestamp_expr.alias("timestamp_at_start"),
end_timestamp_expr.alias("timestamp_at_end"),
*(pl.col(c).cast(PRED_CNT_TYPE).fill_null(0).alias(c) for c in cols),
)
if mode == "bound_to_row" and offset > timedelta(0):
def agg_offset_fn(c: str) -> pl.Expr:
return pl.col(c) + pl.col(f"{c}_in_offset_period")
elif mode == "bound_to_row" and offset < timedelta(0):
def agg_offset_fn(c: str) -> pl.Expr:
return pl.col(c) - pl.col(f"{c}_in_offset_period")
elif mode == "row_to_bound" and offset > timedelta(0):
def agg_offset_fn(c: str) -> pl.Expr:
return pl.col(c) - pl.col(f"{c}_in_offset_period")
elif mode == "row_to_bound" and offset < timedelta(0):
def agg_offset_fn(c: str) -> pl.Expr:
return pl.col(c) + pl.col(f"{c}_in_offset_period")
else:
raise ValueError(f"mode {mode} and offset {offset} invalid!")
return with_at_boundary_events.join(
aggd_over_offset,
on=["subject_id", "timestamp"],
how="left",
suffix="_in_offset_period",
).select(
"subject_id",
"timestamp",
st_timestamp_expr.alias("timestamp_at_start"),
end_timestamp_expr.alias("timestamp_at_end"),
*(agg_offset_fn(c).cast(PRED_CNT_TYPE, strict=False).fill_null(0).alias(c) for c in cols),
)
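Separately from the full implementation above, the window-selection logic in the docstring's table can be made concrete with a small pure-Python sketch. This is a hypothetical helper covering only the ``closed='both'`` case (the other ``closed`` values additionally drop one or both endpoints and can shift which boundary row is matched, as the docstring describes):

```python
def aggregate_groups_both(boundary: list[bool], i: int, mode: str) -> list[int]:
    """Row indices aggregated for row ``i`` when ``closed == 'both'`` (single subject)."""
    bound_idxs = [j for j, b in enumerate(boundary) if b]
    if mode == "bound_to_row":
        # Left endpoint: nearest boundary row at or before row i.
        left = max((j for j in bound_idxs if j <= i), default=None)
        return [] if left is None else list(range(left, i + 1))
    else:  # "row_to_bound"
        # Right endpoint: nearest boundary row at or after row i.
        right = min((j for j in bound_idxs if j >= i), default=None)
        return [] if right is None else list(range(i, right + 1))

boundary = [False, True, False, True, True, False, False]
[aggregate_groups_both(boundary, i, "bound_to_row") for i in range(7)]
# → [[], [1], [1, 2], [3], [4], [4, 5], [4, 5, 6]]
```

These outputs match the ``bound_to_row | both`` row of the table in the docstring; swapping the mode reproduces the ``row_to_bound | both`` row.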
Description
I have a use case in polars where I need to summarize a dataframe of medical patient data by computing sums of columns over dynamically sized windows of data that are bounded at one endpoint by rows where a given boolean expression evaluates to True, and at the other endpoint by a row in the dataframe, potentially with a row-based or temporal offset. This is useful in my context because, in addition to computing summary statistics over, say, the last 7 days of a patient's data, I may also want to compute summary statistics over the period of a patient's data up until the start of their hospital admission, or until their prior diagnosis of a certain kind, etc. The offset is useful when I want to summarize, for example, only up to the start of an admission that occurred more than 24 hours ago.

It is possible to do this efficiently with a manipulation of cumulative sums and a bit of event timestamp fudging to handle offsets correctly. In particular, I have an implementation leveraging the polars Python API here: https://github.com/justin13601/ESGPTTaskQuerying/blob/33c6ebbb62ce1b211ae39fb4ee84c4d7c09ec025/src/esgpt_task_querying/aggregate.py#L370C1-L1014C6
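The cumulative-sum manipulation referred to here is the standard prefix-sum identity: the sum of a column over rows ``a..b`` is ``cumsum[b] - cumsum[a - 1]``, so each dynamically sized window reduces to a difference of two running totals, one looked up at the row and one at its matched boundary. A minimal stdlib-only sketch for the "nearest boundary at or before the row, both endpoints closed" case:

```python
from itertools import accumulate

values = [1, 0, 1, 1, 1]
boundary = [False, True, False, True, False]
csum = list(accumulate(values))  # csum[i] = values[0] + ... + values[i]

def window_sum(i: int) -> int:
    """Sum of values from the nearest boundary row at/before row i through
    row i (both endpoints closed); 0 if no such boundary exists."""
    b = max((j for j in range(i + 1) if boundary[j]), default=None)
    if b is None:
        return 0
    return csum[i] - (csum[b - 1] if b > 0 else 0)

[window_sum(i) for i in range(5)]
# → [0, 0, 1, 1, 2]
```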
Note that the code linked above contains not only a working implementation in Python but also an extensive docstring with doctests outlining how my current version functions, to make this request more concrete.
But it seems to me that this capability would be useful in other settings as well, and could be a good core capability within polars. I suspect the internal implementation would look similar to how `groupby_rolling` is implemented, so I thought I would post about it here in case others have a similar request (or in case there is a more natural way to do this that I'm missing).