pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Eager/Lazy API alignment: LazyGroupBy vs DynamicGroupBy #17916

Open tritemio opened 1 month ago

tritemio commented 1 month ago

Description

The object returned by group_by_dynamic has different methods in lazy vs eager mode.

Specifically, group_by_dynamic returns a LazyGroupBy for a LazyFrame and a DynamicGroupBy for an eager DataFrame. These two objects have slightly different APIs.

So, for example, it is possible to call .len() on the lazy version but not on the eager version. Here is an example:


from datetime import datetime

import polars as pl

df = pl.DataFrame(
    {
        "time": pl.datetime_range(
            start=datetime(2024, 6, 30),
            end=datetime(2024, 7, 16),
            interval="1d",
            eager=True,
        ),
    }
).with_columns(n=pl.int_range(pl.len()))

# len() in lazy mode: OK
x = df.lazy().group_by_dynamic("time", every="1w").len(name="count").collect()
print(x)
# shape: (4, 2)
# ┌─────────────────────┬───────┐
# │ time                ┆ count │
# │ ---                 ┆ ---   │
# │ datetime[μs]        ┆ u32   │
# ╞═════════════════════╪═══════╡
# │ 2024-06-24 00:00:00 ┆ 1     │
# │ 2024-07-01 00:00:00 ┆ 7     │
# │ 2024-07-08 00:00:00 ┆ 7     │
# │ 2024-07-15 00:00:00 ┆ 2     │
# └─────────────────────┴───────┘

# len() in eager mode: ERROR
x = df.group_by_dynamic("time", every="1w").len(name="count")
print(x)
# Traceback (most recent call last):
#   File "/Users/anto/src/poste/sda-poste-logistics/script/polars_bug_groupby_dynamic_len.py", line 26, in <module>
#     x = df.group_by_dynamic("time", every="1w").len(name="count")
#         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# AttributeError: 'DynamicGroupBy' object has no attribute 'len'

There appears to be a similar issue with the .reasample() method, as noted in the original report on SO: https://stackoverflow.com/questions/78714929/polars-group-by-dynamic-and-len-lazy-vs-eager?noredirect=1#comment138785003_78714929

Shouldn't these two APIs be aligned? Or is this difference a design choice?

MarcoGorelli commented 1 month ago

thanks @tritemio for the report - I think they should be aligned, will come back to this

ritchie46 commented 3 weeks ago

We should align the API. Eager should be able to do this.