pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
30.2k stars 1.95k forks source link

map_groups has inconsistent behavior between lazy and eager. It should also accept expressions. #13442

Open deanm0000 opened 10 months ago

deanm0000 commented 10 months ago

Checks

Reproducible example

df=pl.DataFrame({
    'a':[1,1,2,2,3,3],
    'b':range(6)
})
df.group_by(pl.col('a')).map_groups(lambda df: df)
# TypeError: cannot call `map_groups` when grouping by an expression
df.lazy().group_by(pl.col('a')).map_groups(lambda df: df, schema=df.schema).collect()
# seems to work
shape: (6, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 3   ┆ 4   │
│ 3   ┆ 5   │
│ 1   ┆ 0   │
│ 1   ┆ 1   │
│ 2   ┆ 2   │
│ 2   ┆ 3   │
└─────┴─────┘

But the gotcha is if the expression in the group_by makes a change then those changes aren't visible to the function such as

print(df.lazy().group_by(pl.col('a')*2).map_groups(lambda df: df, schema=df.schema).collect())
shape: (6, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 2   ┆ 2   │
│ 2   ┆ 3   │
│ 1   ┆ 0   │
│ 1   ┆ 1   │
│ 3   ┆ 4   │
│ 3   ┆ 5   │
└─────┴─────┘

Log output

No response

Issue description

group_by with agg just works, of course

Expected behavior

We can make map_groups work by having it do a with_columns before dispatching self.lgb.map_groups so essentially have it implicitly convert the above into:

(
    df.lazy()
    .with_columns(pl.col('a')*2)
    .group_by('a')
    .map_groups(lambda df: df, schema=df.schema)
    .collect()
    )
shape: (6, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 4   ┆ 2   │
│ 4   ┆ 3   │
│ 6   ┆ 4   │
│ 6   ┆ 5   │
│ 2   ┆ 0   │
│ 2   ┆ 1   │
└─────┴─────┘

Installed versions

``` --------Version info--------- Polars: 0.20.3 Index type: UInt32 Platform: Linux-5.10.102.1-microsoft-standard-WSL2-x86_64-with-glibc2.31 Python: 3.11.7 (main, Dec 19 2023, 20:42:30) [GCC 10.2.1 20210110] ----Optional dependencies---- adbc_driver_manager: cloudpickle: 3.0.0 connectorx: 0.3.2 deltalake: fsspec: 2023.10.0 gevent: hvplot: matplotlib: 3.8.1 numpy: 1.26.1 openpyxl: 3.1.2 pandas: 2.1.2 pyarrow: 14.0.1 pydantic: 2.5.2 pyiceberg: pyxlsb: 1.0.10 sqlalchemy: 2.0.23 xlsx2csv: 0.8.1 xlsxwriter: 3.1.9 ```
barak1412 commented 10 months ago

It seems like the eager version is implemented differently, and limited to string tye only in the grouping:

pub fn group_by_map_groups(
        &self,
        by: Vec<&str>,
        lambda: PyObject,
        maintain_order: bool,
    ) -> PyResult<Self>