pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Cannot use `.list` namespace functions in GroupBy context #14538

Open mcrumiller opened 7 months ago

mcrumiller commented 7 months ago

Reproducible example

import polars as pl

df = pl.DataFrame({
    "A": [1, 1, 2, 2],
    "B": [1, 2, 3, 4],
})
df.group_by("A").agg(pl.col("B").list.to_array(2))

Log output

polars.exceptions.ComputeError: expected List dtype

Error originated just after this operation:
DF ["A", "B"]; PROJECT */2 COLUMNS; SELECTION: "None"

Issue description

When aggregating a column in a GroupBy context, the `.list` namespace is not usable: calling a `.list` function on the column raises `ComputeError: expected List dtype`.

Expected behavior

It should be possible to call `.list` namespace functions inside `agg`.

Installed versions

```
--------Version info---------
Polars:               0.20.8
Index type:           UInt32
Platform:             Linux-5.15.133.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Python:               3.12.1 (main, Jan 31 2024, 09:51:46) [GCC 11.4.0]

----Optional dependencies----
adbc_driver_manager:  0.9.0
cloudpickle:          3.0.0
connectorx:
deltalake:            0.15.3
fsspec:               2023.12.2
gevent:               24.2.1
hvplot:               0.9.2
matplotlib:           3.8.2
numpy:                1.26.4
openpyxl:             3.1.2
pandas:               2.2.0
pyarrow:              15.0.0
pydantic:             2.6.1
pyiceberg:            0.5.1
pyxlsb:               1.0.10
sqlalchemy:           2.0.27
xlsx2csv:             0.8.1
xlsxwriter:           3.1.9
```

Julian-J-S commented 7 months ago

Yes, I was looking for a good solution here, but I think there is a "problem" in how polars handles the agg context, which can be solved in two ways:

  1. more "hacks" (somehow make it work in the agg context)
  2. rethink the automatic list conversion that happens when you use pl.col(...), which is a big problem imo

Auto list cast of pl.col(...) in agg context

This has often been discussed, and I personally think it is a bad idea. This "bug" is just one example.

The user should be responsible for reducing the data in the aggregation to a single row per group. I think auto casting pl.col to a list is problematic in many ways, and this issue is one of them.

The problem is a confusing inconsistency between the select and agg contexts. They are very similar in many ways, but agg has quirks like this one that make it confusing. Why not make them identical in terms of user experience?

example:

df = pl.DataFrame({"A": [1, 1, 1], "B": [1, 2, 4]})

Select

df.select(
    min=pl.col("B").min(),
    max=pl.col("B").max(),
    mean=pl.col("B").mean(),
    implode=pl.col("B").implode(),                         # explicit `implode` required!
    implode_array=pl.col("B").implode().list.to_array(3),  # ".list" namespace available
)
shape: (1, 5)
┌─────┬─────┬──────────┬───────────┬───────────────┐
│ min ┆ max ┆ mean     ┆ implode   ┆ implode_array │
│ --- ┆ --- ┆ ---      ┆ ---       ┆ ---           │
│ i64 ┆ i64 ┆ f64      ┆ list[i64] ┆ array[i64, 3] │
╞═════╪═════╪══════════╪═══════════╪═══════════════╡
│ 1   ┆ 4   ┆ 2.333333 ┆ [1, 2, 4] ┆ [1, 2, 4]     │
└─────┴─────┴──────────┴───────────┴───────────────┘

group_by / agg

df.group_by("A").agg(
    min=pl.col("B").min(),
    max=pl.col("B").max(),
    mean=pl.col("B").mean(),
    col=pl.col("B"),  # imo this should not work! No auto cast to list
    implode=pl.col("B").implode(),  # this should create a list (not list of list)
    implode_array=pl.col("B").implode().list.to_array(3),  # this should work like in select
)
shape: (1, 7)
┌─────┬─────┬─────┬──────────┬───────────┬─────────────────┬─────────────────────┐
│ A   ┆ min ┆ max ┆ mean     ┆ col       ┆ implode         ┆ implode_array       │
│ --- ┆ --- ┆ --- ┆ ---      ┆ ---       ┆ ---             ┆ ---                 │
│ i64 ┆ i64 ┆ i64 ┆ f64      ┆ list[i64] ┆ list[list[i64]] ┆ list[array[i64, 3]] │
╞═════╪═════╪═════╪══════════╪═══════════╪═════════════════╪═════════════════════╡
│ 1   ┆ 1   ┆ 4   ┆ 2.333333 ┆ [1, 2, 4] ┆ [[1, 2, 4]]     ┆ [[1, 2, 4]]         │
└─────┴─────┴─────┴──────────┴───────────┴─────────────────┴─────────────────────┘