pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

In aggregation, `drop_nulls().get(<expression>)` produces list instead of scalar #19363

Open Gattocrucco opened 1 week ago

Gattocrucco commented 1 week ago

Reproducible example

import polars as pl

df = pl.DataFrame(dict(a=[1,1,2,2], b=[6,7,8,9], c=[0,0,0,0]))

print(df.group_by('a').agg(pl.col('b').get(0)))  # ok
print(df.group_by('a').agg(pl.col('b').get(pl.len() // 100)))  # ok
print(df.group_by('a').agg(pl.col('b').drop_nulls().get(0)))  # ok
print(df.group_by('a').agg(pl.col('b').drop_nulls().get(pl.len() // 100)))  # bug: list[i64] instead of i64
print(df.group_by('a').agg(pl.col('b').drop_nulls().get(pl.col('c').first())))  # bug: list[i64] instead of i64
print(df.group_by('a').agg(pl.col('b').drop_nulls().implode().list.get(pl.len() // 100).first()))  # workaround
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 6   │
│ 2   ┆ 8   │
└─────┴─────┘
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 6   │
│ 2   ┆ 8   │
└─────┴─────┘
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 2   ┆ 8   │
│ 1   ┆ 6   │
└─────┴─────┘
shape: (2, 2)
┌─────┬───────────┐
│ a   ┆ b         │
│ --- ┆ ---       │
│ i64 ┆ list[i64] │
╞═════╪═══════════╡
│ 2   ┆ [8, 8]    │
│ 1   ┆ [6, 6]    │
└─────┴───────────┘
shape: (2, 2)
┌─────┬───────────┐
│ a   ┆ b         │
│ --- ┆ ---       │
│ i64 ┆ list[i64] │
╞═════╪═══════════╡
│ 1   ┆ [6, 6]    │
│ 2   ┆ [8, 8]    │
└─────┴───────────┘
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 2   ┆ 8   │
│ 1   ┆ 6   │
└─────┴─────┘

Log output

keys/aggregates are not partitionable: running default HASH AGGREGATION
keys/aggregates are not partitionable: running default HASH AGGREGATION
keys/aggregates are not partitionable: running default HASH AGGREGATION
keys/aggregates are not partitionable: running default HASH AGGREGATION
keys/aggregates are not partitionable: running default HASH AGGREGATION
keys/aggregates are not partitionable: running default HASH AGGREGATION

Issue description

It seems the per-group result is repeated inside a list whose length equals the number of groups, as if a cross-product had been taken.

My code worked fine in a pre-1.0 version of polars.
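
If the repetition really tracks the number of groups, a third group should yield 3-element lists. A minimal sketch to test that hypothesis (the three-group data is made up for illustration; its output is not part of this report):

```python
import polars as pl

# Hypothetical data with three groups instead of two.
df3 = pl.DataFrame(dict(a=[1, 1, 2, 2, 3, 3], b=[6, 7, 8, 9, 10, 11]))

# If the cross-product-like hypothesis holds, each `b` entry would be a 3-element list here.
print(df3.group_by('a').agg(pl.col('b').drop_nulls().get(pl.len() // 100)))
```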

Expected behavior

`df.group_by('a').agg(pl.col('b').drop_nulls().get(pl.len() // 100))` should produce a scalar column, not a list column.
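
Concretely, since `b` contains no nulls and `pl.len() // 100` evaluates to 0 for these 2-row groups, the expected result is the same as the plain `.get(0)` output shown above (row order may differ, as `group_by` does not maintain order by default):

```
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 6   │
│ 2   ┆ 8   │
└─────┴─────┘
```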

Installed versions

```
--------Version info---------
Polars:               1.10.0
Index type:           UInt32
Platform:             macOS-14.7-arm64-arm-64bit
Python:               3.12.7 | packaged by conda-forge | (main, Oct 4 2024, 15:57:01) [Clang 17.0.6 ]
LTS CPU:              False

----Optional dependencies----
adbc_driver_manager
altair
cloudpickle           3.1.0
connectorx
deltalake
fastexcel
fsspec
gevent
great_tables
matplotlib            3.9.2
nest_asyncio
numpy                 1.26.4
openpyxl
pandas                2.2.3
pyarrow               17.0.0
pydantic
pyiceberg
sqlalchemy
torch
xlsx2csv
xlsxwriter
```
cmdlineluser commented 1 week ago

Seems to happen with any expr that can modify the length: `drop_nulls`, `drop_nans`, `arg_true`, etc.

df.group_by('a').agg((pl.col.b == pl.col.b).arg_true().get(pl.col.c.first()))
# shape: (2, 2)
# ┌─────┬───────────┐
# │ a   ┆ b         │
# │ --- ┆ ---       │
# │ i64 ┆ list[u32] │
# ╞═════╪═══════════╡
# │ 2   ┆ [0, 0]    │
# │ 1   ┆ [0, 0]    │
# └─────┴───────────┘

Slicing seems to be another possible workaround:

df.group_by('a').agg(pl.col.b.drop_nulls().slice(pl.col.c.first(), 1).first())
# shape: (2, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 2   ┆ 8   │
# │ 1   ┆ 6   │
# └─────┴─────┘
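
Applying the same slice-based idea to the original expression from the report would look roughly like this (a sketch only; `slice` taking an expression offset is assumed based on the example above, and this has not been verified against the reported version):

```python
import polars as pl

df = pl.DataFrame(dict(a=[1, 1, 2, 2], b=[6, 7, 8, 9], c=[0, 0, 0, 0]))

# Replace `.get(<expr>)` with `.slice(<expr>, 1).first()` to get a scalar per group.
print(df.group_by('a').agg(pl.col('b').drop_nulls().slice(pl.len() // 100, 1).first()))
```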