pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

LazyFrame get struct element after aggregation fails to see struct field #11136

Closed dkumor closed 1 month ago

dkumor commented 11 months ago

Reproducible example

import polars as pl
from polars import col

pl.DataFrame([{
    "a":1,
    "b":2
}]).lazy().group_by('a').agg(
    pl.struct([col('b')]).alias('s')
).with_columns(
    col("s").list.first().struct.field('b').alias('b_val')
).collect()

Output:

Traceback (most recent call last):
  File "/Users/dkumor/pl_reproduce.py", line 9, in <module>
    ).collect()
  File "/opt/homebrew/lib/python3.10/site-packages/polars/utils/deprecation.py", line 95, in wrapper
    return function(*args, **kwargs)
  File "/opt/homebrew/lib/python3.10/site-packages/polars/lazyframe/frame.py", line 1695, in collect
    return wrap_df(ldf.collect())
exceptions.StructFieldNotFoundError: b

Error originated just after this operation:
AGGREGATE
    [col("b").as_struct().alias("s")] BY [col("a")] FROM
  DF ["a", "b"]; PROJECT */2 COLUMNS; SELECTION: "None"

Log output

no output

Issue description

Aggregating into a list of structs and then extracting a field from the first list element fails in a LazyFrame with StructFieldNotFoundError. Calling collect() before extracting the struct field succeeds.

Expected behavior

pl.DataFrame([{
    "a":1,
    "b":2
}]).lazy().group_by('a').agg(
    pl.struct([col('b')]).alias('s')
).collect().with_columns(
    col("s").list.first().struct.field('b').alias('b_val')
)

Output:

shape: (1, 3)
┌─────┬─────────────────┬───────┐
│ a   ┆ s               ┆ b_val │
│ --- ┆ ---             ┆ ---   │
│ i64 ┆ list[struct[1]] ┆ i64   │
╞═════╪═════════════════╪═══════╡
│ 1   ┆ [{2}]           ┆ 2     │
└─────┴─────────────────┴───────┘

Installed versions

```
--------Version info---------
Polars:              0.19.2
Index type:          UInt32
Platform:            macOS-13.5.2-arm64-arm-64bit
Python:              3.10.9 (main, Dec 15 2022, 17:11:09) [Clang 14.0.0 (clang-1400.0.29.202)]
----Optional dependencies----
adbc_driver_sqlite:
cloudpickle:
connectorx:
deltalake:
fsspec:              2022.11.0
matplotlib:          3.6.3
numpy:               1.25.2
pandas:              2.1.0
pyarrow:             12.0.0
pydantic:
sqlalchemy:
xlsx2csv:            0.8.1
xlsxwriter:
```

reswqa commented 11 months ago

When computing the schema of an expression in agg, we do not apply any special processing for Function and AnonymousFunction. In fact, a plain column is the only case we treat specially in AggContext. That is to say:

the schema of

pl.DataFrame([{
    "a":1,
    "b":2
}]).lazy().group_by('a').agg(
    pl.struct([col('b')]).alias('s')
)

is {'a': Int64, 's': Struct([Field('b', Int64)])}, which should actually be {'a': Int64, 's': List(Struct([Field('b', Int64)]))}.

@ritchie46 What do you think about this? How about wrapping the dtype of each expression's output field in agg in a List, except for expressions that auto-explode, like sum... 🤔
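
The proposed rule can be sketched as a toy model in plain Python. This is only an illustration of the idea, not Polars internals: the dtype classes, the `AUTO_EXPLODE` set, and `agg_output_dtype` are all hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scalar:
    name: str


@dataclass(frozen=True)
class StructOf:
    fields: tuple  # tuple of (field_name, dtype) pairs


@dataclass(frozen=True)
class ListOf:
    inner: object


# Hypothetical set of operations that reduce each group to a single value.
AUTO_EXPLODE = {"sum", "min", "max", "mean"}


def agg_output_dtype(expr_dtype, op):
    """Dtype of one agg() output column under the proposed rule."""
    if op in AUTO_EXPLODE:
        # One value per group: the dtype is kept as-is.
        return expr_dtype
    # Otherwise every group yields a list of values of that dtype.
    return ListOf(expr_dtype)


int64 = Scalar("Int64")
struct_b = StructOf((("b", int64),))

print(agg_output_dtype(struct_b, "as_struct"))  # wrapped in ListOf
print(agg_output_dtype(int64, "sum"))           # left unwrapped
```

Under this rule the `s` column in the example above would be planned as a list of structs, so `list.first().struct.field('b')` could be resolved at planning time.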

dkumor commented 1 month ago

This has been fixed. Closing.