pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Some struct functions do not work on `.list.to_struct` #16370

Open kevinli1993 opened 1 month ago

kevinli1993 commented 1 month ago

Reproducible example

import polars as pl
ds = pl.DataFrame(dict(
    A = [["A", "B", "C"], ["D"], ["E", "F"]]
))

# Produces an error:
ds.select(pl.col("A").list.to_struct("max_width").struct.field("field_0"))   # StructFieldNotFoundError: field_0
ds.select(pl.col("A").list.to_struct("max_width").struct.field("*")) # PanicException: index out of bounds: the len is 0 but the index is 0

# Works
ds.select(pl.col("A").list.to_struct("max_width").struct.json_encode())

# Works if I use another select call
ds.select(pl.col("A").list.to_struct("max_width")).select(pl.col("A").struct.field("field_0"))
ds.select(pl.col("A").list.to_struct("max_width")).select(pl.col("A").struct.field("*"))

Log output

No response

Issue description

The outputs from the successful runs are:

ds.select(pl.col("A").list.to_struct("max_width").struct.json_encode())

shape: (3, 1)
┌─────────────────────────────────┐
│ A                               │
│ ---                             │
│ str                             │
╞═════════════════════════════════╡
│ {"field_0":"A","field_1":"B","… │
│ {"field_0":"D","field_1":null,… │
│ {"field_0":"E","field_1":"F","… │
└─────────────────────────────────┘
ds.select(pl.col("A").list.to_struct("max_width")).select(pl.col("A").struct.field("*"))
shape: (3, 3)
┌─────────┬─────────┬─────────┐
│ field_0 ┆ field_1 ┆ field_2 │
│ ---     ┆ ---     ┆ ---     │
│ str     ┆ str     ┆ str     │
╞═════════╪═════════╪═════════╡
│ A       ┆ B       ┆ C       │
│ D       ┆ null    ┆ null    │
│ E       ┆ F       ┆ null    │
└─────────┴─────────┴─────────┘
ds.select(pl.col("A").list.to_struct("max_width")).select(pl.col("A").struct.field("field_0"))
shape: (3, 1)
┌─────────┐
│ field_0 │
│ ---     │
│ str     │
╞═════════╡
│ A       │
│ D       │
│ E       │
└─────────┘

Expected behavior

I expected .struct.field("*") to work directly on the output of .list.to_struct(...). Currently it only works if I put it in a separate .select() call, and it is not clear why that is needed.

Installed versions

```
--------Version info---------
Polars:               0.20.27
Index type:           UInt32
Platform:             macOS-14.4.1-arm64-arm-64bit
Python:               3.9.6 (default, Feb 3 2024, 15:58:27) [Clang 15.0.0 (clang-1500.3.9.4)]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fastexcel:
fsspec:
gevent:
hvplot:
matplotlib:
nest_asyncio:
numpy:                1.26.4
openpyxl:
pandas:
pyarrow:
pydantic:
pyiceberg:
pyxlsb:
sqlalchemy:
torch:
xlsx2csv:
xlsxwriter:
```

kevinli1993 commented 1 month ago

This seems to have something to do with setting upper_bound.

The documentation states:

When operating on a DataFrame, the schema does not need to be tracked or pre-determined, as the result will be eagerly evaluated, so you can leave this parameter unset.

but it does seem to have an effect even in eager mode:

ds.select(pl.col("A").list.to_struct(upper_bound=3).struct.field("*"))  # Works now!
shape: (3, 3)
┌─────────┬─────────┬─────────┐
│ field_0 ┆ field_1 ┆ field_2 │
│ ---     ┆ ---     ┆ ---     │
│ str     ┆ str     ┆ str     │
╞═════════╪═════════╪═════════╡
│ A       ┆ B       ┆ C       │
│ D       ┆ null    ┆ null    │
│ E       ┆ F       ┆ null    │
└─────────┴─────────┴─────────┘

However, it seems I need to guess the value of upper_bound exactly. For example, this breaks:

ds.select(pl.col("A").list.to_struct(upper_bound=999).struct.field("*"))

StructFieldNotFoundError: field_272

(In fact, the "272" in field_272 is random and differs on each run, probably due to parallelism.)

ritchie46 commented 1 month ago

This cannot be solved dynamically. Polars needs to know the data-type before running the query on the actual data. So if your upper bound is incorrect, Polars will expand fields that don't exist in the data.

There is not much we can do here.

kevinli1993 commented 1 month ago

Ah, I see. It's a consequence of LazyFrame, in that ds.with_columns(...) has semantics similar to ds.lazy().with_columns(...).collect().