Open kevinli1993 opened 1 month ago
This seems to have something to do with setting upper_bound.
The documentation states:
When operating on a DataFrame, the schema does not need to be tracked or pre-determined, as the result will be eagerly evaluated, so you can leave this parameter unset.
but it does seem to have an effect even in eager mode:
ds.select(pl.col("A").list.to_struct(upper_bound=3).struct.field("*")) # Works now!
shape: (3, 3)
┌─────────┬─────────┬─────────┐
│ field_0 ┆ field_1 ┆ field_2 │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═════════╪═════════╪═════════╡
│ A ┆ B ┆ C │
│ D ┆ null ┆ null │
│ E ┆ F ┆ null │
└─────────┴─────────┴─────────┘
However, it seems like I need to "correctly" guess the value of upper_bound, e.g. this will break:
ds.select(pl.col("A").list.to_struct(upper_bound=999).struct.field("*"))
StructFieldNotFoundError: field_272
(and in fact, the "272" in field_272 is random; it's different each time, probably due to parallelism).
This cannot be solved dynamically. Polars needs to know the data-type before running the query on the actual data. So if your upper bound is incorrect, Polars will expand fields that don't exist in the data.
There is not much we can do here.
Ah I see - it's a consequence of LazyFrame, in that ds.with_columns(...) has similar semantics to ds.lazy().with_columns(...).collect()
Checks
Reproducible example
Log output
No response
Issue description
The outputs from the successful runs are:
Expected behavior
The expected behavior is that .struct.field("*") would work directly on the output of .to_struct(...). Now, it works if I use another .select() call, but it is not clear why that is needed.
Installed versions