Open kevinli1993 opened 1 month ago
It's happening due to n_field_strategy.
Because group_by returns rows in a random order, the length of the first list changes.
df.group_by("a", "b").agg(pl.col("value").bottom_k(3))
# shape: (4, 3)
# ┌─────┬─────┬───────────┐
# │ a ┆ b ┆ value │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ str ┆ list[i64] │
# ╞═════╪═════╪═══════════╡
# │ 2 ┆ A ┆ [99] │
# │ 2 ┆ B ┆ [3] │
# │ 1 ┆ A ┆ [1, 2] │
# │ 1 ┆ B ┆ [4, 98] │
# └─────┴─────┴───────────┘
df.group_by("a", "b").agg(pl.col("value").bottom_k(3))
# shape: (4, 3)
# ┌─────┬─────┬───────────┐
# │ a ┆ b ┆ value │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ str ┆ list[i64] │
# ╞═════╪═════╪═══════════╡
# │ 1 ┆ A ┆ [1, 2] │
# │ 2 ┆ B ┆ [3] │
# │ 1 ┆ B ┆ [4, 98] │
# │ 2 ┆ A ┆ [99] │
# └─────┴─────┴───────────┘
You would need n_field_strategy="max_width"
.list.to_struct("max_width", upper_bound=3))
Makes sense now! I guess this means n_field_strategy="first_non_null" is only useful when the number of elements is known to be the same.
Maybe (?) not a bug, but we might consider adding a warning to the documentation since it's a surprising consequence.
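To make the explanation above concrete, here is a minimal plain-Python sketch of how the two strategies would pick the struct width from the two row orders shown earlier. It is only a model of the behaviour described in this thread, not Polars' actual implementation; the helper names `infer_width` and `to_rows` are made up for illustration, and the truncation of longer lists is an assumption of the model.

```python
def infer_width(lists, strategy):
    """Model of how the number of struct fields is chosen (illustrative only)."""
    if strategy == "first_non_null":
        # Width comes from the first non-null list, so it depends on row order.
        for item in lists:
            if item is not None:
                return len(item)
        return 0
    if strategy == "max_width":
        # Width comes from the longest list, so row order does not matter.
        return max((len(item) for item in lists if item is not None), default=0)
    raise ValueError(f"unknown strategy: {strategy}")


def to_rows(lists, width):
    """Pad short lists with None and truncate long ones to `width` fields."""
    return [
        None if item is None else tuple((list(item) + [None] * width)[:width])
        for item in lists
    ]


# The same four groups, in the two row orders seen in the outputs above.
order_1 = [[99], [3], [1, 2], [4, 98]]
order_2 = [[1, 2], [3], [4, 98], [99]]

width_1 = infer_width(order_1, "first_non_null")  # first list has 1 element
width_2 = infer_width(order_2, "first_non_null")  # first list has 2 elements
rows_1 = to_rows(order_1, width_1)
rows_2 = to_rows(order_2, width_2)
```

With "first_non_null" the inferred width differs between the two orders, while "max_width" gives the same width for both, which matches the non-determinism reported in this issue.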
Yeah - it is a bit of a footgun.
There was a Notes section added with an explanation - but perhaps a Warning should also be added pointing to that section.
Agreed.
Would it make sense for .to_struct to accept an argument exact=n, so that the resulting struct has exactly n fields (filling with null when necessary)?
It would behave similarly to str.split and str.split_exact. (Maybe we could create a new function .to_struct_exact.)
Yeah, something similar came up in https://github.com/pola-rs/polars/issues/15742 recently.
When n is known, it's effectively:
pl.struct(
field_0 = pl.col("value").list.get(0),
field_1 = pl.col("value").list.get(1),
field_2 = pl.col("value").list.get(2)
)
But not having to type all that out seems like it would be useful.
Checks
Reproducible example
Log output
No response
Issue description
The bug occurs whether or not upper_bound=3 is specified. That is, replacing
with
will also reproduce the bug.
Expected behavior
The bug is that both of the following outputs are possible.
In my opinion, the shape: (4, 4) ... result is correct, but it is difficult to say what's expected without knowing why the non-determinism occurs in the first place.
Installed versions