pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

`.list.to_struct()` has non-deterministic behavior #16450

Open kevinli1993 opened 1 month ago

kevinli1993 commented 1 month ago

Reproducible example

```python
import polars as pl

df = pl.DataFrame(dict(value=[1, 98, 2, 3, 99, 4], a=[1, 1, 1, 2, 2, 1], b=["A", "B", "A", "B", "A", "B"]))

def repeat_me(ds):
    ## Bug: The operation below is non-deterministic
    result = (
        ds
        .group_by("a", "b")
        .agg(pl.col("value").bottom_k(3))
        .with_columns(pl.col("value").list.to_struct(upper_bound=3))
        .unnest("value")
    )
    return result.shape

# Run the function 500 times to see that two behaviors are possible.
possible_shapes = set()
for _ in range(500):
    possible_shapes.add(repeat_me(df))

print(possible_shapes)
## {(4, 3), (4, 4)}
```

Log output

No response

Issue description

The bug occurs whether or not `upper_bound=3` is specified.

That is, replacing

```python
.with_columns(pl.col("value").list.to_struct(upper_bound=3))
```

with

```python
.with_columns(pl.col("value").list.to_struct())
```

will also reproduce the bug.

Expected behavior

The bug is that both of the following outputs are possible.

In my opinion, the `shape: (4, 4)` result is correct, but it is difficult to say what is expected without knowing why the non-determinism occurs in the first place.

```python
df.group_by("a", "b").agg(pl.col("value").bottom_k(3)).with_columns(
    pl.col("value").list.to_struct(upper_bound=3)
).unnest("value")
```

```
shape: (4, 3)
┌─────┬─────┬─────────┐
│ a   ┆ b   ┆ field_0 │
│ --- ┆ --- ┆ ---     │
│ i64 ┆ str ┆ i64     │
╞═════╪═════╪═════════╡
│ 2   ┆ A   ┆ 99      │
│ 1   ┆ B   ┆ 4       │
│ 1   ┆ A   ┆ 1       │
│ 2   ┆ B   ┆ 3       │
└─────┴─────┴─────────┘
```

```
shape: (4, 4)
┌─────┬─────┬─────────┬─────────┐
│ a   ┆ b   ┆ field_0 ┆ field_1 │
│ --- ┆ --- ┆ ---     ┆ ---     │
│ i64 ┆ str ┆ i64     ┆ i64     │
╞═════╪═════╪═════════╪═════════╡
│ 1   ┆ A   ┆ 1       ┆ 2       │
│ 2   ┆ A   ┆ 99      ┆ null    │
│ 1   ┆ B   ┆ 4       ┆ 98      │
│ 2   ┆ B   ┆ 3       ┆ null    │
└─────┴─────┴─────────┴─────────┘
```

Installed versions

```
--------Version info---------
Polars:               0.20.28
Index type:           UInt32
Platform:             macOS-14.4.1-arm64-arm-64bit
Python:               3.12.3 (main, Apr 12 2024, 17:16:04) [Clang 15.0.0 (clang-1500.1.0.2.5)]
----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fastexcel:
fsspec:               2024.5.0
gevent:
hvplot:
matplotlib:
nest_asyncio:
numpy:                1.26.4
openpyxl:
pandas:               2.2.2
pyarrow:              16.1.0
pydantic:
pyiceberg:
pyxlsb:
sqlalchemy:
torch:
xlsx2csv:
xlsxwriter:
```
cmdlineluser commented 1 month ago

It's happening due to `n_field_strategy`.

Because `group_by` returns the groups in random order, the length of the first list changes from run to run.

```python
df.group_by("a", "b").agg(pl.col("value").bottom_k(3))
# shape: (4, 3)
# ┌─────┬─────┬───────────┐
# │ a   ┆ b   ┆ value     │
# │ --- ┆ --- ┆ ---       │
# │ i64 ┆ str ┆ list[i64] │
# ╞═════╪═════╪═══════════╡
# │ 2   ┆ A   ┆ [99]      │
# │ 2   ┆ B   ┆ [3]       │
# │ 1   ┆ A   ┆ [1, 2]    │
# │ 1   ┆ B   ┆ [4, 98]   │
# └─────┴─────┴───────────┘

df.group_by("a", "b").agg(pl.col("value").bottom_k(3))
# shape: (4, 3)
# ┌─────┬─────┬───────────┐
# │ a   ┆ b   ┆ value     │
# │ --- ┆ --- ┆ ---       │
# │ i64 ┆ str ┆ list[i64] │
# ╞═════╪═════╪═══════════╡
# │ 1   ┆ A   ┆ [1, 2]    │
# │ 2   ┆ B   ┆ [3]       │
# │ 1   ┆ B   ┆ [4, 98]   │
# │ 2   ┆ A   ┆ [99]      │
# └─────┴─────┴───────────┘
```

You would need `n_field_strategy="max_width"`:

```python
.list.to_struct("max_width", upper_bound=3)
```
kevinli1993 commented 1 month ago

Makes sense now! I guess this means `n_field_strategy="first_non_null"` is only useful when the number of elements is known to be the same.

Maybe (?) not a bug, but we might consider adding a warning to the documentation since it's a surprising consequence.

cmdlineluser commented 1 month ago

Yeah - it is a bit of a footgun.

There was a Notes section added with an explanation - but perhaps a Warning should also be added pointing to that section.

kevinli1993 commented 1 month ago

Agreed.

Would it make sense for `.to_struct` to accept an argument `exact=n`, so that the resulting struct has exactly `n` fields (filling with null when necessary)?

It would behave similarly to `str.split` and `str.split_exact`. (Maybe we could create a new function `.to_struct_exact`.)

cmdlineluser commented 1 month ago

Yeah, something similar came up in https://github.com/pola-rs/polars/issues/15742 recently.

When `n` is known, it's effectively:

```python
pl.struct(
    field_0=pl.col("value").list.get(0),
    field_1=pl.col("value").list.get(1),
    field_2=pl.col("value").list.get(2),
)
```

But not having to type all that out seems like it would be useful.