pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
29.22k stars 1.84k forks source link

`.struct.field()` after `shuffle()` seems to produce incorrect results #18234

Open cmdlineluser opened 4 weeks ago

cmdlineluser commented 4 weeks ago

Checks

Reproducible example

import polars as pl

df = pl.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3]})

Log output

No response

Issue description

Noticed while experimenting with https://github.com/pola-rs/polars/issues/18233

Using a struct should keep each "row" intact when shuffling:

df.select(pl.struct(pl.all()).shuffle())
# shape: (3, 1)
# ┌───────────┐
# │ a         │
# │ ---       │
# │ struct[2] │
# ╞═══════════╡
# │ {3,3}     │
# │ {1,1}     │
# │ {2,2}     │
# └───────────┘

Adding field extraction seems to break this:

df.select(pl.struct(pl.all()).shuffle().struct.field("*"))
# shape: (3, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 3   │
# │ 2   ┆ 2   │
# │ 3   ┆ 1   │
# └─────┴─────┘

It should be equivalent to unnest in this case?

df.select(pl.struct(pl.all()).shuffle()).unnest("a")
# shape: (3, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 2   ┆ 2   │
# │ 3   ┆ 3   │
# │ 1   ┆ 1   │
# └─────┴─────┘

Expected behavior

field extraction and unnest should produce the same result?

Installed versions

``` --------Version info--------- Polars: 1.5.0 Index type: UInt32 Platform: macOS-13.6.1-arm64-arm-64bit Python: 3.12.2 (main, Feb 6 2024, 20:19:44) [Clang 15.0.0 (clang-1500.1.0.2.5)] ----Optional dependencies---- adbc_driver_manager: cloudpickle: connectorx: deltalake: fastexcel: fsspec: gevent: great_tables: hvplot: matplotlib: nest_asyncio: numpy: 1.26.4 openpyxl: pandas: 2.2.1 pyarrow: 15.0.2 pydantic: pyiceberg: sqlalchemy: torch: xlsx2csv: xlsxwriter: ```
ritchie46 commented 3 weeks ago

Right, field expansions writes expanded expressions. This is correct, but now we get the effects of calling shuffle multiple times.

I think we must assign a seed upon shuffle expression creation to circumvent these problems. We could do that implictly.

ritchie46 commented 3 weeks ago

Hmm... it is more complicated. As we want to reseed groups in a group by. :thinking: