pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

list.set_intersection operates on the wrong columns when multiple columns are selected with pl.col #18795

Closed caleb-lindgren closed 1 week ago

caleb-lindgren commented 1 month ago

Reproducible example

import polars as pl

df = pl.LazyFrame({
    "a": [
        [1, 2],
        [1, 2],
    ],
    "b": [
        [1, 3],
        [2, 3],
    ],
    "c": [
        [1, 2],
        [1, 2],
    ]
})

separate = (
    df.with_columns(pl.col("a").list.set_intersection("c"))
    .with_columns(pl.col("b").list.set_intersection("c"))
)

together = (
    df.with_columns(pl.col("a", "b").list.set_intersection("c"))
)

# Print the base dataframe
print(df.collect())

print("-" * 50)

# Print the result of performing the operations separately
print(separate)
print(separate.collect())

print("-" * 50)

# Print the result of trying to perform the operations together
print(together)
print(together.collect())

Output:

shape: (2, 3)
┌───────────┬───────────┬───────────┐
│ a         ┆ b         ┆ c         │
│ ---       ┆ ---       ┆ ---       │
│ list[i64] ┆ list[i64] ┆ list[i64] │
╞═══════════╪═══════════╪═══════════╡
│ [1, 2]    ┆ [1, 3]    ┆ [1, 2]    │
│ [1, 2]    ┆ [2, 3]    ┆ [1, 2]    │
└───────────┴───────────┴───────────┘
--------------------------------------------------
naive plan: (run LazyFrame.explain(optimized=True) to see the optimized plan)

 WITH_COLUMNS:
 [col("b").list.intersection([col("c")])]
   WITH_COLUMNS:
   [col("a").list.intersection([col("c")])]
    DF ["a", "b", "c"]; PROJECT */3 COLUMNS; SELECTION: None
shape: (2, 3)
┌───────────┬───────────┬───────────┐
│ a         ┆ b         ┆ c         │
│ ---       ┆ ---       ┆ ---       │
│ list[i64] ┆ list[i64] ┆ list[i64] │
╞═══════════╪═══════════╪═══════════╡
│ [1, 2]    ┆ [1]       ┆ [1, 2]    │
│ [1, 2]    ┆ [2]       ┆ [1, 2]    │
└───────────┴───────────┴───────────┘
--------------------------------------------------
naive plan: (run LazyFrame.explain(optimized=True) to see the optimized plan)

 WITH_COLUMNS:
 [col("a").list.intersection([col("b"), col("c")])]
  DF ["a", "b", "c"]; PROJECT */3 COLUMNS; SELECTION: None
shape: (2, 3)
┌───────────┬───────────┬───────────┐
│ a         ┆ b         ┆ c         │
│ ---       ┆ ---       ┆ ---       │
│ list[i64] ┆ list[i64] ┆ list[i64] │
╞═══════════╪═══════════╪═══════════╡
│ [1]       ┆ [1, 3]    ┆ [1, 2]    │
│ [2]       ┆ [2, 3]    ┆ [1, 2]    │
└───────────┴───────────┴───────────┘

Log output

No response

Issue description

I am trying to replace a with a ∩ c and b with b ∩ c. If I perform these two operations separately, it works. However, if I use pl.col to select both a and b and perform both operations at once, a is instead replaced with a ∩ b ∩ c and b is left unchanged.
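To make the difference concrete, here is a minimal sketch that models the two behaviours on plain Python sets, using the data from the reproducible example. It does not call Polars, and the helper name `intersect` is hypothetical, introduced only for illustration.

```python
# Plain-Python model of the data in the report; this is not Polars code.
rows = {
    "a": [[1, 2], [1, 2]],
    "b": [[1, 3], [2, 3]],
    "c": [[1, 2], [1, 2]],
}


def intersect(*columns):
    """Row-wise set intersection of several list columns (hypothetical helper).

    Each result row is returned as a sorted list.
    """
    return [sorted(set.intersection(*map(set, row))) for row in zip(*columns)]


# Intended: each selected column is intersected with "c" on its own.
expected = {
    "a": intersect(rows["a"], rows["c"]),
    "b": intersect(rows["b"], rows["c"]),
    "c": rows["c"],
}

# Reported: "a" is intersected with both "b" and "c"; "b" is left unchanged.
observed = {
    "a": intersect(rows["a"], rows["b"], rows["c"]),
    "b": rows["b"],
    "c": rows["c"],
}

print(expected)
print(observed)
```

The `expected` dictionary matches the output of the separate operations, and `observed` matches the output of the combined pl.col("a", "b") call shown above.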

Expected behavior

When I use pl.col to select both a and b and intersect each with c, it should behave the same as when I compute the two intersections separately.

Installed versions

```
--------Version info---------
Polars:              1.7.1
Index type:          UInt32
Platform:            Linux-6.10.9-artix1-2-x86_64-with-glibc2.40
Python:              3.11.5 (main, Oct 18 2023, 09:37:15) [GCC 13.2.1 20230801]

----Optional dependencies----
adbc_driver_manager
altair               5.1.2
cloudpickle
connectorx
deltalake
fastexcel
fsspec
gevent
great_tables
matplotlib           3.8.0
nest_asyncio         1.5.8
numpy                1.26.1
openpyxl             3.1.2
pandas               2.1.1
pyarrow              16.0.0
pydantic
pyiceberg
sqlalchemy
torch
xlsx2csv
xlsxwriter
```

cmdlineluser commented 1 month ago

Can reproduce.

It may be easier to see the bug using select, since b disappears from the result.

df.select(pl.col("a", "b").list.set_intersection("c")).collect()
# shape: (2, 1)
# ┌───────────┐
# │ a         │
# │ ---       │
# │ list[i64] │
# ╞═══════════╡
# │ [1]       │
# │ [2]       │
# └───────────┘

It also happens with DataFrames, so it doesn't appear to be an optimizer issue.