pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Coalescing outer join panics (and/or loses columns from the right frame) if join key expressions have overlapping names #16289

Closed wence- closed 4 weeks ago

wence- commented 1 month ago

Reproducible example

import polars as pl
left = pl.DataFrame({"a": [1, 2, 3], "b": [3, 4, 5], "c": [5, 6, 7]})
right = pl.DataFrame({"a": [2, 3, 4], "c": [4, 5, 6]})
left.join(right, on=[pl.col("a")], how="outer_coalesce")
# shape: (4, 4)
# ┌─────┬──────┬──────┬─────────┐
# │ a   ┆ b    ┆ c    ┆ c_right │
# │ --- ┆ ---  ┆ ---  ┆ ---     │
# │ i64 ┆ i64  ┆ i64  ┆ i64     │
# ╞═════╪══════╪══════╪═════════╡
# │ 2   ┆ 4    ┆ 6    ┆ 4       │
# │ 3   ┆ 5    ┆ 7    ┆ 5       │
# │ 4   ┆ null ┆ null ┆ 6       │
# │ 1   ┆ 3    ┆ 5    ┆ null    │
# └─────┴──────┴──────┴─────────┘

# nonsensical, but ok
left.join(right, on=[pl.col("a"), pl.col("a")], how="outer_coalesce")
# shape: (4, 3)
# ┌─────┬──────┬──────┐
# │ a   ┆ b    ┆ c    │
# │ --- ┆ ---  ┆ ---  │
# │ i64 ┆ i64  ┆ i64  │
# ╞═════╪══════╪══════╡
# │ 2   ┆ 4    ┆ 6    │
# │ 3   ┆ 5    ┆ 7    │
# │ 4   ┆ null ┆ null │
# │ 1   ┆ 3    ┆ 5    │
# └─────┴──────┴──────┘

# even more nonsensical: the same key three times, which panics
left.join(right, on=[pl.col("a"), pl.col("a"), pl.col("a")], how="outer_coalesce")
# thread '<unnamed>' panicked at crates/polars-ops/src/frame/join/general.rs:90:25:
# removal index (is 3) should be < len (is 3)

Log output

run JoinExec
join parallel: true
OUTER join dataframes finished
run JoinExec
join parallel: true
OUTER join dataframes finished
run JoinExec
join parallel: true

Issue description

Looks like coalescing outer join just attempts to eat as many columns from the right dataframe as there are key columns in the join.
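To make that hypothesis concrete, here is a toy model in plain Python. It is not polars' actual code: the function name `coalesce_model` and the choice to drop one column per join key at the first right-hand position are assumptions, picked only because they reproduce the three results reported above.

```python
# Toy model of the suspected behaviour (an assumption, not polars' implementation):
# after the outer join, drop one column per join key, always at the position
# where the right frame's columns begin.

LEFT = ["a", "b", "c"]
RIGHT = ["a", "c"]


def coalesce_model(left_cols, right_cols, n_keys, suffix="_right"):
    """Return the output column names the toy model would produce."""
    # Joined frame before coalescing: left columns, then right columns,
    # with clashing right-hand names suffixed.
    out = list(left_cols) + [
        name + suffix if name in left_cols else name for name in right_cols
    ]
    drop_at = len(left_cols)  # index of the first right-hand column
    for _ in range(n_keys):
        if drop_at >= len(out):
            # Mirrors the wording of the reported panic message.
            raise IndexError(
                f"removal index (is {drop_at}) should be < len (is {len(out)})"
            )
        out.pop(drop_at)
    return out


print(coalesce_model(LEFT, RIGHT, 1))  # one key
print(coalesce_model(LEFT, RIGHT, 2))  # the same key twice
```

With one key the model keeps `c_right`, with two it also drops `c_right`, and with three it fails with the same index and length as the panic in the reproducer.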

Expected behavior

I would expect all three of these join expressions, which are mathematically equivalent (even if the latter two are odd), to give me the same result.

Alternatively, polars could complain that the join is going to produce overlapping output key names.

Installed versions

```
--------Version info---------
Polars: 0.20.26
Index type: UInt32
Platform: Linux-6.5.0-35-generic-x86_64-with-glibc2.35
Python: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0]
----Optional dependencies----
adbc_driver_manager: 0.11.0
cloudpickle: 3.0.0
connectorx: 0.3.3
deltalake: 0.17.4
fastexcel: 0.10.4
fsspec: 2024.3.1
gevent: 24.2.1
hvplot: 0.10.0
matplotlib: 3.8.4
nest_asyncio: 1.6.0
numpy: 1.26.4
openpyxl: 3.1.2
pandas: 2.2.2
pyarrow: 16.0.0
pydantic: 2.7.1
pyiceberg:
pyxlsb: 1.0.10
sqlalchemy: 2.0.30
torch: 2.3.0.post300
xlsx2csv: 0.8.2
xlsxwriter: 3.2.0
```

ritchie46 commented 4 weeks ago

Does this happen if we don't join on the same name twice? We should raise an error, as it doesn't make sense to join on duplicate columns.
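A minimal sketch of what such a check could look like on the caller's side, assuming the join keys are given as plain column-name strings. The helper `check_unique_join_keys` is hypothetical and not part of polars; for expression keys, the names could be obtained first with `Expr.meta.output_name()`.

```python
from collections import Counter


def check_unique_join_keys(keys):
    """Hypothetical guard: reject join key lists that repeat a name."""
    # Count each key name and collect the ones that appear more than once.
    dupes = sorted(name for name, count in Counter(keys).items() if count > 1)
    if dupes:
        raise ValueError(f"join keys must be unique, got duplicates: {dupes}")
    return list(keys)


check_unique_join_keys(["a", "c"])  # passes, returns the keys unchanged

try:
    check_unique_join_keys(["a", "a", "a"])
except ValueError as err:
    print(err)
```

Running the guard before calling `join` turns the silent column loss and the panic into an ordinary Python exception.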

wence- commented 3 weeks ago

Thanks!