Closed wence- closed 4 weeks ago
import polars as pl

left = pl.DataFrame({"a": [1, 2, 3], "b": [3, 4, 5], "c": [5, 6, 7]})
right = pl.DataFrame({"a": [2, 3, 4], "c": [4, 5, 6]})

left.join(right, on=[pl.col("a")], how="outer_coalesce")
# shape: (4, 4)
# ┌─────┬──────┬──────┬─────────┐
# │ a   ┆ b    ┆ c    ┆ c_right │
# │ --- ┆ ---  ┆ ---  ┆ ---     │
# │ i64 ┆ i64  ┆ i64  ┆ i64     │
# ╞═════╪══════╪══════╪═════════╡
# │ 2   ┆ 4    ┆ 6    ┆ 4       │
# │ 3   ┆ 5    ┆ 7    ┆ 5       │
# │ 4   ┆ null ┆ null ┆ 6       │
# │ 1   ┆ 3    ┆ 5    ┆ null    │
# └─────┴──────┴──────┴─────────┘

# nonsensical, but ok
left.join(right, on=[pl.col("a"), pl.col("a")], how="outer_coalesce")
# shape: (4, 3)
# ┌─────┬──────┬──────┐
# │ a   ┆ b    ┆ c    │
# │ --- ┆ ---  ┆ ---  │
# │ i64 ┆ i64  ┆ i64  │
# ╞═════╪══════╪══════╡
# │ 2   ┆ 4    ┆ 6    │
# │ 3   ┆ 5    ┆ 7    │
# │ 4   ┆ null ┆ null │
# │ 1   ┆ 3    ┆ 5    │
# └─────┴──────┴──────┘

# even more
left.join(right, on=[pl.col("a"), pl.col("a"), pl.col("a")], how="outer_coalesce")
# thread '<unnamed>' panicked at crates/polars-ops/src/frame/join/general.rs:90:25:
# removal index (is 3) should be < len (is 3)
run JoinExec
join parallel: true
OUTER join dataframes finished
run JoinExec
join parallel: true
OUTER join dataframes finished
run JoinExec
join parallel: true
Looks like coalescing outer join just attempts to eat as many columns from the right dataframe as there are key columns in the join.
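To make that guess concrete, here is a minimal pure-Python model of it. This is not the polars implementation: `coalesce_columns_sketch` and its "pop the first right-hand column once per key" rule are assumptions, chosen only because they reproduce the three outcomes in the example (four columns, three columns, then an out-of-range removal at index 3 of a length-3 list).

```python
def coalesce_columns_sketch(left_cols, right_cols, n_keys):
    """Hypothetical model of the observed behaviour, not polars' actual code."""
    # Right-hand names that collide with a left-hand name get a "_right" suffix.
    joined = list(left_cols) + [
        f"{name}_right" if name in left_cols else name for name in right_cols
    ]
    first_right = len(left_cols)
    for _ in range(n_keys):
        # Assumption: one right-hand column is dropped per join key, always at
        # the first right-hand position. Raises IndexError once none are left.
        joined.pop(first_right)
    return joined


print(coalesce_columns_sketch(["a", "b", "c"], ["a", "c"], 1))
print(coalesce_columns_sketch(["a", "b", "c"], ["a", "c"], 2))
```

With one key only `a_right` is dropped; with two keys `c_right` goes as well; with three keys the third removal targets index 3 of a three-element list, which mirrors the panic message.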
I would expect all three of these mathematically equivalent join expressions (the latter two being odd) to give me the same result.
Or the join should complain that we're going to produce overlapping output key names.
Does this happen if we don't join on the same name twice? We should raise, as it doesn't make sense to join on duplicate columns.
Thanks!
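For illustration, the "raise on duplicate join columns" suggestion could look roughly like the check below. It is a sketch on plain column-name strings; `check_unique_join_keys` is a hypothetical helper, not a polars function, and a real check inside polars would have to work on expressions rather than names.

```python
from collections import Counter


def check_unique_join_keys(keys):
    """Hypothetical helper: reject key lists that name a column more than once."""
    duplicates = sorted(name for name, count in Counter(keys).items() if count > 1)
    if duplicates:
        raise ValueError(f"duplicate join keys: {duplicates}")
    return list(keys)


check_unique_join_keys(["a"])            # passes through unchanged
# check_unique_join_keys(["a", "a"])     # would raise ValueError
```

Raising early like this would turn both the silently dropped `c_right` column and the panic into one clear error.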