`pl.lit` doesn't work properly as a join key

mcrumiller commented 1 year ago

Update: As per this comment the issue has changed but the title is still relevant. Here is new behavior:

import polars as pl

df1 = pl.DataFrame({
    'a': ['1', '2', '3', '4'],
})

df2 = pl.DataFrame({
    'a': ['1', '2', '3', '4'],
    'b': ['a', 'a', 'b', 'b'],
})

df1.join(df2, left_on=['a', pl.lit('b')], right_on=['a', 'b'], how="left")
# shape: (4, 3)
# ┌─────┬─────────┬──────┐
# │ a   ┆ a_right ┆ b    │
# │ --- ┆ ---     ┆ ---  │
# │ str ┆ str     ┆ str  │
# ╞═════╪═════════╪══════╡
# │ 1   ┆ null    ┆ null │
# │ 2   ┆ null    ┆ null │
# │ 3   ┆ null    ┆ null │  <-- should successfully join here
# │ 4   ┆ null    ┆ null │  <-- should successfully join here
# └─────┴─────────┴──────┘

Issue description

A pl.lit value apparently matches everything, regardless of value.

Reproducible example

import polars as pl

df1 = pl.DataFrame({
    'a': ['1', '2', '3', '4'],
})

df2 = pl.DataFrame({
    'a': ['1', '2', '3', '4'],
    'b': ['a', 'a', 'b', 'b'],
})

df1.join(
    df2,
    left_on=['a', pl.lit('b')],
    right_on=['a', 'b'],
    how="left",
)

shape: (4, 2)
┌─────┬─────────┐
│ a   ┆ literal │
│ --- ┆ ---     │
│ str ┆ str     │
╞═════╪═════════╡
│ 1   ┆ b       │
│ 2   ┆ b       │
│ 3   ┆ b       │
│ 4   ┆ b       │
└─────┴─────────┘

Expected behavior

First two records should be null.

Installed versions

``` --------Version info--------- Polars: 0.18.1 Index type: UInt32 Platform: Windows-10-10.0.19045-SP0 Python: 3.11.2 (tags/v3.11.2:878ead1, Feb 7 2023, 16:38:35) [MSC v.1934 64 bit (AMD64)] ----Optional dependencies---- numpy: 1.24.3 pandas: 2.0.0 pyarrow: 11.0.0 connectorx: 0.3.2a3 deltalake: fsspec: matplotlib: 3.7.1 xlsx2csv: 0.8.1 xlsxwriter: 3.1.0 ```

edavisau commented 10 months ago

I had a look into this, and I noticed that the description is slightly inaccurate. In this example, the pl.lit is matching nowhere.

The problem is that the literal doesn't broadcast when doing the join. Here is an illustrative example where only the first row is matched because the literal (effectively) expands to ["a", None, None, None] in the join implementation

import polars as pl

df1 = pl.DataFrame({
    'a': ['1', '2', '3', '4'],
})

df2 = pl.DataFrame({
    'a': ['1', '2', '3', '4'],
    'b': ['a', 'a', 'b', 'b'],
    'extra_col': [101, 102, 103, 104]
})

df1.join(
    df2,
    left_on=['a', pl.lit('a')],
    right_on=['a', 'b'],
    how="left",
)

shape: (4, 3)
┌─────┬─────────┬───────────┐
│ a   ┆ literal ┆ extra_col │
│ --- ┆ ---     ┆ ---       │
│ str ┆ str     ┆ i64       │
╞═════╪═════════╪═══════════╡
│ 1   ┆ a       ┆ 101       │
│ 2   ┆ a       ┆ null      │
│ 3   ┆ a       ┆ null      │
│ 4   ┆ a       ┆ null      │
└─────┴─────────┴───────────┘

mcrumiller commented 10 months ago

@edavisau I do see the non-broadcasting issue here, good find. Not sure if #9621 can be simultaneously resolved but I do not think the literal column should be in the result set.

edavisau commented 10 months ago

@mcrumiller I noticed this as well, to me it's a fundamental flaw with the current implementation of joins. For left joins, for example, polars effectively does

left_df.with_columns(left_on columns)
right_df.drop(right_on columns)

Which doesn't work well with "calculated columns"

In my mentioned PR above, the new behaviour would be

import polars as pl

df1 = pl.DataFrame({
    'a': ['1', '2', '3', '4'],
})

df2 = pl.DataFrame({
    'a': ['1', '2', '3', '4'],
    'b': ['a', 'a', 'b', 'b'],
    'extra_col': [101, 102, 103, 104]
})

df1.join(
    df2,
    left_on=['a', pl.lit('a')],
    right_on=['a', 'b'],
    how="left",
)

shape: (4, 3)
┌─────┬──────┬───────────┐
│ a   ┆ b    ┆ extra_col │
│ --- ┆ ---  ┆ ---       │
│ str ┆ str  ┆ i64       │
╞═════╪══════╪═══════════╡
│ 1   ┆ a    ┆ 101       │
│ 2   ┆ a    ┆ 102       │
│ 3   ┆ null ┆ null      │
│ 4   ┆ null ┆ null      │
└─────┴──────┴───────────┘

However, it's a big breaking change, and IMO should be decided simultaneously with issues like #13441 which is scheduled for 1.0 release.

mcrumiller commented 10 months ago

It's similar to not coalescing, but it's not the same. If the join condition is itself a calculation, it shouldn't be included in the output. In SQL, for example, you can do:

SELECT
  A.*, B.*
FROM A
LEFT JOIN B ON
  A.value < B.value

After doing this your A.* doesn't include a column of True/False values indicating whether A.value < B.value. I don't believe this is intended in polars either, further evidenced by the fact that you cannot alias the join expression.

edavisau commented 10 months ago

That makes sense, yep my updated condition was to check that the left_on and right_on are both not calculated expressions. To do this I compared the names and the pointer to the underlying data - here.

My point was that it all should be decided at once what the new "join behaviour" should be in 1.0. I know they are not the same issue but they are quite interdependent in my opinion.

mcrumiller commented 4 months ago

Looks like this is now wrong but for a different reason, as of 1.0.1:

df1.join(df2, left_on=['a', pl.lit('b')], right_on=['a', 'b'], how="left")
# shape: (4, 3)
# ┌─────┬─────────┬──────┐
# │ a   ┆ a_right ┆ b    │
# │ --- ┆ ---     ┆ ---  │
# │ str ┆ str     ┆ str  │
# ╞═════╪═════════╪══════╡
# │ 1   ┆ null    ┆ null │
# │ 2   ┆ null    ┆ null │
# │ 3   ┆ null    ┆ null │  <-- should successfully join here
# │ 4   ┆ null    ┆ null │  <-- should successfully join here
# └─────┴─────────┴──────┘

@ritchie46 unsure if should create new issue.

AlexeyDmitriev commented 3 months ago

I have an example that is somewhat simpler in my opinion, which also returns strange result for original issue

L = pl.DataFrame({'a': [1,2]})
R = pl.DataFrame({'b': [3,4,5]})
L.join(R, left_on=pl.col('a') - pl.col('a'), right_on=pl.col('b') - pl.col('b')) # 6 lines as expected, full cross product
L.join(R, left_on=pl.lit(0), right_on=pl.lit(0)) # only 1 line, expected to be the same as previous

pola-rs / polars