Open mcrumiller opened 1 year ago
I had a look into this, and I noticed that the description is slightly inaccurate. In this example, the pl.lit is matching nowhere.
The problem is that the literal doesn't broadcast when doing the join. Here is an illustrative example where only the first row is matched because the literal (effectively) expands to ["a", None, None, None] in the join implementation:
import polars as pl

df1 = pl.DataFrame({
    'a': ['1', '2', '3', '4'],
})
df2 = pl.DataFrame({
    'a': ['1', '2', '3', '4'],
    'b': ['a', 'a', 'b', 'b'],
    'extra_col': [101, 102, 103, 104]
})
df1.join(
    df2,
    left_on=['a', pl.lit('a')],
    right_on=['a', 'b'],
    how="left",
)
shape: (4, 3)
┌─────┬─────────┬───────────┐
│ a   ┆ literal ┆ extra_col │
│ --- ┆ ---     ┆ ---       │
│ str ┆ str     ┆ i64       │
╞═════╪═════════╪═══════════╡
│ 1   ┆ a       ┆ 101       │
│ 2   ┆ a       ┆ null      │
│ 3   ┆ a       ┆ null      │
│ 4   ┆ a       ┆ null      │
└─────┴─────────┴───────────┘
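To make the non-broadcasting explanation concrete, here is a minimal plain-Python model of a left join on composite keys. It is only a sketch of the described effect, not polars' actual join code: `left_join_rows` is a hypothetical helper, and padding the literal with `None` is an assumption based on the "effectively expands to" description in this comment.

```python
def left_join_rows(left_keys, right_keys):
    """For each left key, return the matching right row indices ([None] if no match)."""
    index = {}
    for j, key in enumerate(right_keys):
        index.setdefault(key, []).append(j)
    return [index.get(key, [None]) for key in left_keys]


left_a = ["1", "2", "3", "4"]
right_a = ["1", "2", "3", "4"]
right_b = ["a", "a", "b", "b"]
right_keys = list(zip(right_a, right_b))

# Literal padded with nulls instead of being broadcast (the effect described above).
padded = ["a"] + [None] * (len(left_a) - 1)
# Literal broadcast to the height of the left frame.
broadcast = ["a"] * len(left_a)

matches_padded = left_join_rows(list(zip(left_a, padded)), right_keys)
matches_broadcast = left_join_rows(list(zip(left_a, broadcast)), right_keys)

print(matches_padded)     # [[0], [None], [None], [None]]
print(matches_broadcast)  # [[0], [1], [None], [None]]
```

With the padded literal only the first left row finds a partner, which matches the extra_col values shown above. With a broadcast literal the first two rows match.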
@edavisau I do see the non-broadcasting issue here, good find. Not sure if #9621 can be simultaneously resolved, but I do not think the literal column should be in the result set.
@mcrumiller I noticed this as well; to me it's a fundamental flaw with the current implementation of joins. For left joins, for example, polars effectively does:
left_df.with_columns(left_on columns)
right_df.drop(right_on columns)
This doesn't work well with "calculated columns".
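The two pseudocode steps above can be modeled in plain Python to show why a calculated key ends up in the output. This is a sketch under assumptions, not polars internals: `assemble_left_join` is a hypothetical helper, columns are plain dicts of lists, and `matches` is hard-coded to agree with the result table posted earlier in the thread.

```python
def assemble_left_join(left, left_on, right, right_on, matches):
    """Build left-join output from per-row matches (right row index or None)."""
    out = dict(left)
    # Step 1: like left_df.with_columns(left_on columns); evaluated keys become columns.
    out.update(left_on)
    for name, values in right.items():
        # Step 2: like right_df.drop(right_on columns); right key columns are skipped.
        if name in right_on:
            continue
        out[name] = [None if j is None else values[j] for j in matches]
    return out


left = {"a": ["1", "2", "3", "4"]}
right = {
    "a": ["1", "2", "3", "4"],
    "b": ["a", "a", "b", "b"],
    "extra_col": [101, 102, 103, 104],
}
# The evaluated left_on expressions: the existing column "a" and the literal.
left_on = {"a": left["a"], "literal": ["a"] * 4}
# Assumed matches, taken from the output shown earlier (only row 0 matched).
matches = [0, None, None, None]

result = assemble_left_join(left, left_on, right, ["a", "b"], matches)
print(list(result))  # ['a', 'literal', 'extra_col']
```

Because the literal is written into the left frame as a column, it appears in the result under the name "literal", while the right-hand "b" it was compared against is dropped.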
In the PR I mentioned above, the new behaviour would be:
import polars as pl

df1 = pl.DataFrame({
    'a': ['1', '2', '3', '4'],
})
df2 = pl.DataFrame({
    'a': ['1', '2', '3', '4'],
    'b': ['a', 'a', 'b', 'b'],
    'extra_col': [101, 102, 103, 104]
})
df1.join(
    df2,
    left_on=['a', pl.lit('a')],
    right_on=['a', 'b'],
    how="left",
)
shape: (4, 3)
┌─────┬──────┬───────────┐
│ a   ┆ b    ┆ extra_col │
│ --- ┆ ---  ┆ ---       │
│ str ┆ str  ┆ i64       │
╞═════╪══════╪═══════════╡
│ 1   ┆ a    ┆ 101       │
│ 2   ┆ a    ┆ 102       │
│ 3   ┆ null ┆ null      │
│ 4   ┆ null ┆ null      │
└─────┴──────┴───────────┘
However, it's a big breaking change, and IMO it should be decided simultaneously with issues like #13441, which is scheduled for the 1.0 release.
It's similar to not coalescing, but it's not the same. If the join condition is itself a calculation, it shouldn't be included in the output. In SQL, for example, you can do:
SELECT
A.*, B.*
FROM A
LEFT JOIN B ON
A.value < B.value
After doing this, your A.* doesn't include a column of True/False values indicating whether A.value < B.value. I don't believe this is intended in polars either, further evidenced by the fact that you cannot alias the join expression.
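The SQL point can be illustrated with a small plain-Python left join that takes an arbitrary predicate. This is a sketch only: `left_join_on` is a hypothetical helper, the sample rows are made up for illustration, and B is assumed to be non-empty.

```python
def left_join_on(a_rows, b_rows, predicate):
    """Left join two lists of dict rows on an arbitrary predicate."""
    # Null row used when a left row has no partner (assumes b_rows is non-empty).
    null_b = {key: None for key in b_rows[0]}
    out = []
    for a in a_rows:
        matched = [b for b in b_rows if predicate(a, b)] or [null_b]
        for b in matched:
            row = {f"A.{k}": v for k, v in a.items()}
            row.update({f"B.{k}": v for k, v in b.items()})
            out.append(row)
    return out


# Illustrative data, not taken from the thread.
A = [{"value": 1}, {"value": 5}]
B = [{"value": 3}, {"value": 4}]

rows = left_join_on(A, B, lambda a, b: a["value"] < b["value"])
for row in rows:
    print(row)
```

The predicate only decides which rows pair up. Every output row carries the columns of A and B and nothing else, so the join condition itself never shows up as a column.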
That makes sense. Yep, my updated condition was to check that neither left_on nor right_on is a calculated expression. To do this I compared the names and the pointer to the underlying data - here.
My point was that the new "join behaviour" for 1.0 should all be decided at once. I know they are not the same issue, but they are quite interdependent in my opinion.
Looks like this is now wrong but for a different reason, as of 1.0.1:
df1.join(df2, left_on=['a', pl.lit('b')], right_on=['a', 'b'], how="left")
# shape: (4, 3)
# ┌─────┬─────────┬──────┐
# │ a   ┆ a_right ┆ b    │
# │ --- ┆ ---     ┆ ---  │
# │ str ┆ str     ┆ str  │
# ╞═════╪═════════╪══════╡
# │ 1   ┆ null    ┆ null │
# │ 2   ┆ null    ┆ null │
# │ 3   ┆ null    ┆ null │  <-- should successfully join here
# │ 4   ┆ null    ┆ null │  <-- should successfully join here
# └─────┴─────────┴──────┘
@ritchie46 unsure whether I should create a new issue.
I have an example that is somewhat simpler, in my opinion, and that also returns a strange result for the original issue:
L = pl.DataFrame({'a': [1,2]})
R = pl.DataFrame({'b': [3,4,5]})
L.join(R, left_on=pl.col('a') - pl.col('a'), right_on=pl.col('b') - pl.col('b')) # 6 lines as expected, full cross product
L.join(R, left_on=pl.lit(0), right_on=pl.lit(0)) # only 1 line, expected to be the same as previous
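For reference, the expected row counts can be checked with a plain-Python model of an inner equi-join. This is a sketch, not polars code: `inner_join` is a hypothetical helper, and the last case (a literal left as a single value on each side) is only an assumption about what might produce the one-row result, in line with the non-broadcasting discussion earlier in the thread.

```python
def inner_join(left_keys, right_keys):
    """Return (left_index, right_index) pairs whose keys are equal."""
    index = {}
    for j, key in enumerate(right_keys):
        index.setdefault(key, []).append(j)
    return [(i, j) for i, key in enumerate(left_keys) for j in index.get(key, [])]


a = [1, 2]
b = [3, 4, 5]

# Keys computed per row, like pl.col('a') - pl.col('a'): all zeros, full height.
pairs_computed = inner_join([x - x for x in a], [y - y for y in b])
# A literal 0 broadcast to the height of each frame gives the same keys.
pairs_broadcast = inner_join([0] * len(a), [0] * len(b))
# Assumption: a literal that is NOT broadcast contributes a single key per side.
pairs_unbroadcast = inner_join([0], [0])

print(len(pairs_computed), len(pairs_broadcast), len(pairs_unbroadcast))  # 6 6 1
```

When every key is the same constant, an equi-join degenerates into the cross product, 2 x 3 = 6 rows here, which is why both calls above would be expected to agree.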
Update: As per this comment, the issue has changed but the title is still relevant. Here is the new behavior:
Issue description
A pl.lit value apparently matches everything, regardless of value.
Reproducible example
Expected behavior
First two records should be null.
Installed versions