Closed accer-bndes closed 5 months ago
It's not really a bug. That's what is intended when you use an expression in the join. The way to get what you want is
(test0
.with_columns(
pl.col("datetime").dt.date()
)
.join(
test1
.with_columns(
pl.col("datetime").dt.date(),
datetime_right="datetime"
),
on="datetime")
)
The above could be more concise if this one is implemented.
Maybe, this could be reformulated as a feature request to allow something like temp_left_on and/or temp_right_on so that it would join by that expression but not keep it.
It would only make sense in the context of allowing the retention of right keys.
With both features you could then do
test0.join(test1, left_on=pl.col("datetime").dt.date(),
temp_right_on=pl.col("datetime").dt.date(),
keep_right=True
)
Oh, ok. That workaround loses the time from the left dataframe, but I know I can:
date_tmp = pl.col("datetime").dt.date().alias("date")
test0.with_columns(date_tmp).join(
test1.with_columns(date_tmp), on="date"
).drop("date")
I don't think this feature would be necessary, I'm ok with creating and dropping the temporary column.
Just didn't expect on
to change the original column... maybe a documentation issue?
I would argue that this is indeed a bug as it differs from the SQL standard (which is what polars seems to be wanting to align to particularly in the case of joins). The docstring for join
even says "Join in SQL-like fashion."
In SQL, regardless of whether a join key is an expression or just a plain column, it does not mutate the data.
I think another example using left_on
and right_on
really highlights why I believe this is a bug
df_left = pl.DataFrame({"category": ["amount_1", "amount_2"]})
df_right = pl.DataFrame({
"amount": ["1", "2"],
"value": [25, 75]
})
print(
df_left
.join(
df_right,
# I only want to use this expression to join the dfs,
# not mutate data in this column
left_on=pl.col("category").str.extract(r"(\d)"),
right_on="amount"
)
# Now I need to do another join on the `category` column in the left table
# but it has been mutated by `left_on` to something it isn't.
)
# shape: (2, 2)
# ┌──────────┬───────┐
# │ category ┆ value │
# │ --- ┆ --- │
# │ str ┆ i64 │
# ╞══════════╪═══════╡
# │ 1 ┆ 25 │
# │ 2 ┆ 75 │
# └──────────┴───────┘
Compare this to SQL via duckdb
query = r"""
select df_left.*, df_right.value
from df_left
inner join df_right
-- `regexp_extract` is akin to polars `pl.col("...").str.extract`
on regexp_extract(df_left.category, '\d') = df_right.amount
"""
print(duckdb.sql(query))
# ┌──────────┬───────┐
# │ category │ value │
# │ varchar │ int64 │
# ├──────────┼───────┤
# │ amount_1 │ 25 │
# │ amount_2 │ 75 │
# └──────────┴───────┘
Postgres has the same behaviour as duckdb. Polars can't run this query due to "InvalidOperationError: Unsupported SQL join constraint"
As a final point as to why I think the current behaviour is buggy, putting the expression in right_on
expectedly won't mutate anything (but left_on
does). Although the output here is expected, it does highlight the inconsistency
print(
df_right.join(
df_left,
# `amount` doesn't get mutated even though `category` did in the previous example
left_on="amount",
right_on=pl.col("category").str.extract(r"(\d)"),
)
)
# shape: (2, 2)
# ┌────────┬───────┐
# │ amount ┆ value │
# │ --- ┆ --- │
# │ str ┆ i64 │
# ╞════════╪═══════╡
# │ 1 ┆ 25 │
# │ 2 ┆ 75 │
# └────────┴───────┘
And for good measure, an example like OPs showing on
mutating the data in the left table
df_left = pl.DataFrame({"category": ["amount_1", "amount_2"]})
df_right = pl.DataFrame({
"category": ["1", "2"],
"value": [25, 75]
})
# Again, the `category` is mutated in the left table
print(df_left.join(df_right, on=pl.col("category").str.extract(r"(\d)")))
# shape: (2, 2)
# ┌──────────┬───────┐
# │ category ┆ value │
# │ --- ┆ --- │
# │ str ┆ i64 │
# ╞══════════╪═══════╡
# │ 1 ┆ 25 │
# │ 2 ┆ 75 │
# └──────────┴───────┘
FYI @stinodego for your thoughts
Hi @ritchie46 and @stinodego, may I be request your opinion on whether this issue is a bug, or intended bahviour?
This looks wrong to me.
A bit more of a compact repro:
from datetime import datetime
import polars as pl
datetimes = [datetime(2022, 1, 1), datetime(2022, 1, 1, 12), datetime(2022, 1, 2)]
df1 = pl.DataFrame({"dt": datetimes, "value": 0})
df2 = pl.DataFrame({"dt": datetimes, "value": 1})
result = df1.join(df2, on=pl.col("dt").dt.date())
print(result)
shape: (5, 3)
┌────────────┬───────┬─────────────┐
│ dt ┆ value ┆ value_right │
│ --- ┆ --- ┆ --- │
│ date ┆ i64 ┆ i64 │
╞════════════╪═══════╪═════════════╡
│ 2022-01-01 ┆ 0 ┆ 1 │
│ 2022-01-01 ┆ 0 ┆ 1 │
│ 2022-01-01 ┆ 0 ┆ 1 │
│ 2022-01-01 ┆ 0 ┆ 1 │
│ 2022-01-02 ┆ 0 ┆ 1 │
└────────────┴───────┴─────────────┘
I think this is actually related to our story on coalesce
in joins.
I definitely agree it's a bug that expressions in on
, left_on
and right_on
modify the output, they should not.
However, when you pass an expression instead of just a column into these arguments, coalesce
-ing is no longer allowed if the expression isn't an injective function.
I think I will just turn off coalescing if the join expressions are anything other than col
.
Checks
[x] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of Polars.
Reproducible example
Log output
Issue description
When joining 2 DataFrames, using expressions as join conditions may mutate the original column values.
Expected behavior
Installed versions