pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
29.29k stars 1.85k forks source link

Don't disable coalesce when joining on expressions #18194

Open mcrumiller opened 1 month ago

mcrumiller commented 1 month ago

Description

The documentation for join states the following for the coalesce parameter:

coalesce

Coalescing behavior (merging of join columns).

  • None: -> join specific.
  • True: -> Always coalesce join columns.
  • False: -> Never coalesce join columns.

Note that joining on any other expressions than col will turn off coalescing.

I'm not sure why joining on other expressions disables coalescing. To get around this behavior, I find myself constantly using the following patterns:

df2 = df2.with_columns(
    col("name_1").operations().alias("tmp_name_1"),
    col("name_2").operations().alias("tmp_name_2"),
    col("name_3").operations().alias("tmp_name_3"),
)
df1.join(df2,
    left_on=["name_1", "name_2", "name_3"],
    right_on=["tmp_name_1", "tmp_name_2", "tmp_name_3"],
    how="left",
    coalesce=True,
).drop("tmp_name_1", "tmp_name_2", "tmp_name_3")

Instead, it would be preferable to be able to do:


df1.join(df2,
    left_on=["name_1", "name_2", "name_3"],
    right_on=[col("name_1").operations(), col("name_2").operations(), col("name_3").operations()],
    how="left",
    coalesce=True
)
deanm0000 commented 1 month ago

One small tweak to your workaround

df2 = df2.with_columns(
    expr1.alias("name_1"),
    expr2.alias("name_2"),
    expr3.alias("name_3"),
)
df1.join(df2,
    on=["name_1", "name_2", "name_3"],
    how="left",
    coalesce=True,
)

I'm guessing that it was an implementation difficulty that made it the way it is rather than a planned decision. Maybe to do with handling aliases.

mcrumiller commented 1 month ago

@deanm0000 the problem is that name_1, name_2, and name_3 already exist in df2, but I need to do an operation on them, for example a .shift, so I cannot re-alias to the same name.

I updated my example above to make that a bit clearer.

ritchie46 commented 4 weeks ago

It adds a lot of complexity and I don't think I am not sure it makes sense either.