pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
30.43k stars 1.97k forks source link

join/on expression mutates column #13220

Closed accer-bndes closed 5 months ago

accer-bndes commented 11 months ago

Checks

Reproducible example

import polars as pl
from datetime import datetime

test0 = pl.DataFrame({
    "datetime": pl.datetime_range(datetime(2022, 1, 1), datetime(2022, 1, 2), "1h", eager=True),
    "value": 0,
})

test1 = pl.DataFrame({
    "datetime": pl.datetime_range(datetime(2022, 1, 1), datetime(2022, 1, 2), "1h", eager=True),
    "value": 1,
})

print(test0)
print(test0.join(test1, on=pl.col("datetime").dt.date()))

Log output

shape: (25, 2)
┌─────────────────────┬───────┐
│ datetime            ┆ value │
│ ---                 ┆ ---   │
│ datetime[μs]        ┆ i64   │
╞═════════════════════╪═══════╡
│ 2022-01-01 00:00:00 ┆ 0     │
│ 2022-01-01 01:00:00 ┆ 0     │
│ 2022-01-01 02:00:00 ┆ 0     │
│ 2022-01-01 03:00:00 ┆ 0     │
│ …                   ┆ …     │
│ 2022-01-01 21:00:00 ┆ 0     │
│ 2022-01-01 22:00:00 ┆ 0     │
│ 2022-01-01 23:00:00 ┆ 0     │
│ 2022-01-02 00:00:00 ┆ 0     │
└─────────────────────┴───────┘
shape: (577, 3)
┌────────────┬───────┬─────────────┐
│ datetime   ┆ value ┆ value_right │
│ ---        ┆ ---   ┆ ---         │
│ date       ┆ i64   ┆ i64         │
╞════════════╪═══════╪═════════════╡
│ 2022-01-01 ┆ 0     ┆ 1           │
│ 2022-01-01 ┆ 0     ┆ 1           │
│ 2022-01-01 ┆ 0     ┆ 1           │
│ 2022-01-01 ┆ 0     ┆ 1           │
│ …          ┆ …     ┆ …           │
│ 2022-01-01 ┆ 0     ┆ 1           │
│ 2022-01-01 ┆ 0     ┆ 1           │
│ 2022-01-01 ┆ 0     ┆ 1           │
│ 2022-01-02 ┆ 0     ┆ 1           │
└────────────┴───────┴─────────────┘

Issue description

When joining 2 DataFrames, using expressions as join conditions may mutate the original column values.

Expected behavior

shape: (577, 4)
┌─────────────────────┬───────┬─────────────────────┬─────────────┐
│ datetime            ┆ value ┆ datetime_right      ┆ value_right │
│ ---                 ┆ ---   ┆ ---                 ┆ ---         │
│ datetime[μs]        ┆ i64   ┆ datetime[μs]        ┆ i64         │
╞═════════════════════╪═══════╪═════════════════════╪═════════════╡
│ 2022-01-01 00:00:00 ┆ 0     ┆ 2022-01-01 00:00:00 ┆ 1           │
│ 2022-01-01 00:00:00 ┆ 0     ┆ 2022-01-01 01:00:00 ┆ 1           │
│ 2022-01-01 00:00:00 ┆ 0     ┆ 2022-01-01 02:00:00 ┆ 1           │
│ 2022-01-01 00:00:00 ┆ 0     ┆ 2022-01-01 03:00:00 ┆ 1           │
│ …                   ┆ …     ┆ …                   ┆ …           │
│ 2022-01-01 23:00:00 ┆ 0     ┆ 2022-01-01 21:00:00 ┆ 1           │
│ 2022-01-01 23:00:00 ┆ 0     ┆ 2022-01-01 22:00:00 ┆ 1           │
│ 2022-01-01 23:00:00 ┆ 0     ┆ 2022-01-01 23:00:00 ┆ 1           │
│ 2022-01-02 00:00:00 ┆ 0     ┆ 2022-01-02 00:00:00 ┆ 1           │
└─────────────────────┴───────┴─────────────────────┴─────────────┘

Installed versions

``` --------Version info--------- Polars: 0.20.2 Index type: UInt32 Platform: Linux-6.6.7-zen1-1-zen-x86_64-with-glibc2.38 Python: 3.12.0 (main, Oct 12 2023, 16:15:51) [GCC 13.2.1 20230801] ----Optional dependencies---- adbc_driver_manager: cloudpickle: connectorx: deltalake: fsspec: 2023.12.2 gevent: matplotlib: 3.8.2 numpy: 1.26.2 openpyxl: pandas: 2.1.4 pyarrow: pydantic: 2.5.3 pyiceberg: pyxlsb: sqlalchemy: xlsx2csv: xlsxwriter: ```
deanm0000 commented 11 months ago

It's not really a bug. That's what is intended when you use an expression in the join. The way to get what you want is

(test0
      .with_columns(
          pl.col("datetime").dt.date()
          )
       .join(
           test1
           .with_columns(
               pl.col("datetime").dt.date(),
               datetime_right="datetime"
           ), 
                on="datetime")
)

The above could be more concise if this one is implemented.

Maybe, this could be reformulated as a feature request to allow something like temp_left_on and/or temp_right_on so that it would join by that expression but not keep it.

It would only make sense in the context of allowing the retention of right keys.

With both features you could then do

test0.join(test1, left_on=pl.col("datetime").dt.date(),
                              temp_right_on=pl.col("datetime").dt.date(),
                               keep_right=True
)
accer-bndes commented 11 months ago

Oh, ok. That workaround loses the time from the left dataframe, but I know I can:

date_tmp = pl.col("datetime").dt.date().alias("date")

test0.with_columns(date_tmp).join(
    test1.with_columns(date_tmp), on="date"
).drop("date")

I don't think this feature would be necessary, I'm ok with creating and dropping the temporary column. Just didn't expect on to change the original column... maybe a documentation issue?

henryharbeck commented 10 months ago

I would argue that this is indeed a bug as it differs from the SQL standard (which is what polars seems to be wanting to align to particularly in the case of joins). The docstring for join even says "Join in SQL-like fashion."

In SQL, regardless of whether a join key is an expression or just a plain column, it does not mutate the data.

I think another example using left_on and right_on really highlights why I believe this is a bug

df_left = pl.DataFrame({"category": ["amount_1", "amount_2"]})
df_right = pl.DataFrame({
    "amount": ["1", "2"],
    "value": [25, 75]
})

print(
    df_left
    .join(
        df_right,
        # I only want to use this expression to join the dfs,
        # not mutate data in this column
        left_on=pl.col("category").str.extract(r"(\d)"),
        right_on="amount"
    )
    # Now I need to do another join on the `category` column in the left table
    # but it has been mutated by `left_on` to something it isn't.
)

# shape: (2, 2)
# ┌──────────┬───────┐
# │ category ┆ value │
# │ ---      ┆ ---   │
# │ str      ┆ i64   │
# ╞══════════╪═══════╡
# │ 1        ┆ 25    │
# │ 2        ┆ 75    │
# └──────────┴───────┘

Compare this to SQL via duckdb

query = r"""
select df_left.*, df_right.value
from df_left
inner join df_right
-- `regexp_extract` is akin to polars `pl.col("...").str.extract`
on regexp_extract(df_left.category, '\d') = df_right.amount
"""

print(duckdb.sql(query))

# ┌──────────┬───────┐
# │ category │ value │
# │ varchar  │ int64 │
# ├──────────┼───────┤
# │ amount_1 │    25 │
# │ amount_2 │    75 │
# └──────────┴───────┘

Postgres has the same behaviour as duckdb. Polars can't run this query due to "InvalidOperationError: Unsupported SQL join constraint"

As a final point as to why I think the current behaviour is buggy, putting the expression in right_on expectedly won't mutate anything (but left_on does). Although the output here is expected, it does highlight the inconsistency

print(
    df_right.join(
        df_left,
        # `amount` doesn't get mutated even though `category` did in the previous example
        left_on="amount",
        right_on=pl.col("category").str.extract(r"(\d)"),
    )
)

# shape: (2, 2)
# ┌────────┬───────┐
# │ amount ┆ value │
# │ ---    ┆ ---   │
# │ str    ┆ i64   │
# ╞════════╪═══════╡
# │ 1      ┆ 25    │
# │ 2      ┆ 75    │
# └────────┴───────┘

And for good measure, an example like OPs showing on mutating the data in the left table

df_left = pl.DataFrame({"category": ["amount_1", "amount_2"]})
df_right = pl.DataFrame({
    "category": ["1", "2"],
    "value": [25, 75]
})

# Again, the `category` is mutated in the left table
print(df_left.join(df_right, on=pl.col("category").str.extract(r"(\d)")))

# shape: (2, 2)
# ┌──────────┬───────┐
# │ category ┆ value │
# │ ---      ┆ ---   │
# │ str      ┆ i64   │
# ╞══════════╪═══════╡
# │ 1        ┆ 25    │
# │ 2        ┆ 75    │
# └──────────┴───────┘

FYI @stinodego for your thoughts

henryharbeck commented 10 months ago

Hi @ritchie46 and @stinodego, may I be request your opinion on whether this issue is a bug, or intended bahviour?

stinodego commented 10 months ago

This looks wrong to me.

A bit more of a compact repro:

from datetime import datetime

import polars as pl

datetimes = [datetime(2022, 1, 1), datetime(2022, 1, 1, 12), datetime(2022, 1, 2)]
df1 = pl.DataFrame({"dt": datetimes, "value": 0})
df2 = pl.DataFrame({"dt": datetimes, "value": 1})
result = df1.join(df2, on=pl.col("dt").dt.date())
print(result)
shape: (5, 3)
┌────────────┬───────┬─────────────┐
│ dt         ┆ value ┆ value_right │
│ ---        ┆ ---   ┆ ---         │
│ date       ┆ i64   ┆ i64         │
╞════════════╪═══════╪═════════════╡
│ 2022-01-01 ┆ 0     ┆ 1           │
│ 2022-01-01 ┆ 0     ┆ 1           │
│ 2022-01-01 ┆ 0     ┆ 1           │
│ 2022-01-01 ┆ 0     ┆ 1           │
│ 2022-01-02 ┆ 0     ┆ 1           │
└────────────┴───────┴─────────────┘
orlp commented 7 months ago

I think this is actually related to our story on coalesce in joins.

I definitely agree it's a bug that expressions in on, left_on and right_on modify the output, they should not.

However, when you pass an expression instead of just a column into these arguments, coalesce-ing is no longer allowed if the expression isn't an injective function.

ritchie46 commented 5 months ago

I think I will just turn off coalescing if the join expressions are anything other than col.