pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Unclear on join-coalesce breaking change: `df.join(..., how="left", coalesce=None/True/False)` #17986

Open DeflateAwning opened 1 month ago

DeflateAwning commented 1 month ago

Proposed Change:

In the coalesce section, explain the default coalesce behaviour for each how option.

Description

The behaviour of df.join(df2, how='left', on/left_on/right_on=...) does not appear to have changed since the v1.0 milestone, despite the DeprecationWarning emitted in v0.20.31.

Main Request: Add an example to the join page which clarifies the behaviour of how='left' with each coalesce option. Explain the behaviour of coalesce=None for each join type (currently says "None: -> join specific.", which is sort of meaningless without any additional context).

Additional Confusion

I would have expected the following code to behave differently on v0.20.31 vs. v1.3, based on the DeprecationWarning in v0.20.31.

import polars as pl

df_customers = pl.DataFrame(
    {
        "customer_id": [1, 2, 3],
        "name": ["Alice", "Bob", "Charlie"],
    }
)

df_orders = pl.DataFrame(
    {
        "order_id": ["a", "b", "c"],
        "customer_id": [1, 2, 2],
        "amount": [100, 200, 300],
    }
)

df_orders2 = df_orders.rename({'customer_id': 'cid'})

print(df_customers.join(df_orders, on="customer_id", how="left"))

print(df_customers.join(df_orders2, left_on="customer_id", right_on='cid', how="left"))

The output on both mentioned versions is the following:

shape: (4, 4)
┌─────────────┬─────────┬──────────┬────────┐
│ customer_id ┆ name    ┆ order_id ┆ amount │
│ ---         ┆ ---     ┆ ---      ┆ ---    │
│ i64         ┆ str     ┆ str      ┆ i64    │
╞═════════════╪═════════╪══════════╪════════╡
│ 1           ┆ Alice   ┆ a        ┆ 100    │
│ 2           ┆ Bob     ┆ b        ┆ 200    │
│ 2           ┆ Bob     ┆ c        ┆ 300    │
│ 3           ┆ Charlie ┆ null     ┆ null   │
└─────────────┴─────────┴──────────┴────────┘
shape: (4, 4)
┌─────────────┬─────────┬──────────┬────────┐
│ customer_id ┆ name    ┆ order_id ┆ amount │
│ ---         ┆ ---     ┆ ---      ┆ ---    │
│ i64         ┆ str     ┆ str      ┆ i64    │
╞═════════════╪═════════╪══════════╪════════╡
│ 1           ┆ Alice   ┆ a        ┆ 100    │
│ 2           ┆ Bob     ┆ b        ┆ 200    │
│ 2           ┆ Bob     ┆ c        ┆ 300    │
│ 3           ┆ Charlie ┆ null     ┆ null   │
└─────────────┴─────────┴──────────┴────────┘

Deprecation Warnings:

DeprecationWarning: The default coalesce behavior of left join will change to `False` in the next breaking release. Pass `coalesce=True` to keep the current behavior and silence this warning.
  print(df_customers.join(df_orders, on="customer_id", how="left"))

DeprecationWarning: The default coalesce behavior of left join will change to `False` in the next breaking release. Pass `coalesce=True` to keep the current behavior and silence this warning.
  print(df_customers.join(df_orders2, left_on="customer_id", right_on='cid', how="left"))

Link

https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.join.html#polars.DataFrame.join

DeflateAwning commented 1 month ago

After further experimentation, I've deemed that the DeprecationWarning in v0.20.31 for how='left' is an outright lie.

Opening another issue in case there's any interest in backporting to fix that DeprecationWarning.

ritchie46 commented 1 month ago

> I've deemed that the DeprecationWarning in v0.20.31 for how='left' is an outright lie.

Relax, there is no ill intent. I am sorry for the hassle you experienced, but we are allowed to change our mind. It is free software. You are free to open a PR to suggest the documentation changes you'd like to see.

DeflateAwning commented 1 month ago

Sorry, I didn't mean to suggest there was ill intent by using the word "lie". I just meant that the warning was factually incorrect. I can't imagine the awesome maintainers of this project would ever have any ill intent. Thanks a lot for your work on this project :)

I may submit the PR, but in case anyone gets to it first, I'd propose the following change:

In the coalesce section, explain the default coalesce behaviour for each how option.