pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

`join_nulls` in "asof" join #17886

Open MariusMerkleQC opened 1 month ago

MariusMerkleQC commented 1 month ago

Description

Would it be possible to add the join_nulls: bool = False argument to the .join_asof() function, as it is also available to the .join() function?

I have a use case where I want to join two data frames using the "asof" logic, and I'd also like rows to match when the join keys (on/left_on/right_on) are Null. I would also be interested in whether there is a workaround in the meantime.

mcrumiller commented 1 month ago

Can you provide an example including the desired result? Do you mean when the left frame is null, or when the right frame does not have a matching row but does have a null value?

MariusMerkleQC commented 1 month ago

Does this example clarify the desired result?

import polars as pl
from datetime import datetime

df_expected = pl.DataFrame(
    data=[
        (None, datetime(2024, 1, 1, 0, 0, 0), 5),
        ("a", datetime(2024, 1, 1, 0, 0, 0), 5),
    ],
    schema={"category": pl.Utf8, "timestamp": pl.Datetime, "value": pl.Int8},
    orient="row",
)

df_left = df_expected.drop("value")

df_right = pl.DataFrame(
    data=[
        (None, datetime(2023, 1, 1, 0, 0, 0), 5),
        ("a", datetime(2023, 1, 1, 0, 0, 0), 5),
    ],
    schema={"category": pl.Utf8, "timestamp": pl.Datetime, "value": pl.Int8},
    orient="row",
)

df_actual = df_left.join_asof(
    other=df_right,
    on="timestamp",
    by=["category"],
    strategy="backward",  # join_nulls=True
)

mcrumiller commented 1 month ago

I see--I believe you're asking that the initial by=... include the ability to join on nulls.

I think there is an issue: joining on nulls produces the Cartesian product of the matching records, and these are not guaranteed to come out in sorted order, which join_asof requires. But of course, if the join itself is producing those records, it could probably sort them.