pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
29.19k stars 1.84k forks source link

join_asof does not verify order if by argument is given #18014

Open MariusMerkleQC opened 1 month ago

MariusMerkleQC commented 1 month ago

Checks

Reproducible example

import polars as pl
from datetime import datetime

df_left = pl.DataFrame(
    data=[(datetime(2023, 1, 1, 0, 0, 0), "a")],
    schema={"timestamp": pl.Datetime, "key": pl.Utf8},
    orient="row",
)

df_right = pl.DataFrame(
    data=[
        (datetime(2023, 6, 1, 0, 0, 0), 1, "a"),
        (datetime(2022, 6, 1, 0, 0, 0), 3, "a"),
    ],
    schema={"timestamp": pl.Datetime, "value": pl.Int32, "key": pl.Utf8},
    orient="row",
)

df_joined = df_left.join_asof(
    other=df_right, by="key", on="timestamp", strategy="backward"
)

Log output

No response

Issue description

Even though df_right is not sorted, the code runs without interruption and will have an incorrect joined result.

Expected behavior

It should raise the following error instead. It does so if you remove by="key" above.

polars.exceptions.InvalidOperationError: argument in operation 'asof_join' is not sorted, please sort the 'expr/series/column' first

Installed versions

``` --------Version info--------- Polars: 1.2.1 Index type: UInt32 Platform: macOS-14.5-arm64-arm-64bit Python: 3.12.4 | packaged by conda-forge | (main, Jun 17 2024, 10:13:44) [Clang 16.0.6 ] ----Optional dependencies---- adbc_driver_manager: cloudpickle: 3.0.0 connectorx: deltalake: fastexcel: fsspec: 2024.6.1 gevent: great_tables: hvplot: matplotlib: 3.8.4 nest_asyncio: 1.6.0 numpy: 1.26.4 openpyxl: 3.1.4 pandas: 2.2.2 pyarrow: 15.0.2 pydantic: 2.8.2 pyiceberg: sqlalchemy: 2.0.31 torch: xlsx2csv: xlsxwriter: 3.1.9 None ```
kszlim commented 1 month ago

Yeah, this seems like a big footgun to me, I've been bitten by it before, and I made sure all my inputs to join_asof are defensively sorted, but it comes at a largeish CPU and memory cost. I'd prefer if it could just fail/throw (perhaps with a verify_sorted flag if people really don't want to pay the cost to check for sortedness).

ritchie46 commented 1 month ago

Yes, we should check this.

agossard commented 1 month ago

I was looking at this yesterday. We check sortedness in a asof_join w/o by groups, but explicitly don't check it when there are groups. I'm assuming this is because forcing the entire on column to be sorted is too restrictive... since the actual required condition is that the on column is sorted within each by group.

I could try implementing an up front check for that condition, unless someone else is already on it, or you think it's too complicated for a new contributor.

kszlim commented 1 month ago

I was looking at this yesterday. We check sortedness in a asof_join w/o by groups, but explicitly don't check it when there are groups. I'm assuming this is because forcing the entire on column to be sorted is too restrictive... since the actual required condition is that the on column is sorted within each by group.

I could try implementing an up front check for that condition, unless someone else is already on it, or you think it's too complicated for a new contributor.

Is it possible to check for sortedness within a group or is that too expensive?

agossard commented 1 month ago

I think we might be able to check for this condition in the process of doing the join itself, rather than trying to verify it up front.

My thought was to check this in the AsOfJoinStrategy .next() function, make that function and its caller return PolarsResults, and propagate the error back up to where we can call polars_bail. Let me know if that sounds like going down the wrong path.