pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Join_where doesn't support multiple binary comparisons in a single Expr #18751

Closed ion-elgreco closed 1 month ago

ion-elgreco commented 1 month ago

Reproducible example

import polars as pl
import datetime
import random

def random_date(start, end):
    """Generate a random datetime between `start` and `end`"""
    return start + datetime.timedelta(
        # Get a random amount of seconds between `start` and `end`
        seconds=random.randint(0, int((end - start).total_seconds())),
    )

df = pl.DataFrame({
    "id":list(range(0,1000))*1500,
    "start_date": [random_date(datetime.datetime(2015,1,1), datetime.datetime(2020,1,1),) for i in range(1500000)]
}).with_columns(end_date = pl.col('start_date') + pl.duration(hours=random.randint(24,240)))

parts = df.sample(1700).select(
    'id',
    pl.concat_list(pl.col('start_date','end_date')).list.mean().alias('date').cast(pl.Datetime)
)

df.lazy().join_where(
    parts.lazy(),
    (pl.col('id') == pl.col('id'))
    & (pl.col('start_date') <= pl.col('date'))
    & (pl.col('end_date') >= pl.col('date')),
).collect(streaming=True)

InvalidOperationError: only 1 binary comparison allowed as join condition

Log output

No response

Issue description

When the join conditions are combined into a single predicate with `&`, `join_where` raises an `InvalidOperationError`.

Expected behavior

Allow a single Expr that contains multiple binary comparisons

Installed versions

```
--------Version info---------
Polars:              1.7.1
Index type:          UInt32
Platform:            Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.39
Python:              3.10.14 (main, Aug 14 2024, 05:11:29) [Clang 18.1.8 ]

----Optional dependencies----
adbc_driver_manager
altair
cloudpickle
connectorx
deltalake            0.19.1
fastexcel
fsspec
gevent
great_tables
matplotlib
nest_asyncio         1.6.0
numpy                1.22.2
openpyxl
pandas
pyarrow              17.0.0
pydantic
pyiceberg
sqlalchemy
torch
xlsx2csv
xlsxwriter
```

ritchie46 commented 1 month ago

This is by design: it isn't supported yet, which is why we throw the error.

You must pass them as separate expressions.