pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
30.16k stars 1.94k forks source link

Not able to filter on JOIN level for SQL interface #18654

Open HectorPascual opened 1 month ago

HectorPascual commented 1 month ago

Checks

Reproducible example

import polars as pl

df1 = pl.DataFrame({"c1": ["aaa", "hello", "xxx"]})
df2 = pl.DataFrame({"c1": ["bbb", "hello", "xxx"]})

pl.sql("""
  SELECT * FROM df1
  LEFT JOIN df2
    ON df1.c1 = df2.c1 AND df1.c1 = 'hello'
""").collect()

Log output

JOIN clauses support '=' constraints on identifiers; found lhs=CompoundIdentifier([Ident...

Issue description

Hello,

I am apparently unable to run filters while joining, a possible workaround is to bring the filter down to a WHERE clause.

I can append AND clauses to the JOIN statement only if a column is involved in both sides (i.e : AND col1=col2 AND col3=col4) but if attempting to filter with a value it fails with the following error :

JOIN clauses support '=' constraints on identifiers; found lhs=CompoundIdentifier([Ident

Expected behavior

I would expect to be able to LEFT JOIN ON col1=col2 AND col1="hello".

Installed versions

``` --------Version info--------- Polars: 1.6.0 Index type: UInt32 Platform: macOS-14.5-arm64-arm-64bit Python: 3.10.2 (main, Jul 8 2024, 10:42:19) [Clang 15.0.0 (clang-1500.3.9.4)] ----Optional dependencies---- adbc_driver_manager altair cloudpickle connectorx deltalake 0.19.1 fastexcel fsspec 2024.6.1 gevent great_tables matplotlib nest_asyncio numpy 1.26.4 openpyxl pandas 2.2.2 pyarrow 17.0.0 pydantic 2.9.0b2 pyiceberg sqlalchemy torch xlsx2csv xlsxwriter ```
alexander-beedie commented 1 month ago

Thanks. Will be looking at joins shortly (to add support for the new inequality join types), so will dig in to this. (Updated the given example so it can run as-is).

HectorPascual commented 1 month ago

Thanks for the fast reply.

If I move the filter to the WHERE clause, would the execution plan / performance be similar? Thanks!

Also I realized it also breaks in the following case (when applying a function to a column in the join) :

import polars as pl

df1 = pl.DataFrame({"c1": ["aaa", "hello", "xxx"],"c2":["1","0","0"]})
df2 = pl.DataFrame({"c1": ["bbb", "hello", "xxx"], "c2":[1,0,0]})

pl.sql("""
  SELECT * FROM df1
  LEFT JOIN df2
    ON df1.c1 = df2.c1 AND df1.c2 = CAST(df1.c2 as STRING)
""").collect()
alexander-beedie commented 1 day ago

If I move the filter to the WHERE clause, would the execution plan / performance be similar? Thanks!

Should be identical ;)