pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Supporting boolean returning functions/methods in `join_where` #19654

Open henryharbeck opened 1 week ago

henryharbeck commented 1 week ago

Description

Consider the examples below:

import polars as pl

# Example 1
urls = pl.DataFrame({"url": "abcd.com/page"})
categories = pl.DataFrame({"base_url": "abcd.com", "category": "landing page"})
urls.join_where(categories, pl.col("url").str.starts_with(pl.col("base_url")))
# InvalidOperationError: only 1 binary comparison allowed as join condition

# Must resort to cross join then filter instead - produces expected result
urls.join(categories, how="cross").filter(pl.col("url").str.starts_with(pl.col("base_url")))

# Example 2
a = pl.DataFrame({"change": [1, -5]})
b = pl.DataFrame({"sets": [[0, 1], [2, 3]], "category": ["bad", "good"]})
a.join_where(b, pl.col("change").is_in(pl.col("sets")))
# InvalidOperationError: only 1 binary comparison allowed as join condition

# Must resort to cross join then filter instead - produces expected result
a.join(b, how="cross").filter(pl.col("change").is_in(pl.col("sets")))

Requested based on this Stack Overflow question.

ritchie46 commented 1 week ago

Yes, we will. We first need to support a nested loop join, so that you don't need to hold the full Cartesian product in memory.
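
To illustrate the idea behind a nested loop join, here is a minimal pure-Python sketch. It is not Polars' implementation, and `nested_loop_join` is a hypothetical helper written only for this illustration. It evaluates an arbitrary boolean predicate on one pair of rows at a time and yields only the matches, so the full cross product is never stored. The data mirrors the two examples from the description.

```python
from typing import Any, Callable, Dict, Iterable, Iterator

Row = Dict[str, Any]


def nested_loop_join(
    left: Iterable[Row],
    right: Iterable[Row],
    predicate: Callable[[Row, Row], bool],
) -> Iterator[Row]:
    """Hypothetical helper: yield merged rows for which predicate(l, r) is true.

    Only one candidate pair is examined at a time, so memory use is bounded
    by the inputs plus the matching rows, not by len(left) * len(right).
    """
    right_rows = list(right)  # the inner side is scanned once per outer row
    for l in left:
        for r in right_rows:
            if predicate(l, r):
                yield {**l, **r}


# Example 1 from the description: prefix match on URLs
urls = [{"url": "abcd.com/page"}]
categories = [{"base_url": "abcd.com", "category": "landing page"}]
url_matches = list(
    nested_loop_join(urls, categories, lambda l, r: l["url"].startswith(r["base_url"]))
)

# Example 2 from the description: membership in a list column
a = [{"change": 1}, {"change": -5}]
b = [{"sets": [0, 1], "category": "bad"}, {"sets": [2, 3], "category": "good"}]
set_matches = list(nested_loop_join(a, b, lambda l, r: l["change"] in r["sets"]))

print(url_matches)
print(set_matches)
```

The trade-off is the same as in the cross-join workaround in terms of work done (every pair is still tested), but only matching pairs are kept, which is the memory benefit the comment above refers to.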