pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
30.47k stars 1.98k forks source link

Support additional join types with join_where #18669

Open adamreeve opened 2 months ago

adamreeve commented 2 months ago

Description

The join_where method added in #18365 behaves like an inner join, where only rows from the left and right DataFrames that match the condition are returned.

It would be useful to support other join types like left, full, anti and semi.

I'm not sure how feasible it is to implement this, it seems like it would be straightforward when the join resolves to a single equi join, or single iejoin, but handling additional filter conditions could be tricky.

ritchie46 commented 2 months ago

For equality predicates this is fairly trivial. Choose the equality predicates as outer join and add the remaining as filters. For inequality I am not entirely sure how to interpret that yet. :thinking:

I think the first goal is to support the single inequality predicate case, as that's low hanging fruit atm!

AlexeyDmitriev commented 2 months ago

I think interpretation should be the same as with ON statement in SQL:

You return all the pair of rows that match all the condiitions in on (condition is calculated in terms of result)

But for left/join/outer join you'll return also rows from left/right/both table that don't have a pair

adamreeve commented 2 months ago

I think the first goal is to support the single inequality predicate case, as that's low hanging fruit atm!

Yep I've made a start on this, I should hopefully have something to show soon

adamreeve commented 3 days ago

I've started looking into this and have pushed a commit with my idea of what the Python API and behaviour should look like: https://github.com/adamreeve/polars/commit/7b35d0984e73a2d16aa0d3bfcd4fa714cf13b3d0

Actually implementing this looks like it could be a fair bit of work though. For a join on equality conditions only, we could trivially convert this to an inner, left, right, full, semi or outer join. And for a join using one or two inequalities, we could fairly easily update the way join results are materialised in the IEJoin implementation to handle the different join types. But any joins that end up with extra conditions in remaining_preds in resolve_join_where would be trickier to handle.

It looks like rather than adding an extra post-processing node after the join, the best way forward would be to add support for extra non-equality predicates to all of the join implementations in order to handle this correctly and efficiently.

Consider using the existing approach of a post-processing filter for left joins for example. If we do a left join or a left IE-join and then want to apply extra join predicates in a post-processing step, we could replace values in RHS table columns with nulls where any of the new conditions are false, but would end up with extra result rows for one LHS table row if it matches with multiple RHS rows and some are filtered out in post-processing. We could probably work around that by adding a temporary row index column, but this would get pretty clunky. Semi and anti joins would be more of a problem, we'd need to do an inner join first to keep enough information to apply the extra predicates, making this a lot less efficient than it should be.