Closed magarick closed 1 month ago
In case it's useful information:
While DuckDB does have several non-equi-joins, the planner currently assumes that all equality predicates are more selective than inequalities and just generates a hash join.
Note also that the IEJoin algorithm requires two inequalities and the query only has one.
Single inequalities could be handled by the PieceWiseMergeJoin operator, but PWMJ does not currently handle simple equalities (the logic would just have to be extended to handle NULLs correctly).
https://stackoverflow.com/a/76483604
Originated from: #9376
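The single-inequality case mentioned above can be sketched with a sort plus binary search, which is roughly the idea behind a piecewise merge join. A minimal pure-Python sketch, with invented data (nothing here is DuckDB or Polars API):

```python
import bisect

# Invented data: join on the single inequality x <= y by sorting the
# left values and binary-searching once per right value.
left_x = [5, 1, 9, 3]
right_y = [4, 8]

xs = sorted(left_x)
pairs = []
for y in right_y:
    k = bisect.bisect_right(xs, y)  # count of left values with x <= y
    pairs.extend((x, y) for x in xs[:k])

print(pairs)  # all (x, y) pairs satisfying x <= y
```

Each right value costs one O(log n) search plus output, instead of a full scan.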
Thanks for taking this up!
While a technical discussion on the API is essential (which is way beyond my abilities), I want to put my thoughts on how the average user will specify join conditions. In SQL, we can (rather elegantly) specify join conditions like (pseudocode):
DATETRUNC('month', A.date) + DAYS(10) >= B.date
D.index - C.index + 1 >= 4
DATEDIFF('day', E.date, B.date) <= (D.index)
With or without creating special columns that host these calculations, we can't easily specify something like this in Python, and consequently in Polars. The most natural extension I can think of is using Polars expressions, with some enhancements for column specification, in the join_inequi call. Perhaps a left.<col_name> and a right.<col_name> syntax to specify a list of expressions as the join condition? I'm thinking:

df_left.join(df_right, join_expr = [pl.col("left.D") - pl.col("right.C") > 4])

Any thoughts? Maybe the left. and right. qualifiers are necessary only when the column names are identical and need disambiguation.
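The semantics of a condition like the suggested left.D - right.C > 4 can be spelled out as a cross product plus a filter. A plain-Python sketch with invented rows (join_expr itself is only the proposal above, not an existing Polars parameter):

```python
# Invented rows; the condition mirrors pl.col("left.D") - pl.col("right.C") > 4.
left = [{"D": d} for d in (1, 6, 10)]
right = [{"C": c} for c in (0, 3)]

matches = [
    (l["D"], r["C"])
    for l in left
    for r in right
    if l["D"] - r["C"] > 4  # arbitrary boolean expression over both sides
]

print(matches)
```

This is the reference semantics any faster algorithm has to reproduce; the whole design question is avoiding the cross product.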
While DuckDB does have several non-equi-joins, the planner currently assumes that all equality predicates are more selective than inequalities and just generates a hash join.
This is exactly why I'd like different functions for different concepts, even if everything could be expressed through a generic non-equi join. It's more explicit and clearer than trying to parse out what's happening from a generic expression that handles lots of cases.
Note also that the IEJoin algorithm requires two inequalities and the query only has one.
A single inequality might be doable with a modified asof join. Again, the conceptual differences matter.
Thanks for taking this up!
While a technical discussion on the API is essential (which is way beyond my abilities), I want to put my thoughts on how the average user will specify join conditions. In SQL, we can (rather elegantly) specify join conditions like (pseudocode):

DATETRUNC('month', A.date) + DAYS(10) >= B.date
D.index - C.index + 1 >= 4
DATEDIFF('day', E.date, B.date) <= (D.index)

With or without creating special columns that host these calculations, we can't easily specify something like this in Python, and consequently in Polars. The most natural extension I can think of is using Polars expressions, with some enhancements for column specification, in the join_inequi call. Perhaps a left.<col_name> and a right.<col_name> syntax to specify a list of expressions as the join condition? I'm thinking:

df_left.join(df_right, join_expr = [pl.col("left.D") - pl.col("right.C") > 4])

Any thoughts? Maybe the left. and right. qualifiers are necessary only when the column names are identical and need disambiguation.
This is why I want to start with simple common cases. Something like you're describing will, I think, require a lot more effort and probably more knowledge of the internals than I currently have. I know data.table has x. and i. for referring to columns in another table inside of a join, and I guess it works. But even by my standards it feels magic, awkward, and confusing. A way to refer to columns in the right join table could be nice, but it also sounds like a lot of work for a limited use case. Though the other alternatives I can think of are incredibly awkward or require "stringly typed" functions.
I think the natural starting point is a "range join": it's similar to an asof join with some additional complexity, it's very common to want, and it's a clear semantic category.
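As a sketch of range-join semantics (plain Python with invented data, not Polars API): each point on the left matches every interval on the right that contains it.

```python
# Invented data: points on one side, closed (start, end) intervals on the other.
points = [2, 7, 11]
intervals = [(1, 5), (6, 10)]

result = [
    (p, iv)
    for p in points
    for iv in intervals
    if iv[0] <= p <= iv[1]  # the point falls inside the interval
]

print(result)
```

An asof join keeps only the nearest match per point; a range join keeps every containing interval, which is the extra complexity mentioned above.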
Within this, there are two operations that could potentially be considered distinct enough:

- Point on the left, interval on the right. For example, you could have customer arrival times on the left and the number of employees available on the right. But I don't know if this should be directly checked by the join. It does bring up the question of utilities for handling intervals (combining, splitting to disjoint, checking for overlaps, etc.)
- Interval on both sides. This would be like data.table's foverlaps. That one also allows you to specify a tolerance for near matches, which I think is useful in genomics? They only allow closed intervals. Not sure exactly why.

The author, Arun, mentioned he wrote this code in 4 hours for a conference he was attending. Don't remember where, though; and it looks like it wasn't worked on much after that. That's probably why the tolerance was never implemented and the function isn't very configurable.

It could be implemented by checking the endpoints for each interval on one side against the interval on the other and declaring a match if either check succeeds. This is how data.table does it, and I can't think of anything better off the top of my head.
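For closed intervals, that endpoint check reduces to comparing each start against the other end. A minimal sketch (plain Python, invented data; not an existing Polars or data.table call):

```python
def overlaps(a, b):
    # Closed intervals overlap iff each one starts no later than the other
    # ends; this is equivalent to "some endpoint of one lies in the other".
    return a[0] <= b[1] and b[0] <= a[1]

# A tiny foverlaps-style join: all overlapping pairs across two lists.
xs = [(1, 5), (10, 12)]
ys = [(5, 9), (11, 20)]
pairs = [(x, y) for x in xs for y in ys if overlaps(x, y)]

print(pairs)
```

Note that (1, 5) and (5, 9) match because the intervals are closed, which is exactly the closed-interval behaviour attributed to foverlaps above.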
cmdlineuser came up with a very fast implementation in py-polars based on existing functions. You can check it out: https://github.com/pola-rs/polars/issues/9467 It was able to closely match a DuckDB solution that used interval joins.
That doesn't look like it even needs a join. But it does need some interval manipulation tools, which I think there's also demand for. Could be worth filing a separate issue.
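One such interval tool, combining overlapping intervals into disjoint ones, can be sketched in a few lines (plain Python, invented data; this is not an existing Polars function):

```python
def merge_intervals(ivs):
    # Sort by start, then fold each interval into the previous one
    # whenever it overlaps it; otherwise start a new disjoint interval.
    out = []
    for s, e in sorted(ivs):
        if out and s <= out[-1][1]:
            out[-1] = (out[-1][0], max(out[-1][1], e))
        else:
            out.append((s, e))
    return out

print(merge_intervals([(2, 6), (1, 3), (8, 10)]))
```

Splitting into disjoint pieces and overlap checks, the other utilities mentioned above, fall out of the same sort-and-sweep pattern.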
Cross-linking https://github.com/pola-rs/polars/issues/6856. Interval joins are the last missing piece to replace pyranges in all of my code.
Just a 👍 as this seems like a very useful feature that would help my use cases a fair bit.
+1
This would enable kdb's wj (window join), the last feature I need to replace kdb.
Hi @ritchie46, the Polars 1.0 release announcement mentioned non-equi join support under "Other short term plans". Are you able to provide any more details on those plans here? I was thinking about implementing the first join subtype mentioned in this issue description (match a value from one table with an interval in the other), but won't do anything if someone is already planning on working on this.
Hey @adamreeve, sorry for the late reply; missed this. Yes, I want to implement this join type: https://vldb.org/pvldb/vol8/p2074-khayyat.pdf
This includes multiple range joins. I am not entirely sure about the interface yet, but the backend can already be started. Can you maybe ping me on discord?
Hi @ritchie46, does this mean I can hope that in the future any expression that returns a boolean can be a valid join condition? e.g. pl.col('a').floordiv(42).is_between('x', 'y')?
I've started working on this and have a working implementation of the IEJoin algorithm: https://github.com/adamreeve/polars/compare/main...iejoin
The Khayyat et al. paper doesn't account for duplicate values so I've also used some ideas from the DuckDB article.
I've just hacked this in as a DataFrame method initially to allow easy testing, but will start looking into integrating it as a proper join type.
I've opened a draft PR that adds the IEJoin type but I think the API needs some discussion before this will be ready: https://github.com/pola-rs/polars/pull/18365
Added in #18365
Problem description
I know there have been a few issues raised about this before. I'd like to consolidate planning here, since I'm going to start working on this but it's large and will have to proceed in parts. It's probably a good idea to get agreement on what the API should look like too. Further, there are sub-types of non-equi join that are conceptually distinct enough that they should probably have their own functions, which also makes it easier to implement them incrementally; it's likely faster to implement them separately than to try to do it all one way. Obviously, teaming up on this would be great too.
Subtypes noted so far

- A point on one side matched against an interval on the other (range join)
- Intervals on both sides (overlap join, like data.table's foverlaps)

So, does it make sense to have different functions for conceptually different joins? The underlying algorithm may end up being the same, since DuckDB seems to claim they can do it all super fast with one type of join (https://duckdb.org/2022/05/27/iejoin.html, https://vldb.org/pvldb/vol8/p2074-khayyat.pdf)
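For reference, the result an IEJoin produces for two inequality predicates can be spelled out as a naive nested loop (invented rows; IEJoin computes the same pairs much faster using sorted orders and a bit array, per the paper linked above):

```python
# Invented tables: rows of (a, b).
t1 = [(1, 9), (4, 7)]
t2 = [(2, 8), (5, 3)]

pairs = [
    (i, j)
    for i, (a1, b1) in enumerate(t1)
    for j, (a2, b2) in enumerate(t2)
    if a1 < a2 and b1 > b2  # the two inequality predicates IEJoin requires
]

print(pairs)  # matching (row index in t1, row index in t2) pairs
```

Any single-function design would need to recognise which of the subtypes above a given pair of predicates encodes, which is part of the argument for separate, explicit functions.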