pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Non-equi join tracking issue #10068

Closed magarick closed 1 month ago

magarick commented 1 year ago

Problem description

I know there have been a few issues raised about this before. I'd like to consolidate planning here, since I'm going to start working on this but it's large and will have to proceed in parts. It would probably be a good idea to agree on what the API should look like too. Further, some sub-types of non-equi join are conceptually distinct enough that they should probably have their own functions, which would also make it easier to implement them incrementally, and likely faster than trying to do it all one way. Obviously, teaming up on this would be great too.

Subtypes noted so far

  1. Match values in the left table to intervals that contain them in the right. This is similar to an asof join, especially since in my experience you often want the intervals in the right table to be disjoint, though missing ranges are usually fine. In that case I've called this an annotation or tag join before. When intervals can overlap, you might want to take the first or last overlapping one for a value, but defining "first" and "last" requires care.
  2. Left and right both have intervals which match when they overlap. Similar to 1 since a match is when either endpoint of an interval falls within an interval on the other side.
  3. Keys match if their difference is within some range, or interval overlaps with tolerance (like foverlaps)

So, does it make sense to have different functions for conceptually different joins? The underlying algorithm may end up being the same, since DuckDB claims it can do all of this quickly with a single join type (https://duckdb.org/2022/05/27/iejoin.html, https://vldb.org/pvldb/vol8/p2074-khayyat.pdf).
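[Editor's sketch, not from the thread.] Subtype 1 above (the "tag join") is easy to state as a naive baseline, which also makes the semantics concrete. This is plain Python with illustrative names, using half-open `[lo, hi)` intervals; a real implementation would sort or index rather than scan all pairs.

```python
def tag_join(points, intervals):
    # O(n*m) reference: match each left point to every right interval
    # [lo, hi) that contains it. Real engines would sort or use an
    # interval tree instead of a nested loop.
    out = []
    for p in points:
        for lo, hi, tag in intervals:
            if lo <= p < hi:
                out.append((p, tag))
    return out

left = [3, 7, 12]
right = [(0, 5, "a"), (5, 10, "b"), (10, 15, "c")]
print(tag_join(left, right))  # [(3, 'a'), (7, 'b'), (12, 'c')]
```

With disjoint right intervals, each point matches at most once; the overlap-preference questions below only arise once that assumption is dropped.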

cmdlineluser commented 1 year ago

In case it's useful information:

> While DuckDB does have several non-equi-joins, the planner currently assumes that all equality predicates are more selective than inequalities and just generates a hash join.
>
> Note also that the IEJoin algorithm requires two inequalities and the query only has one.
>
> Single inequalities could be handled by the PieceWiseMergeJoin operator, but PWMJ does not currently handle simple equalities (the logic would just have to be extended to handle NULLs correctly).

https://stackoverflow.com/a/76483604

Originated from: #9376

avimallu commented 1 year ago

Thanks for taking this up!

While a technical discussion on the API is essential (which is way beyond my abilities), I want to put down my thoughts on how the average user will specify join conditions. In SQL, we can (rather elegantly) specify join conditions like (pseudocode):

DATETRUNC('month', A.date) + DAYS(10) >= B.date
D.index - C.index + 1 >= 4
DATEDIFF('day', E.date, B.date) <= (D.index)

With or without creating special columns that host these calculations, we can't easily specify something like this in Python, and consequently in Polars. The most natural extension I can think of is using Polars expressions with some enhancements for column specification in the join_inequi call. Perhaps a left.<col_name> and a right.<col_name> syntax to specify a list of expressions as the join condition? I'm thinking:

df_left.join(df_right, join_expr = [pl.col("left.D") - pl.col("right.C") > 4])

Any thoughts? Maybe the left. and right. qualifiers are necessary only when the column names are identical and need disambiguation.

magarick commented 1 year ago

> While DuckDB does have several non-equi-joins, the planner currently assumes that all equality predicates are more selective than inequalities and just generates a hash join.

This is exactly why I'd like different functions for different concepts even if everything could be expressed through a generic non-equi join. It's more explicit and clear than trying to parse out what's happening from a generic expression that handles lots of cases.

> Note also that the IEJoin algorithm requires two inequalities and the query only has one.

Single inequality might be doable with a modified asof join. Again, conceptual differences matter.

magarick commented 1 year ago

> Thanks for taking this up!
>
> While a technical discussion on the API is essential (which is way beyond my abilities), I want to put down my thoughts on how the average user will specify join conditions. In SQL, we can (rather elegantly) specify join conditions like (pseudocode):
>
> DATETRUNC('month', A.date) + DAYS(10) >= B.date
> D.index - C.index + 1 >= 4
> DATEDIFF('day', E.date, B.date) <= (D.index)
>
> With or without creating special columns that host these calculations, we can't easily specify something like this in Python, and consequently in Polars. The most natural extension I can think of is using Polars expressions with some enhancements for column specification in the join_inequi call. Perhaps a left.<col_name> and a right.<col_name> syntax to specify a list of expressions as the join condition? I'm thinking:
>
> df_left.join(df_right, join_expr = [pl.col("left.D") - pl.col("right.C") > 4])
>
> Any thoughts? Maybe the left. and right. qualifiers are necessary only when the column names are identical and need disambiguation.

This is why I want to start with simple common cases. Something like you're describing will, I think, require a lot more effort and more knowledge of the internals than I currently have. I know data.table has x. and i. for referring to columns in the other table inside of a join, and I guess it works, but even by my standards it feels magic, awkward, and confusing. A way to refer to columns in the right join table could be nice, but it also sounds like a lot of work for a limited use case. Though the other alternatives I can think of are incredibly awkward or require "stringly typed" functions.

magarick commented 1 year ago

I think the natural starting point is a "range join": it's similar to an asof join with some additional complexity, it's very common to want, and it's a clear semantic category.

Within this, there are two operations that could potentially be considered distinct enough:

Point on the left, interval on the right:

  1. In addition to any equality terms we want $L \in [R_1, R_2)$ where the intervals could also be open, closed, or right-closed.
  2. In an asof join there's a clear "nearest" point, but that's no longer the case here. However, in many cases you have a preference for which interval matches $L$ when several overlap. If you don't want all matches, you might want the row where $R_1$ is the largest value $\leq L$ or $R_2$ is the smallest $\geq L$. In this case, it's an asof join with variable look{ahead,behind}.
  3. Sometimes you only want to join if each point on the left uniquely matches an interval on the right, often by making sure the intervals are disjoint per group. I've done this kind of tagging a lot with data.table. For example, you could have customer arrival times on the left and the number of employees available on the right. But I don't know if this should be directly checked by the join. It does raise the question of utilities for handling intervals (combining, splitting to disjoint, checking for overlaps, etc.)
  4. In my experience, the right table is usually much smaller than the left for this case. Maybe that makes a difference.
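[Editor's sketch, not from the thread.] The point-in-interval case above can be sketched with a sort plus binary search when the right intervals are disjoint and sorted by start. All names are illustrative, not a proposed Polars API; `closed_right` switches between $[R_1, R_2)$ and $[R_1, R_2]$ semantics.

```python
import bisect

def point_interval_join(points, starts, ends, closed_right=False):
    # For each point, find the index of the containing interval, or None.
    # Assumes disjoint intervals sorted by start (the common "tag join"
    # situation described above).
    matches = []
    for p in points:
        i = bisect.bisect_right(starts, p) - 1  # last interval with start <= p
        inside = i >= 0 and (p <= ends[i] if closed_right else p < ends[i])
        matches.append(i if inside else None)
    return matches

starts, ends = [0, 5, 10], [5, 10, 15]
print(point_interval_join([3, 5, 20], starts, ends))  # [0, 1, None]
```

With overlapping intervals, the `bisect_right` step would instead bound a scan over candidates, and the "largest $R_1 \leq L$" preference falls out naturally by taking the last candidate.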

Interval on both sides:

  1. This would be like data.table's foverlaps. That one also allows you to specify a tolerance for near matches, which I think is useful in genomics? It only allows closed intervals; not sure exactly why.
  2. It could be implemented by checking the endpoints for each interval on one side with the interval of the other and declaring a match if either succeeds. This is how data.table does it and I can't think of anything better off the top of my head.
  3. However, this type of join also adds the complexity of different types of overlap. You might want to exclude cases where one interval completely contains another, for instance. I don't know if this is something that needs to be done in the first iteration.
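[Editor's sketch, not from the thread.] For closed intervals, the overlap test reduces to a single condition: $[a_1, a_2]$ and $[b_1, b_2]$ overlap iff $a_1 \leq b_2$ and $b_1 \leq a_2$, which covers the containment case the endpoint-by-endpoint check has to handle specially. A tolerance (as in foverlaps) can be folded in by widening one side. Names here are illustrative only, and this is the naive O(n*m) form:

```python
def overlap_join(left, right, tol=0):
    # Closed-interval overlap join: emit (left_idx, right_idx) for every
    # pair [a1, a2], [b1, b2] with a1 <= b2 + tol and b1 <= a2 + tol.
    out = []
    for i, (a1, a2) in enumerate(left):
        for j, (b1, b2) in enumerate(right):
            if a1 <= b2 + tol and b1 <= a2 + tol:
                out.append((i, j))
    return out

print(overlap_join([(1, 3)], [(3, 5), (6, 8)]))    # [(0, 0)]
print(overlap_join([(1, 3)], [(6, 8)], tol=3))     # [(0, 0)]
```

Distinguishing overlap types (e.g. excluding full containment) would just add predicates on the endpoints at the match site.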
avimallu commented 1 year ago
  1. This would be like data.table's foverlaps. That one also allows you to specify a tolerance for near matches which I think is useful in genomics? They only allow closed intervals. Not sure exactly why.

The author, Arun, mentioned he wrote this code in 4 hours for a conference he was attending; I don't remember where, though, and it looks like it wasn't worked on much after that. That's probably why the tolerance was never implemented and the function isn't very configurable.

> 2. It could be implemented by checking the endpoints for each interval on one side with the interval of the other and declaring a match if either succeeds. This is how data.table does it and I can't think of anything better off the top of my head.

cmdlineluser came up with a very fast implementation in py-polars based on existing functions. You can check it out at https://github.com/pola-rs/polars/issues/9467. It was able to closely match a DuckDB solution that used interval joins.

magarick commented 1 year ago

> cmdlineluser came up with a very fast implementation in py-polars based on existing functions. You can check it out: #9467 It was able to closely match a DuckDB solution that used interval joins.

That doesn't look like it even needs a join. But it does need some interval manipulation tools, which I think there's also demand for. Could be worth filing a separate issue.
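[Editor's sketch, not from the thread.] One of the interval-manipulation utilities mentioned here, collapsing possibly-overlapping closed intervals into disjoint ones, is small enough to sketch in plain Python (illustrative names only):

```python
def merge_intervals(intervals):
    # Sort by start, then absorb each interval into the previous one
    # whenever it starts before (or at) the previous end.
    merged = []
    for lo, hi in sorted(intervals):
        if merged and lo <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], hi)
        else:
            merged.append([lo, hi])
    return [tuple(iv) for iv in merged]

print(merge_intervals([(5, 7), (1, 3), (2, 6)]))  # [(1, 7)]
```

Splitting to disjoint pieces and checking for overlaps per group would follow the same sort-and-sweep pattern.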

Hoeze commented 1 year ago

Cross-linking https://github.com/pola-rs/polars/issues/6856. Interval joins are the last missing piece to replace pyranges in all of my code.

kszlim commented 6 months ago

Just a 👍 as this seems like a very useful feature that would help my use cases a fair bit.

Nicolas-SB commented 4 months ago

+1

nikita-balyschew-db commented 4 months ago

+1

jshinonome commented 4 months ago

+1

This would allow performing kdb's wj (window join), the last feature I need in order to replace kdb.

adamreeve commented 3 months ago

Hi @ritchie46, the Polars 1.0 release announcement mentioned non-equi join support under "Other short term plans". Are you able to provide any more details on those plans here? I was thinking about implementing the first join subtype mentioned in this issue description (match a value from one table with an interval in the other), but won't do anything if someone is already planning on working on this.

ritchie46 commented 2 months ago

Hey @adamreeve, sorry for the late reply. Missed this. Yes, I want to implement this join type https://vldb.org/pvldb/vol8/p2074-khayyat.pdf

This includes multiple range joins. I am not entirely sure about the interface yet, but work on the backend can already be started. Can you maybe ping me on Discord?

iliya-malecki commented 2 months ago

Hi @ritchie46, does this mean I can hope that in the future any expression that returns a boolean can be a valid join condition? E.g. pl.col('a').floordiv(42).is_between('x', 'y')?

adamreeve commented 2 months ago

I've started working on this and have a working implementation of the IEJoin algorithm: https://github.com/adamreeve/polars/compare/main...iejoin

The Khayyat et al. paper doesn't account for duplicate values, so I've also used some ideas from the DuckDB article.
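[Editor's sketch, not from the thread, and not the IEJoin itself.] The shape of a two-inequality join like `left.a < right.a AND left.b > right.b` can be shown with a simpler sort-plus-binary-search approach that also handles duplicates; IEJoin achieves the same result with sorted permutation arrays and a bitmap instead. All names are illustrative:

```python
import bisect

def ineq_join(left, right):
    # left, right: lists of (a, b) tuples. Emits (left_idx, right_idx)
    # for every pair with left.a < right.a and left.b > right.b.
    order_l = sorted(range(len(left)), key=lambda i: left[i][0])
    order_r = sorted(range(len(right)), key=lambda j: right[j][0])
    bs, ids = [], []   # left b-values seen so far (sorted) + their indices
    out, k = [], 0
    for j in order_r:
        ra, rb = right[j]
        # Admit every left row whose a is strictly below this right row's a.
        while k < len(order_l) and left[order_l[k]][0] < ra:
            i = order_l[k]
            pos = bisect.bisect_left(bs, left[i][1])
            bs.insert(pos, left[i][1])
            ids.insert(pos, i)
            k += 1
        # Everything strictly above rb in the sorted b-list is a match.
        for pos in range(bisect.bisect_right(bs, rb), len(bs)):
            out.append((ids[pos], j))
    return out

L = [(1, 10), (2, 5), (4, 8)]
R = [(3, 6), (5, 7)]
print(sorted(ineq_join(L, R)))  # [(0, 0), (0, 1), (2, 1)]
```

The `bisect_right` boundary makes ties on `b` behave correctly for the strict comparison, which is the duplicate-handling wrinkle mentioned above.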

I've just hacked this in as a DataFrame method initially to allow easy testing, but will start looking into integrating it as a proper join type.

adamreeve commented 1 month ago

I've opened a draft PR that adds the IEJoin type but I think the API needs some discussion before this will be ready: https://github.com/pola-rs/polars/pull/18365

ritchie46 commented 1 month ago

Added in #18365