Open luukschagen opened 1 year ago
There's an issue of what polars needs to know about the frame before it can do these sorts of optimizations. As Richie has mentioned before, there are no join-ordering optimizations in polars yet, partly because a LazyFrame would need some idea of cardinalities before knowing what to swap.
In this case, if the tables are the same size and fairly large (I increased the sample size to 3 million), my benchmarks show the "optimized" path above as 30% slower. As you shrink the size of other_df, your suggested approach gains speed and eventually becomes faster. Since the gain/loss depends on relative table sizes and possibly other considerations, this would require some research into when it's worthwhile, at least in the DataFrame context.
For LazyFrames, this would require cardinality estimation (and possibly more) that currently doesn't exist, and as far as I know isn't on the horizon.
For the default engine (which has all the data in memory), we do determine join order JIT. But I think we must first understand why the applied filter is faster. If we understand that, then we can maybe add an optimization by sampling a cardinality estimate.
However, as @mishpat shows, this is definitely not something we need to do all the time. At the moment this isn't a top priority, but when I have some spare time, I could investigate a bit more here.
I didn't mean to sound too negative on the overall goal: I work in a join-heavy environment, so any optimizations to that would be fantastic for me as well!
Checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of Polars.
Reproducible example
Output:
Issue description
When left joining a relatively small table (query_df) to a larger table (other_df) in order to select and join a number of relevant entries from the larger table, the join can be sped up by first filtering down the larger 'other_df' with an 'is_in' filter before doing the join.
I encountered this by accident. My assumption was that the internal implementation of polars would in effect do the same thing, so that explicitly doing the filter as an additional step could only slow down the join.
For completeness I also tried the lazy version, which doesn't seem to change the performance in this case, though I'm not 100% sure if my implementation of the lazy 'is_in' is the optimal/canonical one.
Expected behavior
My expectation would be that a left join from a smaller table to a larger table (in effect 'selecting' the data from the larger table) is optimized to filter the large table as efficiently as possible, so that trivially adding a filter step before the join cannot improve performance (at least not in 'eager' mode).
Installed versions