pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
27.24k stars 1.67k forks source link

feat(rust): add cluster_with_columns plan optimization #16274

Closed coastalwhite closed 3 weeks ago

coastalwhite commented 1 month ago

This adds a new optimization pass for the query engine that clusters several sequential WITH COLUMNS calls.

When the optimizer spots a chain of WITH COLUMNS, it tries to pull expressions as close to the source as possible. If all expressions can be pulled up closer to the source, the WITH COLUMNS node is removed entirely.

ritchie46 commented 1 month ago

When the optimizer spots a chain of WITH COLUMNS, it tries to pull expressions as close to the source as possible.

I think we must try to pull as close to the first WITH_COLUMNS. Closer to the source isn't per se beneficial, as the crossing operations like Filter and Join make those node more expensive.

coastalwhite commented 1 month ago

Currently, we only optimize sequential WITH_COLUMNS calls if there is nothing in between them. So at the moment, closer to source is roughly equal to first WITH_COLUMNS.

coastalwhite commented 1 month ago

I think this CI failure is a false negative

codecov[bot] commented 1 month ago

Codecov Report

Attention: Patch coverage is 88.84892% with 31 lines in your changes are missing coverage. Please review.

Project coverage is 81.41%. Comparing base (ec08d76) to head (f9bee6f). Report is 4 commits behind head on main.

Files Patch % Lines
crates/polars-arrow/src/bitmap/bitmap_ops.rs 55.17% 26 Missing :warning:
crates/polars-arrow/src/bitmap/immutable.rs 0.00% 3 Missing :warning:
...src/logical_plan/optimizer/cluster_with_columns.rs 99.35% 1 Missing :warning:
crates/polars-plan/src/logical_plan/visitor/lp.rs 0.00% 1 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #16274 +/- ## ========================================== + Coverage 81.39% 81.41% +0.01% ========================================== Files 1406 1409 +3 Lines 183953 184497 +544 Branches 2958 2960 +2 ========================================== + Hits 149731 150206 +475 - Misses 33709 33776 +67 - Partials 513 515 +2 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

coastalwhite commented 1 month ago

I think this PR is mostly ready. Some points to specifically look at are.

ritchie46 commented 1 month ago

Nice one! I have left a few comments.

coastalwhite commented 3 weeks ago

Maybe wait with merging this. I still need to add the traversal to check whether the optimization pass is even necessary

coastalwhite commented 3 weeks ago

Okay! This is ready to get merged :rocket:

codspeed-hq[bot] commented 3 weeks ago

CodSpeed Performance Report

Merging #16274 will not alter performance

Comparing coastalwhite:opt-cluster-with-columns (f9bee6f) with main (4375930)

Summary

✅ 37 untouched benchmarks