pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

perf: fast path to generate group idxs for vanilla int_range in group_by_dynamic #19932

Open jvdd opened 5 days ago

jvdd commented 5 days ago

Performance optimization of group_by_dynamic when passing a vanilla int_range (i.e., start=0, step=1) as index_column. If we know that the index column for the dynamic group by is an int range, we can compute the group indices directly (because there is a fixed step between the index values) and thus avoid the slow group_by_windows function.
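To illustrate the idea, here is a minimal sketch of how group slices can be derived arithmetically when the index is known to be `0..n` with step 1. It is not the code from this PR: the function name `int_range_group_slices`, its integer `every`/`period`/`offset` parameters, and the left-closed window convention are all assumptions made for the example.

```rust
/// Hypothetical helper (not part of polars): returns `(first_row, len)` for
/// every non-empty window over an index column assumed to be `0..n`, step 1.
///
/// Window `k` is assumed to cover `[offset + k * every, offset + k * every + period)`,
/// i.e. closed on the left and open on the right.
fn int_range_group_slices(
    n: usize,
    every: usize,
    period: usize,
    offset: i64,
) -> Vec<(usize, usize)> {
    assert!(every > 0, "`every` must be positive");
    let n = n as i64;
    let mut groups = Vec::new();
    let mut lower = offset;
    while lower < n {
        let upper = lower + period as i64;
        // Because index value == row number, the window bounds clamp
        // directly to row positions; no search over the column is needed.
        let first = lower.max(0);
        let end = upper.min(n);
        if end > first {
            groups.push((first as usize, (end - first) as usize));
        }
        lower += every as i64;
    }
    groups
}
```

For example, `int_range_group_slices(10, 3, 3, 0)` yields `[(0, 3), (3, 3), (6, 3), (9, 1)]`: each window's rows follow from its bounds alone, which is why a scan of the index values can be skipped in this special case.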

Any feedback to improve this PR is welcome :)

Further enhancements (that could? be done):


Using the updated code, I observe ~10x performance improvements on my machine.

codecov[bot] commented 5 days ago

Codecov Report

Attention: Patch coverage is 17.54386% with 94 lines in your changes missing coverage. Please review.

Project coverage is 79.47%. Comparing base (414d883) to head (f98b8fc). Report is 37 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| crates/polars-time/src/windows/group_by.rs | 0.00% | 81 Missing :warning: |
| crates/polars-time/src/group_by/dynamic.rs | 60.00% | 10 Missing :warning: |
| crates/polars-lazy/src/frame/mod.rs | 62.50% | 3 Missing :warning: |
Additional details and impacted files

```diff
@@             Coverage Diff             @@
##             main   #19932       +/-   ##
===========================================
+ Coverage   59.28%   79.47%   +20.18%
===========================================
  Files        1555     1562        +7
  Lines      216180   217047      +867
  Branches     2456     2459        +3
===========================================
+ Hits       128155   172488    +44333
+ Misses      87467    44000    -43467
- Partials      558      559        +1
```

:umbrella: View full report in Codecov by Sentry.