rust-itertools / itertools

Extra iterator adaptors, iterator methods, free functions, and macros.
https://docs.rs/itertools/
Apache License 2.0

Implement `MergeBy::fold` #920

Closed kinto-b closed 5 months ago

kinto-b commented 5 months ago

Relates to #755

cargo bench --bench specializations merge_by/fold

merge_by/fold           time:   [741.63 ns 743.35 ns 745.12 ns]
                        change: [-75.583% -75.488% -75.382%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe

codecov[bot] commented 5 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 94.42%. Comparing base (6814180) to head (ca57605). Report is 55 commits behind head on master.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master     #920      +/-   ##
==========================================
+ Coverage   94.38%   94.42%   +0.03%
==========================================
  Files          48       48
  Lines        6665     6872     +207
==========================================
+ Hits         6291     6489     +198
- Misses        374      383       +9
```


kinto-b commented 5 months ago

Oh, saving the baseline is so useful, thanks for pointing that out! Compared to master, the change above actually makes it a little slower:

$ cargo bench --bench specializations merge_by/fold -- --baseline merge_by

merge_by/fold           time:   [1.0928 µs 1.1066 µs 1.1247 µs]
                        change: [-65.005% -64.657% -64.310%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) high mild
  7 (7.00%) high severe

I've noticed in the past that a `match` can often be much faster than seemingly equivalent combinator expressions, so perhaps the compiler has an easier time with the more verbose code.
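
For readers following along, the "verbose `match`" shape under discussion looks roughly like this std-only sketch (the function name and signatures are illustrative, not itertools' actual internals): the two buffered heads are matched as a pair, and once one side runs dry the other is drained with its own `fold`, which lets the inner iterator's specialization kick in.

```rust
// Hedged, std-only sketch of a fold over a "merge by" of two iterators.
// Names and signatures are illustrative; this is not itertools' actual code.
fn merge_by_fold<T, B, I, J, F, Cmp>(
    mut left: I,
    mut right: J,
    cmp: Cmp,
    init: B,
    mut f: F,
) -> B
where
    I: Iterator<Item = T>,
    J: Iterator<Item = T>,
    Cmp: Fn(&T, &T) -> bool, // true => emit the left item first
    F: FnMut(B, T) -> B,
{
    let mut acc = init;
    let mut l = left.next();
    let mut r = right.next();
    loop {
        match (l.take(), r.take()) {
            // Both sides buffered: compare, emit the winner, refill its slot.
            (Some(a), Some(b)) => {
                if cmp(&a, &b) {
                    acc = f(acc, a);
                    l = left.next();
                    r = Some(b);
                } else {
                    acc = f(acc, b);
                    l = Some(a);
                    r = right.next();
                }
            }
            // One side exhausted: drain the other with its own `fold`.
            (Some(a), None) => return left.fold(f(acc, a), f),
            (None, Some(b)) => return right.fold(f(acc, b), f),
            (None, None) => return acc,
        }
    }
}
```

A call like `merge_by_fold(a.into_iter(), b.into_iter(), |x, y| x <= y, Vec::new(), |mut v, x| { v.push(x); v })` then produces the merged sequence.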

Let me know if you want me to push the patch or leave it as is.

Philippe-Cholet commented 5 months ago

I wondered, so I checked out your branch, experimented, and benchmarked, but I still wonder.


I eventually noticed that the benchmark I wrote only uses zeros. Apart from being unrealistic (which can be fine), I suspect it repeatedly uses only one arm of the big `match` statement, which is more problematic. I changed the `vec![0; X]`s to `(0..X).collect_vec()`s in the four related benchmarks to exercise the different code branches more realistically. Please update this. https://github.com/rust-itertools/itertools/blob/f7d557d5c78c7bd8f65d43ffeaf65452d2da3f53/benches/specializations.rs#L583-L610
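
To see why all-zero inputs are a problem, here is a hedged std-only sketch (the helper name is hypothetical): with a `<=`-style comparator, equal elements always come from the left side, so the "take from the right while both sides still have items" arm never fires; distinct ascending ranges alternate between the arms instead.

```rust
// Hedged sketch (hypothetical helper, std only): count how often the
// "take from the right while both sides still have items" arm fires
// in a `<=`-merge of two sorted sequences.
fn right_wins_while_both(a: Vec<u32>, b: Vec<u32>) -> usize {
    let mut left = a.into_iter().peekable();
    let mut right = b.into_iter().peekable();
    let mut right_wins = 0;
    loop {
        // Copy the peeked heads so the match holds no borrows.
        match (left.peek().copied(), right.peek().copied()) {
            (Some(l), Some(r)) if l <= r => { left.next(); }
            (Some(_), Some(_)) => {
                right.next();
                right_wins += 1; // never reached when all elements are equal
            }
            (Some(_), None) => { left.next(); }
            (None, Some(_)) => { right.next(); }
            (None, None) => return right_wins,
        }
    }
}
```

For two `vec![0; 4]` inputs this arm is hit zero times, while two `0..4` ranges hit it on every other comparison.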


cargo bench --bench specializations "^merge.*/fold" -- --baseline merge
# Yes, this iterator implementation is used by 4 itertools adaptors.

New baseline (with `(0..X).collect_vec()`s):
merge/fold                   [3.2654 µs 3.2811 µs 3.3028 µs]
merge_by/fold                [4.6043 µs 4.6183 µs 4.6351 µs]
merge_join_by_ordering/fold  [2.4700 µs 2.4762 µs 2.4842 µs]
merge_join_by_bool/fold      [3.3230 µs 3.3309 µs 3.3393 µs]

Your big match statement:
merge/fold                   [1.2520 µs 1.2536 µs 1.2553 µs] [-61.495% -60.671% -59.636%]
merge_by/fold                [867.37 ns 868.89 ns 871.06 ns] [-81.308% -81.225% -81.142%]
merge_join_by_ordering/fold  [1.3648 µs 1.3709 µs 1.3816 µs] [-45.770% -45.250% -44.809%]
merge_join_by_bool/fold      [1.3348 µs 1.3394 µs 1.3443 µs] [-60.262% -60.076% -59.903%]

My shorter code:
merge/fold                   [1.6499 µs 1.6590 µs 1.6703 µs] [-49.632% -49.376% -49.111%]
merge_by/fold                [1.0078 µs 1.0100 µs 1.0125 µs] [-78.204% -78.098% -77.993%]
merge_join_by_ordering/fold  [1.5552 µs 1.5570 µs 1.5588 µs] [-38.101% -37.507% -37.015%]
merge_join_by_bool/fold      [1.3369 µs 1.3386 µs 1.3406 µs] [-60.141% -59.983% -59.840%]

The thing that bothers me is that our benchmark is still quite basic: the adapted iterators are iterated slices, and the function `f` is a simple integer comparison. I tended to think that avoiding duplication of the `f`, `left.next`, and `right.next` calls might be beneficial in the general case, but maybe I'm just plain wrong.

I would like a more experienced opinion here. @phimuemue What do you think?

kinto-b commented 5 months ago

@phimuemue Yep, good call. The speed-up is pretty large (n.b. the timings below use @Philippe-Cholet's modification). Happy to submit a separate PR removing these specializations, or to add it to this one. Whatever y'all prefer.

.count()

$ cargo bench --bench specializations "^merge.*/count"

merge/count             time:   [468.60 ns 474.97 ns 483.54 ns]
                        change: [-72.222% -71.218% -70.192%] (p = 0.00 < 0.05)
                        Performance has improved.

merge_by/count          time:   [687.13 ns 691.60 ns 697.88 ns]
                        change: [-59.070% -58.742% -58.424%] (p = 0.00 < 0.05)
                        Performance has improved.

merge_join_by_ordering/count
                        time:   [1.2102 µs 1.2263 µs 1.2487 µs]
                        change: [+6.6034% +8.7574% +11.219%] (p = 0.00 < 0.05)
                        Performance has regressed.

merge_join_by_bool/count
                        time:   [696.11 ns 710.89 ns 729.69 ns]
                        change: [-59.560% -58.013% -56.655%] (p = 0.00 < 0.05)
                        Performance has improved.

.last()

$ cargo bench --bench specializations "^merge.*/last"

merge/last              time:   [298.07 ns 304.83 ns 312.70 ns]
                        change: [-87.829% -87.624% -87.408%] (p = 0.00 < 0.05)
                        Performance has improved.

merge_by/last           time:   [291.02 ns 292.75 ns 294.93 ns]
                        change: [-87.791% -87.594% -87.380%] (p = 0.00 < 0.05)
                        Performance has improved.

merge_join_by_ordering/last
                        time:   [1.1109 µs 1.1148 µs 1.1199 µs]
                        change: [-30.084% -28.883% -27.828%] (p = 0.00 < 0.05)
                        Performance has improved.

merge_join_by_bool/last time:   [573.23 ns 576.44 ns 580.60 ns]
                        change: [-70.422% -70.058% -69.739%] (p = 0.00 < 0.05)
                        Performance has improved.

Philippe-Cholet commented 5 months ago

~~Honestly, I'm not sure what the point is of getting the last or count of "merge two iterators": the count is basically `iter1.count() + iter2.count()`, and last is either `iter1.last()` or `iter2.last()` depending on how they're merged... except that such a change alters the consumption order, which matters (only) for iterators with side effects.~~ (EDIT: No, I'm dumb: if items are `EitherOrBoth<L, R>`, then the count can be less than `count1 + count2`.) It pains me, but the `some_iter.into_parts().1.count()` part can be faster than with `fold` if the underlying iterators specialize `count`/`last`. Still, I'm tempted to remove these two specializations anyway once `fold` is specialized. It should be good enough.
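
The `EitherOrBoth` point can be illustrated with a small std-only sketch (hypothetical helper, not itertools code): when equal keys pair up into a single `Both`-like item, the output count is strictly less than the sum of the input counts.

```rust
// Hedged sketch (hypothetical helper): count the items produced by an
// ordered join over two sorted slices in which equal keys collapse into
// a single `Both`-like item, as `merge_join_by` does with `EitherOrBoth`.
fn join_count(a: &[u32], b: &[u32]) -> usize {
    let (mut i, mut j, mut n) = (0, 0, 0);
    while i < a.len() && j < b.len() {
        use std::cmp::Ordering::*;
        match a[i].cmp(&b[j]) {
            Less => i += 1,              // Left-only item
            Greater => j += 1,           // Right-only item
            Equal => { i += 1; j += 1; } // two inputs collapse into one Both
        }
        n += 1;
    }
    // Whatever remains on either side is emitted one-to-one.
    n + (a.len() - i) + (b.len() - j)
}
```

For `[1, 2, 3]` joined with `[2, 3, 4]`, a plain merge would yield `3 + 3 = 6` items, but the join yields only 4 (a Left, two Boths, a Right), so the count can indeed be less than `count1 + count2`.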

I imagine this PR in 4 commits:

  1. Update MergeBy-related benchmarks. I think `(0..X).collect_vec()`s are good enough because merging two of them interleaves them rather than just joining them. (Then the benchmark baseline is here.)
  2. Specialize `MergeBy::fold`. We need to decide, so let's arbitrarily go with the verbose big match with 4 arms (one unreachable) and without wildcards. It can be changed later if there is evidence my version is better in more complicated situations. The problem with `.or_else` is that it's tied to the `(Option, Option, ..)` returned by `OrderingOrBool::merge`.
  3. Remove `MergeBy::{count,last}` or not?! @phimuemue (EDIT: yes, remove)
  4. Try to make `OrderingOrBool::merge` return `(Option<Either<L, R>>, ..)` to discard the unreachable case. If you struggle too much with this, I might do it in a next PR.
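
The type change in step 4 can be sketched as follows (illustrative free functions, not the actual `OrderingOrBool` trait; a local `Either` stands in for `either::Either`): a single `Option<Either<L, R>>` makes the "both items put back" state unrepresentable, so callers no longer need an unreachable match arm.

```rust
// Hedged sketch of the step-4 refactor; names are illustrative.
enum Either<L, R> {
    Left(L),
    Right(R),
}

// Before (sketch): the decision returns the item to emit plus up to one
// unconsumed item per side, but (Some, Some) is representable yet unreachable.
fn decide_tuple(l: i32, r: i32) -> (Option<i32>, Option<i32>, i32) {
    if l <= r {
        (None, Some(r), l) // emit the left item, put the right one back
    } else {
        (Some(l), None, r) // emit the right item, put the left one back
    }
}

// After (sketch): exactly zero or one item is put back, never two,
// and the type system enforces it.
fn decide_either(l: i32, r: i32) -> (Option<Either<i32, i32>>, i32) {
    if l <= r {
        (Some(Either::Right(r)), l)
    } else {
        (Some(Either::Left(l)), r)
    }
}
```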
phimuemue commented 5 months ago

Sounds like a plan, @Philippe-Cholet.

> Remove `MergeBy::{count,last}` or not?! @phimuemue

Let's remove them and leave a comment noting that we could further specialize `count` for the respective variants.

kinto-b commented 5 months ago

Easy, I'll sort it out tomorrow. Thanks both :)

kinto-b commented 5 months ago

Ok, sorted. Simplifying `OrderingOrBool::merge` didn't impact `fold` performance.