Closed: kinto-b closed this pull request 7 months ago.
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 94.42%. Comparing base (6814180) to head (ca57605). Report is 55 commits behind head on master.
:umbrella: View full report in Codecov by Sentry.
Oh, saving the baseline is so useful, thanks for pointing that out! Compared to master, with the change above it actually gets a little slower:
```
$ cargo bench --bench specializations merge_by/fold -- --baseline merge_by
merge_by/fold           time:   [1.0928 µs 1.1066 µs 1.1247 µs]
                        change: [-65.005% -64.657% -64.310%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) high mild
  7 (7.00%) high severe
```
I've noticed in the past that `match` can often be much faster than comparable expressions, so perhaps the compiler has an easier time with the more verbose code.
Let me know if you want me to push the patch or leave it as is.
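For readers following along, the "verbose big match" shape under discussion can be sketched like this: a hand-rolled merge-by-ordering fold over two sorted integer iterators, with one `match` over both lookahead slots per step. This is only an illustrative std-only sketch of the technique, not itertools' actual `MergeBy::fold` code:

```rust
// Sketch of folding a merge of two sorted iterators with one big `match`
// over both buffered heads. Each arm has its own `f`/`next` call site,
// which is the duplication discussed below.
fn merge_fold<I, J, B, F>(mut left: I, mut right: J, init: B, mut f: F) -> B
where
    I: Iterator<Item = i32>,
    J: Iterator<Item = i32>,
    F: FnMut(B, i32) -> B,
{
    let mut acc = init;
    let mut l = left.next();
    let mut r = right.next();
    loop {
        match (l, r) {
            (Some(a), Some(b)) if a <= b => {
                acc = f(acc, a);
                l = left.next();
            }
            (Some(_), Some(b)) => {
                acc = f(acc, b);
                r = right.next();
            }
            // One side is exhausted: emit the buffered item, then let the
            // other side's own (possibly specialized) `fold` finish the job.
            (Some(a), None) => {
                acc = f(acc, a);
                return left.fold(acc, f);
            }
            (None, Some(b)) => {
                acc = f(acc, b);
                return right.fold(acc, f);
            }
            (None, None) => return acc,
        }
    }
}

fn main() {
    let merged = merge_fold([1, 3, 5].into_iter(), [2, 4].into_iter(), Vec::new(), |mut v, x| {
        v.push(x);
        v
    });
    assert_eq!(merged, vec![1, 2, 3, 4, 5]);
}
```

Delegating to the inner iterators' `fold` once one side is exhausted is what lets the remaining tail be consumed without per-item branching.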
I wondered, so I checked out your branch, experimented, and benchmarked, but I still wonder.
I eventually noticed that the benchmark I wrote only has zeros. Apart from being unrealistic (which can be fine), I guess it repeatedly uses only one arm of the big match statement, which is more problematic.
I changed the `vec![0; X]`s to `(0..X).collect_vec()`s in the four related benchmarks to get more realistic use of the different code branches.
Please update this.
https://github.com/rust-itertools/itertools/blob/f7d557d5c78c7bd8f65d43ffeaf65452d2da3f53/benches/specializations.rs#L583-L610
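The one-arm problem with all-zero inputs is easy to demonstrate: on a tie the merge takes the left item and advances only the left side, so with `vec![0; X]` the same arm fires every time, while distinct values make the two arms alternate. A std-only sketch (mimicking the merge's branch choice, not itertools' code):

```rust
// Count how often each branch of a left-biased merge is taken, to show why
// all-zero benchmark inputs exercise only one arm of the match.
fn branch_counts(
    mut l: impl Iterator<Item = i32>,
    mut r: impl Iterator<Item = i32>,
) -> (u32, u32) {
    let (mut took_left, mut took_right) = (0, 0);
    let (mut a, mut b) = (l.next(), r.next());
    while let (Some(x), Some(y)) = (a, b) {
        if x <= y {
            // Ties land here, advancing only the left side.
            took_left += 1;
            a = l.next();
        } else {
            took_right += 1;
            b = r.next();
        }
    }
    (took_left, took_right)
}

fn main() {
    // All zeros: every comparison is a tie, so only the left arm fires.
    assert_eq!(branch_counts(std::iter::repeat(0).take(4), std::iter::repeat(0).take(4)), (4, 0));
    // Distinct values (`0..X` on both sides): the arms alternate.
    assert_eq!(branch_counts(0..4, 0..4), (4, 3));
}
```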
```
cargo bench --bench specializations "^merge.*/fold" -- --baseline merge
```
(Yes, this iterator implementation is used by 4 itertools adaptors.)
New baseline (with `(0..X).collect_vec()`s):
```
merge/fold                   [3.2654 µs 3.2811 µs 3.3028 µs]
merge_by/fold                [4.6043 µs 4.6183 µs 4.6351 µs]
merge_join_by_ordering/fold  [2.4700 µs 2.4762 µs 2.4842 µs]
merge_join_by_bool/fold      [3.3230 µs 3.3309 µs 3.3393 µs]
```
Your big match statement:
```
merge/fold                   [1.2520 µs 1.2536 µs 1.2553 µs]  [-61.495% -60.671% -59.636%]
merge_by/fold                [867.37 ns 868.89 ns 871.06 ns]  [-81.308% -81.225% -81.142%]
merge_join_by_ordering/fold  [1.3648 µs 1.3709 µs 1.3816 µs]  [-45.770% -45.250% -44.809%]
merge_join_by_bool/fold      [1.3348 µs 1.3394 µs 1.3443 µs]  [-60.262% -60.076% -59.903%]
```
My shorter code:
```
merge/fold                   [1.6499 µs 1.6590 µs 1.6703 µs]  [-49.632% -49.376% -49.111%]
merge_by/fold                [1.0078 µs 1.0100 µs 1.0125 µs]  [-78.204% -78.098% -77.993%]
merge_join_by_ordering/fold  [1.5552 µs 1.5570 µs 1.5588 µs]  [-38.101% -37.507% -37.015%]
merge_join_by_bool/fold      [1.3369 µs 1.3386 µs 1.3406 µs]  [-60.141% -59.983% -59.840%]
```
The thing that bothers me is that our benchmark is still quite basic: the adapted iterators are iterated slices, and the function `f` is a simple integer comparison. I tended to think that avoiding duplication of the `f`, `left.next`, and `right.next` calls might be beneficial in the general case, but maybe I'm just plain wrong.
I would like a more experienced opinion here. @phimuemue What do you think?
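To make the trade-off concrete: the "shorter" shape avoids the duplication by deciding which side wins first and keeping a single `f` call site. A hand-rolled sketch of that shape (my own illustration, not either of the actual candidate implementations):

```rust
// Sketch of the factored shape: one decision point picks the next item,
// then `f` is applied at exactly one call site per loop iteration.
fn merge_fold_short<I, J, B, F>(mut left: I, mut right: J, init: B, mut f: F) -> B
where
    I: Iterator<Item = i32>,
    J: Iterator<Item = i32>,
    F: FnMut(B, i32) -> B,
{
    let mut acc = init;
    let (mut a, mut b) = (left.next(), right.next());
    loop {
        // Single decision point: which buffered item is emitted next?
        let item = match (a, b) {
            (Some(x), Some(y)) if x <= y => { a = left.next(); x }
            (_, Some(y)) => { b = right.next(); y }
            (Some(x), None) => { a = left.next(); x }
            (None, None) => return acc,
        };
        acc = f(acc, item); // `f` appears exactly once
    }
}

fn main() {
    let sum = merge_fold_short([1, 3].into_iter(), [2, 4].into_iter(), 0, |s, x| s + x);
    assert_eq!(sum, 10);
}
```

The open question above is whether the compiler optimizes this single-call-site loop as well as the big match, whose duplicated arms give it more specialized straight-line code to work with.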
@phimuemue Yep, good call. The speed-up is pretty large (n.b. timings use @Philippe-Cholet's modification). Happy to submit a separate PR removing these specialisations or to add it to this one, whichever y'all prefer.
`.count()`:
```
$ cargo bench --bench specializations "^merge.*/count"
merge/count             time:   [468.60 ns 474.97 ns 483.54 ns]
                        change: [-72.222% -71.218% -70.192%] (p = 0.00 < 0.05)
                        Performance has improved.
merge_by/count          time:   [687.13 ns 691.60 ns 697.88 ns]
                        change: [-59.070% -58.742% -58.424%] (p = 0.00 < 0.05)
                        Performance has improved.
merge_join_by_ordering/count
                        time:   [1.2102 µs 1.2263 µs 1.2487 µs]
                        change: [+6.6034% +8.7574% +11.219%] (p = 0.00 < 0.05)
                        Performance has regressed.
merge_join_by_bool/count
                        time:   [696.11 ns 710.89 ns 729.69 ns]
                        change: [-59.560% -58.013% -56.655%] (p = 0.00 < 0.05)
                        Performance has improved.
```
`.last()`:
```
$ cargo bench --bench specializations "^merge.*/last"
merge/last              time:   [298.07 ns 304.83 ns 312.70 ns]
                        change: [-87.829% -87.624% -87.408%] (p = 0.00 < 0.05)
                        Performance has improved.
merge_by/last           time:   [291.02 ns 292.75 ns 294.93 ns]
                        change: [-87.791% -87.594% -87.380%] (p = 0.00 < 0.05)
                        Performance has improved.
merge_join_by_ordering/last
                        time:   [1.1109 µs 1.1148 µs 1.1199 µs]
                        change: [-30.084% -28.883% -27.828%] (p = 0.00 < 0.05)
                        Performance has improved.
merge_join_by_bool/last time:   [573.23 ns 576.44 ns 580.60 ns]
                        change: [-70.422% -70.058% -69.739%] (p = 0.00 < 0.05)
                        Performance has improved.
```
~Honestly, I'm not sure what the point is of getting the last or count of "merge two iterators": the count is basically `iter1.count() + iter2.count()`, and last is either `iter1.last()` or `iter2.last()` depending on how it's merged... except that such a thing changes the consumption order, which matters (only) for iterators with side effects.~ (EDIT: No, I'm dumb: if items are `EitherOrBoth<L, R>`, then the count can be less than `count1 + count2`.)
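The EDIT is easy to verify by hand: when two equal keys collapse into a single `Both` item, the merged stream is shorter than the two inputs combined. A std-only sketch mimicking the pairing behaviour of a `merge_join_by`-style adaptor (not itertools' own implementation):

```rust
// Count items produced by an ordering-based merge-join of two sorted
// slices, where equal keys collapse into a single `Both` item.
fn merge_join_count(left: &[i32], right: &[i32]) -> usize {
    let (mut i, mut j, mut count) = (0, 0, 0);
    while i < left.len() && j < right.len() {
        match left[i].cmp(&right[j]) {
            std::cmp::Ordering::Less => i += 1,               // Left(l)
            std::cmp::Ordering::Greater => j += 1,            // Right(r)
            std::cmp::Ordering::Equal => { i += 1; j += 1; }  // Both(l, r): one item from each side
        }
        count += 1;
    }
    // Remaining items on either side each yield one Left/Right item.
    count + (left.len() - i) + (right.len() - j)
}

fn main() {
    // 2 and 3 each pair into a Both item, so the count is 4, not 3 + 3.
    assert_eq!(merge_join_count(&[1, 2, 3], &[2, 3, 4]), 4);
}
```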
It pains me, but the `some_iter.into_parts().1.count()` part can be faster than with `fold` if the underlying iterators specialize `count`/`last`. But I'm tempted anyway to remove these two specializations, with `fold` being specialized. It should be good enough.
I imagine this PR in 4 commits:
1. `MergeBy`-related benchmarks: I think `(0..X).collect_vec()`s are good enough, because merging two of them interleaves them rather than just joining them. (Then the benchmark baseline is here.)
2. `MergeBy::fold`: We need to decide, so let's arbitrarily go with the verbose big match with 4 arms (one unreachable) and without wildcards. It can be changed later if there is evidence my version is better in more complicated situations. The problem with `.or_else` is that it's tied to the `(Option, Option, ..)` returned by `OrderingOrBool::merge`.
3. Remove `MergeBy::{count,last}` or not?! @phimuemue (EDIT: yes, remove)
4. Make "`OrderingOrBool::merge` returns `(Option<Either<L, R>>, ..)`" to discard the unreachable case.

If you struggle too much with this, I might do it in a next PR.

Sounds like a plan, @Philippe-Cholet.
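The type change proposed in the last step of the plan can be sketched as a before/after of the decision type (hypothetical free functions over `i32`, not the actual `OrderingOrBool` trait):

```rust
// Returning `Option<Either<L, R>>`-style data makes "both slots empty"
// unrepresentable by construction, removing the unreachable match arm.
#[derive(Debug, PartialEq)]
enum Either<L, R> {
    Left(L),
    Right(R),
}

// Before (sketch): two independent options also admit (None, None),
// which every caller must handle as an unreachable case.
fn merge_old(l: i32, r: i32) -> (Option<i32>, Option<i32>) {
    if l <= r { (Some(l), None) } else { (None, Some(r)) }
}

// After (sketch): exactly one side is returned, so the caller's match
// has no dead arm.
fn merge_new(l: i32, r: i32) -> Either<i32, i32> {
    if l <= r { Either::Left(l) } else { Either::Right(r) }
}

fn main() {
    assert_eq!(merge_old(1, 2), (Some(1), None));
    assert_eq!(merge_new(3, 2), Either::Right(2));
}
```

Encoding invariants in the return type like this is a common Rust refactor: the compiler then enforces what the old code could only assert.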
> Remove `MergeBy::{count,last}` or not?! @phimuemue

Let's remove it and comment that we could further specialize `count` for the respective variants.
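The further specialization hinted at above could look like this for a plain merge, where every input item appears exactly once in the output (an illustrative sketch with a hypothetical `Merge` struct, not the itertools type):

```rust
// For a plain merge (no pairing of equal keys), every input item is
// emitted exactly once, so `count` can skip the comparisons entirely
// and delegate to the sources' own (possibly specialized) `count`.
struct Merge<I, J> {
    left: I,
    right: J,
}

impl<I, J> Merge<I, J>
where
    I: Iterator,
    J: Iterator<Item = I::Item>,
{
    fn count(self) -> usize {
        self.left.count() + self.right.count()
    }
}

fn main() {
    let m = Merge { left: [1, 3, 5].into_iter(), right: [2, 4].into_iter() };
    assert_eq!(m.count(), 5);
}
```

Note this shortcut is only valid for the variants that never collapse two items into one; as discussed above, `EitherOrBoth`-producing variants can yield fewer than `count1 + count2` items.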
Easy, I'll sort it out tomorrow. Thanks both :)
Ok, sorted. Simplifying `OrderingOrBool::merge` didn't impact `fold` performance.
Relates to #755