rust-itertools / itertools

Extra iterator adaptors, iterator methods, free functions, and macros.
https://docs.rs/itertools/
Apache License 2.0

Benchmark `.nth[_back](n)` with inputs `n` #916

Closed Philippe-Cholet closed 5 months ago

Philippe-Cholet commented 5 months ago

I introduced specialization benchmarks in #786. However, running the Combinations::nth benchmarks in #914 did not give meaningful feedback: the change was globally faster, but we could not say by how much for each n. So here I change the benchmarks to report information based on the value n we feed nth with.
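For context, this is roughly what "benchmarks with inputs" means here. A minimal criterion sketch, where the group name, the 0..150 source, and the set of n values are illustrative rather than the crate's actual bench_specializations setup:

```rust
use criterion::{black_box, criterion_group, criterion_main, BenchmarkId, Criterion};
use itertools::Itertools;

// Illustrative only: one benchmark per value of `n`, so criterion can
// report how `nth(n)` scales with `n` instead of a single global number.
fn bench_combinations_nth(c: &mut Criterion) {
    let mut group = c.benchmark_group("combinations2/nth");
    for n in [0usize, 1, 2, 4, 8] {
        group.bench_with_input(BenchmarkId::from_parameter(n), &n, |b, &n| {
            b.iter(|| {
                // Exhaust the iterator with repeated `nth(n)` calls.
                let mut it = (0..150).combinations(2);
                while let Some(x) = it.nth(n) {
                    black_box(x);
                }
            })
        });
    }
    group.finish();
}

criterion_group!(benches, bench_combinations_nth);
criterion_main!(benches);
```

Criterion then reports one timing per n under combinations2/nth/&lt;n&gt;, which is the naming visible in the results below.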

codecov[bot] commented 5 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 94.43%. Comparing base (6814180) to head (d8f2e47). Report is 51 commits behind head on master.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master     #916      +/-   ##
==========================================
+ Coverage   94.38%   94.43%   +0.05%
==========================================
  Files          48       48
  Lines        6665     6869     +204
==========================================
+ Hits         6291     6487     +196
- Misses        374      382       +8
```


Philippe-Cholet commented 5 months ago

@kinto-b Each Combinations::nth(n) call used to produce n+1 heap-allocated vectors, but now produces only 1. So the relative change in allocations is 1/(n+1) - 1 (shown as percentages in the group headers below).
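As a quick sanity check (a throwaway snippet, not part of the PR), the formula reproduces those group headers:

```rust
fn main() {
    // Relative change in allocations when `nth(n)` allocates 1 vector
    // instead of n + 1: 1/(n + 1) - 1.
    for n in [0u32, 1, 2, 4, 8] {
        let change = 1.0 / (n as f64 + 1.0) - 1.0;
        // Prints +0.0%, -50.0%, -66.7%, -80.0%, -88.9%.
        println!("n = {n}: {:+.1}%", 100.0 * change);
    }
}
```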

```
                       before                           after                            changes
Same allocations for:
combinations1/nth/0    [1.2090 ms 1.2121 ms 1.2157 ms]  [1.1560 ms 1.1599 ms 1.1646 ms]  [-5.1082% -4.5026% -3.7919%]
combinations2/nth/0    [1.1405 ms 1.1447 ms 1.1496 ms]  [1.1473 ms 1.1494 ms 1.1519 ms]  [+0.5059% +0.9210% +1.3303%]
combinations3/nth/0    [1.1819 ms 1.1868 ms 1.1919 ms]  [1.1895 ms 1.1907 ms 1.1919 ms]  [-2.0142% -1.0343% -0.1986%]
combinations4/nth/0    [1.3242 ms 1.3395 ms 1.3561 ms]  [1.2893 ms 1.2914 ms 1.2945 ms]  [-5.7538% -4.5556% -3.4696%]
-50% allocations for:
combinations1/nth/1    [1.2196 ms 1.2239 ms 1.2284 ms]  [639.53 µs 643.26 µs 647.97 µs]  [-49.143% -48.354% -47.710%]
combinations2/nth/1    [1.1377 ms 1.1428 ms 1.1480 ms]  [600.34 µs 601.93 µs 603.75 µs]  [-47.541% -47.277% -47.008%]
combinations3/nth/1    [1.1770 ms 1.1817 ms 1.1876 ms]  [641.31 µs 642.99 µs 644.99 µs]  [-47.351% -46.549% -45.859%]
combinations4/nth/1    [1.2842 ms 1.2854 ms 1.2869 ms]  [720.70 µs 723.56 µs 726.84 µs]  [-43.637% -43.370% -43.065%]
-67% allocations for:
combinations1/nth/2    [1.2684 ms 1.2740 ms 1.2792 ms]  [463.04 µs 463.50 µs 463.98 µs]  [-63.556% -63.375% -63.200%]
combinations2/nth/2    [1.2179 ms 1.2229 ms 1.2279 ms]  [422.52 µs 423.28 µs 424.16 µs]  [-67.348% -66.251% -65.402%]
combinations3/nth/2    [1.2612 ms 1.2660 ms 1.2712 ms]  [450.01 µs 450.41 µs 450.84 µs]  [-64.412% -64.252% -64.098%]
combinations4/nth/2    [1.3132 ms 1.3188 ms 1.3237 ms]  [499.48 µs 500.65 µs 502.05 µs]  [-62.071% -61.902% -61.722%]
-80% allocations for:
combinations1/nth/4    [1.2246 ms 1.2296 ms 1.2363 ms]  [317.72 µs 318.69 µs 319.68 µs]  [-75.067% -74.630% -74.296%]
combinations2/nth/4    [1.1881 ms 1.1911 ms 1.1947 ms]  [276.03 µs 276.49 µs 277.07 µs]  [-76.927% -76.629% -76.256%]
combinations3/nth/4    [1.2288 ms 1.2343 ms 1.2416 ms]  [306.15 µs 306.67 µs 307.22 µs]  [-75.649% -75.488% -75.343%]
combinations4/nth/4    [1.2963 ms 1.3016 ms 1.3073 ms]  [335.58 µs 337.01 µs 339.08 µs]  [-74.351% -73.952% -73.533%]
-89% allocations for:
combinations1/nth/8    [1.2747 ms 1.2808 ms 1.2862 ms]  [209.95 µs 210.26 µs 210.60 µs]  [-83.432% -83.218% -82.982%]
combinations2/nth/8    [1.1510 ms 1.1534 ms 1.1559 ms]  [191.80 µs 192.31 µs 192.92 µs]  [-83.572% -83.443% -83.343%]
combinations3/nth/8    [1.2398 ms 1.2568 ms 1.2758 ms]  [204.40 µs 204.66 µs 204.96 µs]  [-83.773% -83.538% -83.274%]
combinations4/nth/8    [1.2997 ms 1.3029 ms 1.3062 ms]  [224.81 µs 225.83 µs 227.24 µs]  [-82.748% -82.660% -82.569%]
```

Benchmark changes closely follow allocation changes. Nothing surprising here (heap allocations are the bottleneck), but it's definitely a better description than the global -65% the previous benchmark gave us.

Philippe-Cholet commented 5 months ago

@phimuemue I'm not sure about start.

Building (not running) those benchmarks was already slow (~2 minutes) and is probably a bit slower now. So slow 😁 that I usually comment out the iterators I don't want to benchmark before running the command; then the build is fast. If we move start from the benchmark body to the inputs (where n is now), we end up with even more benchmarks (slower to build and to run). I'm not sure considering start is meaningful at all, inside or outside. It felt like a good idea at the time, but is it? Some iterators such as combinations lazily load things at the start, but is that worth benchmarking those starts?
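For what it's worth, a hypothetical sketch of the alternative being weighed: moving start into the criterion inputs registers one benchmark per (start, n) pair, so the number of benchmarks (and the build and run time) multiplies. All names and values here are illustrative:

```rust
use criterion::{black_box, criterion_group, criterion_main, BenchmarkId, Criterion};
use itertools::Itertools;

// Illustrative only: a (start, n) grid registers |starts| * |ns| benchmarks.
fn bench_nth_with_start_inputs(c: &mut Criterion) {
    let mut group = c.benchmark_group("combinations2/nth");
    for start in [0usize, 1, 4] {
        for n in [0usize, 1, 2, 4, 8] {
            let id = BenchmarkId::new(format!("start{start}"), n);
            group.bench_with_input(id, &(start, n), |b, &(start, n)| {
                b.iter(|| {
                    let mut it = (0..150).combinations(2);
                    // Advance past the first `start` items before measuring `nth(n)`.
                    if start > 0 {
                        black_box(it.nth(start - 1));
                    }
                    while let Some(x) = it.nth(n) {
                        black_box(x);
                    }
                })
            });
        }
    }
    group.finish();
}

criterion_group!(benches, bench_nth_with_start_inputs);
criterion_main!(benches);
```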

phimuemue commented 5 months ago

> Building (not running) those benchmarks was already slow (~2 minutes) and is probably a bit slower now.

Some time ago I tried changing bench_specializations so that it accepts only one benchmark. On my machine, this cut compile times by 40%.

Philippe-Cholet commented 5 months ago

I found a macro trick to select a subset of the benchmarks (by prefixing their names with a lifetime, a dirty trick) and reduce build times. But it complicates the macro so much that I would rather occasionally and temporarily comment out parts of the declaration than maintain such a trick in the future.

@phimuemue What do you think about those starts? Should I delete them or do we merge this as is?

phimuemue commented 5 months ago

I would leave it in. I think the reason is that iterators may behave differently after they have been advanced several times (e.g. if they are similar to Chain).
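For example, std's Chain is position-sensitive: nth takes a different path depending on whether the first half is already exhausted, so measuring only from a fresh iterator would miss the later phases. A small standalone illustration:

```rust
fn main() {
    // `Chain::nth` behaves differently depending on whether the first
    // iterator is already exhausted, so timings depend on how far the
    // iterator was advanced before the call.
    let mut it = (0..3).chain(10..13); // yields 0, 1, 2, 10, 11, 12
    assert_eq!(it.nth(1), Some(1)); // skips 0; still in the first range
    assert_eq!(it.nth(1), Some(10)); // skips 2; crosses into the second range
    assert_eq!(it.nth(1), Some(12)); // skips 11; entirely in the second range
    assert_eq!(it.nth(1), None); // exhausted
}
```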

We can remove it later once we are sure that it's not needed.

kinto-b commented 5 months ago

@Philippe-Cholet Ah thanks for sharing that. I am a bit confused by the fact that the 'after' times get smaller and smaller for larger and larger n. I'd have expected .nth(8) to be slower than .nth(1) no matter what. Or have I misunderstood what the times represent?

Philippe-Cholet commented 5 months ago

@kinto-b The benchmark roughly measures while let Some(_x) = it.nth(n) {} (plus some start shenanigans you can ignore). With larger n, fewer vectors are created by nth(n) over the entire iteration: roughly iterator_length / (n + 1). Said differently, .nth(0) creates every vector, .nth(1) creates half of them, .nth(8) creates one ninth of them. Fewer vectors created leads to smaller times.
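To make that concrete, here is a standalone sketch (the 0..150 source is my choice) counting how many vectors a full nth(n) loop yields:

```rust
use itertools::Itertools;

fn main() {
    // C(150, 2) = 11175 combinations in total.
    let length = (0..150).combinations(2).count();
    for n in [0usize, 1, 2, 4, 8] {
        let mut it = (0..150).combinations(2);
        // Each successful `nth(n)` consumes n + 1 items and returns one,
        // so a full traversal returns floor(length / (n + 1)) items.
        let mut returned = 0;
        while let Some(_x) = it.nth(n) {
            returned += 1;
        }
        assert_eq!(returned, length / (n + 1));
        println!("n = {n}: {returned} vectors created");
    }
}
```

With one heap allocation per returned vector, those counts track the -50%/-67%/-80%/-89% groups in the table above.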