Adding unrelated code affects benchmark

ChristopherBiscardi commented 7 months ago

Heya! I've been doing advent-of-code and trying out Divan while doing so. Someone showed me some code and I added it to my project to benchmark. Simply adding the code to lib.rs caused one of my benchmarks to rise by 100 microseconds. Commenting or uncommenting this one line (which is not used in the benchmark) in lib.rs causes the part2_nom benchmark to inflate or deflate accordingly.

// pub mod part2_subject;

I've pared the example down as much as I could at the moment. I can try to make a smaller one if I have more time.

https://github.com/ChristopherBiscardi/advent-of-code/tree/divan-benchmark-inflation

This should be enough to run the benchmarks:

cargo bench -p day-01

benchmarks with subject module

day_01                 fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ part1               35.7 µs       │ 93.41 µs      │ 36.14 µs      │ 37.26 µs      │ 100     │ 100
├─ part2               134.3 µs      │ 163.5 µs      │ 134.7 µs      │ 137.1 µs      │ 100     │ 100
├─ part2_aho_corasick  26.7 ms       │ 37.09 ms      │ 27.08 ms      │ 27.31 ms      │ 100     │ 100
├─ part2_nom           467.4 µs      │ 568.4 µs      │ 472.3 µs      │ 480.6 µs      │ 100     │ 100
╰─ part2_subject       135.7 µs      │ 183.7 µs      │ 136.2 µs      │ 138.5 µs      │ 100     │ 100

benchmarks without subject module

day_01                 fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ part1               33.58 µs      │ 99.66 µs      │ 33.99 µs      │ 35.05 µs      │ 100     │ 100
├─ part2               134.9 µs      │ 222.4 µs      │ 139.5 µs      │ 141.3 µs      │ 100     │ 100
├─ part2_aho_corasick  27.6 ms       │ 29.34 ms      │ 28.48 ms      │ 28.41 ms      │ 100     │ 100
╰─ part2_nom           361.2 µs      │ 455.5 µs      │ 375.9 µs      │ 378.3 µs      │ 100     │ 100

Here's each run side-by-side for ease of viewing

├─ part2_nom           467.4 µs      │ 568.4 µs      │ 472.3 µs      │ 480.6 µs      │ 100     │ 100
╰─ part2_nom           361.2 µs      │ 455.5 µs      │ 375.9 µs      │ 378.3 µs      │ 100     │ 100

nvzqz commented 7 months ago

Is the increase of 100µs consistent across multiple cargo bench invocations?

Also, I noticed the spike of downloads yesterday and found your AoC video. Thanks for trying it out! Currently in the growing pains phase. 😄

ChristopherBiscardi commented 7 months ago

The increase is consistent but I wasn't tracking the exact amount closely enough to say whether it was (for example) 100µs vs 130µs and such. It was not wildly fluctuating. At the very least I ran it multiple times in each configuration, in a row, one after the other, etc, etc. and the results seemed very consistent to me over many runs.

Also, I noticed the spike of downloads yesterday and found your AoC video. Thanks for trying it out! Currently in the growing pains phase. 😄

oh fun! glad I could point some people in your direction.

nvzqz commented 7 months ago

This might be a bit of work on your end, but could you please check if the same happens with Criterion?

polarathene commented 7 months ago

check if the same happens with Criterion?

I've encountered a similar experience and have provided a reproduction here with both divan and criterion consistently demonstrating a perf regression.

I've had some other users clone that and observe the same.

My reproduction is fairly simple, just copying a function that isn't used at all introduces a consistent overhead in timings measured on the actual benched function. I suspect it's due to compiler optimizations like this Criterion FAQ entry explains.

The reproduction I've linked identifies several changes that can be made to avoid the regression, so this sort of issue is unlikely anything to do with Divan (Criterion measures the same increase in timing).

ChristopherBiscardi commented 7 months ago

I've confirmed that the behavior does exist in divan and criterion. It is only for this one benchmark (part2_nom from 375 to 497). The other 3 benchmarks appear stable.

The relevant commit is: https://github.com/ChristopherBiscardi/advent-of-code/tree/26ae09c876602e405dc7943ef61764afd70d8838

Runs without part2_subject

run 1

``` Timer precision: 41 ns day_01 fastest │ slowest │ median │ mean │ samples │ iters ├─ part1 39.95 µs │ 98.29 µs │ 40.33 µs │ 41.24 µs │ 100 │ 100 ├─ part2 138.4 µs │ 174.5 µs │ 143.2 µs │ 143.6 µs │ 100 │ 100 ├─ part2_aho_corasick 27.87 ms │ 30.93 ms │ 28.52 ms │ 28.49 ms │ 100 │ 100 ╰─ part2_nom 365.5 µs │ 522.1 µs │ 376.2 µs │ 380.4 µs │ 100 │ 100 Running benches/benchmarks-criterion.rs (target/release/deps/day_01_bench_criterion-c9c1d8429cb6331d) part1/part1 time: [28.212 µs 28.595 µs 28.999 µs] Found 3 outliers among 100 measurements (3.00%) 2 (2.00%) high mild 1 (1.00%) high severe part2/part2 time: [142.92 µs 143.11 µs 143.31 µs] Found 11 outliers among 100 measurements (11.00%) 2 (2.00%) low severe 5 (5.00%) low mild 4 (4.00%) high mild part2/part2_nom time: [380.46 µs 380.83 µs 381.27 µs] Found 5 outliers among 100 measurements (5.00%) 3 (3.00%) low mild 2 (2.00%) high mild part2/part2_aho_corasick time: [28.350 ms 28.382 ms 28.413 ms] Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) low mild ```

run 2

``` Timer precision: 41 ns day_01 fastest │ slowest │ median │ mean │ samples │ iters ├─ part1 38.16 µs │ 61.91 µs │ 45.79 µs │ 44.69 µs │ 100 │ 100 ├─ part2 138.2 µs │ 178 µs │ 144.7 µs │ 144.4 µs │ 100 │ 100 ├─ part2_aho_corasick 27.86 ms │ 29.82 ms │ 28.43 ms │ 28.4 ms │ 100 │ 100 ╰─ part2_nom 359.7 µs │ 507.4 µs │ 374.2 µs │ 377.3 µs │ 100 │ 100 Running benches/benchmarks-criterion.rs (target/release/deps/day_01_bench_criterion-c9c1d8429cb6331d) part1/part1 time: [28.309 µs 28.927 µs 29.598 µs] change: [-3.7815% -1.2474% +1.3855%] (p = 0.35 > 0.05) No change in performance detected. Found 2 outliers among 100 measurements (2.00%) 2 (2.00%) high mild part2/part2 time: [143.46 µs 143.75 µs 144.06 µs] change: [-0.0243% +0.2199% +0.4712%] (p = 0.11 > 0.05) No change in performance detected. Found 7 outliers among 100 measurements (7.00%) 3 (3.00%) low mild 1 (1.00%) high mild 3 (3.00%) high severe part2/part2_nom time: [380.39 µs 383.26 µs 389.43 µs] change: [-0.3633% +1.0289% +4.5633%] (p = 0.63 > 0.05) No change in performance detected. Found 10 outliers among 100 measurements (10.00%) 9 (9.00%) low mild 1 (1.00%) high severe part2/part2_aho_corasick time: [28.401 ms 28.442 ms 28.483 ms] change: [+0.0227% +0.2104% +0.3871%] (p = 0.02 < 0.05) Change within noise threshold. ```

run 3

``` Timer precision: 41 ns day_01 fastest │ slowest │ median │ mean │ samples │ iters ├─ part1 40.83 µs │ 86.7 µs │ 44.54 µs │ 44.97 µs │ 100 │ 100 ├─ part2 138.7 µs │ 173.4 µs │ 144.3 µs │ 144.2 µs │ 100 │ 100 ├─ part2_aho_corasick 27.6 ms │ 28.76 ms │ 28.43 ms │ 28.37 ms │ 100 │ 100 ╰─ part2_nom 352.9 µs │ 466.9 µs │ 375.9 µs │ 377.4 µs │ 100 │ 100 Running benches/benchmarks-criterion.rs (target/release/deps/day_01_bench_criterion-c9c1d8429cb6331d) part1/part1 time: [27.161 µs 27.608 µs 28.093 µs] change: [-3.5336% -1.2345% +1.0798%] (p = 0.32 > 0.05) No change in performance detected. Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) high mild part2/part2 time: [143.06 µs 143.35 µs 143.69 µs] change: [-0.3984% -0.0981% +0.2096%] (p = 0.54 > 0.05) No change in performance detected. Found 4 outliers among 100 measurements (4.00%) 1 (1.00%) low mild 2 (2.00%) high mild 1 (1.00%) high severe part2/part2_nom time: [380.71 µs 382.71 µs 385.95 µs] change: [-3.1649% -0.2636% +1.6587%] (p = 0.88 > 0.05) No change in performance detected. Found 16 outliers among 100 measurements (16.00%) 3 (3.00%) low severe 4 (4.00%) low mild 2 (2.00%) high mild 7 (7.00%) high severe part2/part2_aho_corasick time: [28.366 ms 28.411 ms 28.465 ms] change: [-0.3222% -0.1083% +0.1199%] (p = 0.36 > 0.05) No change in performance detected. Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) high severe ```

with part2_subject module

run 1 (immediately after the above runs)

``` Timer precision: 41 ns day_01 fastest │ slowest │ median │ mean │ samples │ iters ├─ part1 27.79 µs │ 107.9 µs │ 30.87 µs │ 35.63 µs │ 100 │ 100 ├─ part2 144.7 µs │ 188 µs │ 151.7 µs │ 153.8 µs │ 100 │ 100 ├─ part2_aho_corasick 27.58 ms │ 35.72 ms │ 28.48 ms │ 28.71 ms │ 100 │ 100 ╰─ part2_nom 482.2 µs │ 583.9 µs │ 498.6 µs │ 500.6 µs │ 100 │ 100 Running benches/benchmarks-criterion.rs (target/release/deps/day_01_bench_criterion-c9c1d8429cb6331d) part1/part1 time: [28.547 µs 29.003 µs 29.489 µs] change: [+0.9037% +2.8745% +4.8163%] (p = 0.01 < 0.05) Change within noise threshold. Found 2 outliers among 100 measurements (2.00%) 2 (2.00%) high mild part2/part2 time: [142.86 µs 143.10 µs 143.35 µs] change: [-0.3534% -0.0867% +0.1954%] (p = 0.54 > 0.05) No change in performance detected. Found 3 outliers among 100 measurements (3.00%) 1 (1.00%) low mild 1 (1.00%) high mild 1 (1.00%) high severe part2/part2_nom time: [499.69 µs 500.21 µs 500.70 µs] change: [+28.932% +30.226% +31.214%] (p = 0.00 < 0.05) Performance has regressed. Found 4 outliers among 100 measurements (4.00%) 3 (3.00%) low severe 1 (1.00%) low mild part2/part2_aho_corasick time: [28.407 ms 28.752 ms 29.277 ms] change: [-0.0030% +1.1999% +3.2000%] (p = 0.13 > 0.05) No change in performance detected. Found 6 outliers among 100 measurements (6.00%) 2 (2.00%) low mild 4 (4.00%) high severe ```

run 2

``` Timer precision: 41 ns day_01 fastest │ slowest │ median │ mean │ samples │ iters ├─ part1 30.16 µs │ 62.62 µs │ 35.16 µs │ 34.57 µs │ 100 │ 100 ├─ part2 136.1 µs │ 169.4 µs │ 144.7 µs │ 143 µs │ 100 │ 100 ├─ part2_aho_corasick 27.24 ms │ 28.73 ms │ 28.41 ms │ 28.3 ms │ 100 │ 100 ╰─ part2_nom 476.4 µs │ 590.2 µs │ 497.7 µs │ 497.1 µs │ 100 │ 100 Running benches/benchmarks-criterion.rs (target/release/deps/day_01_bench_criterion-c9c1d8429cb6331d) part1/part1 time: [28.270 µs 28.877 µs 29.482 µs] change: [-0.4674% +1.7851% +4.1144%] (p = 0.15 > 0.05) No change in performance detected. Found 3 outliers among 100 measurements (3.00%) 3 (3.00%) high mild part2/part2 time: [142.71 µs 143.41 µs 144.72 µs] change: [-0.4805% +0.4224% +1.7963%] (p = 0.57 > 0.05) No change in performance detected. Found 8 outliers among 100 measurements (8.00%) 4 (4.00%) low mild 1 (1.00%) high mild 3 (3.00%) high severe part2/part2_nom time: [499.59 µs 500.15 µs 500.68 µs] change: [-0.2440% +0.0456% +0.3362%] (p = 0.76 > 0.05) No change in performance detected. Found 7 outliers among 100 measurements (7.00%) 3 (3.00%) low severe 3 (3.00%) low mild 1 (1.00%) high mild part2/part2_aho_corasick time: [28.454 ms 28.580 ms 28.727 ms] change: [-2.4332% -0.5968% +0.7186%] (p = 0.55 > 0.05) No change in performance detected. Found 12 outliers among 100 measurements (12.00%) 2 (2.00%) low mild 3 (3.00%) high mild 7 (7.00%) high severe ```

run 3

``` Timer precision: 41 ns day_01 fastest │ slowest │ median │ mean │ samples │ iters ├─ part1 28.12 µs │ 62.58 µs │ 28.43 µs │ 33.28 µs │ 100 │ 100 ├─ part2 139.3 µs │ 184.6 µs │ 145 µs │ 145.7 µs │ 100 │ 100 ├─ part2_aho_corasick 27.9 ms │ 28.93 ms │ 28.44 ms │ 28.39 ms │ 100 │ 100 ╰─ part2_nom 477 µs │ 574.4 µs │ 497.4 µs │ 494.9 µs │ 100 │ 100 Running benches/benchmarks-criterion.rs (target/release/deps/day_01_bench_criterion-c9c1d8429cb6331d) part1/part1 time: [28.404 µs 28.897 µs 29.428 µs] change: [-3.9832% -1.6333% +0.8383%] (p = 0.20 > 0.05) No change in performance detected. Found 4 outliers among 100 measurements (4.00%) 4 (4.00%) high mild part2/part2 time: [143.02 µs 143.89 µs 144.99 µs] change: [-1.6549% -0.2855% +0.7242%] (p = 0.70 > 0.05) No change in performance detected. Found 6 outliers among 100 measurements (6.00%) 3 (3.00%) low mild 3 (3.00%) high severe part2/part2_nom time: [499.47 µs 500.23 µs 501.16 µs] change: [-0.0951% +0.2081% +0.5882%] (p = 0.22 > 0.05) No change in performance detected. Found 6 outliers among 100 measurements (6.00%) 1 (1.00%) low severe 1 (1.00%) low mild 3 (3.00%) high mild 1 (1.00%) high severe part2/part2_aho_corasick time: [28.380 ms 28.436 ms 28.498 ms] change: [-1.0464% -0.5065% -0.0087%] (p = 0.06 > 0.05) No change in performance detected. Found 7 outliers among 100 measurements (7.00%) 1 (1.00%) low mild 2 (2.00%) high mild 4 (4.00%) high severe ```

nvzqz / divan

Adding unrelated code affects benchmark #36

Runs without part2_subject

with part2_subject module