Expose min and max values in benchmarks

str4d commented 7 years ago

Currently, cargo bench prints output along these lines:

test bls12_381::bench_pairing_final_exponentiation ... bench:   1,753,823 ns/iter (+/- 30,711)
test bls12_381::bench_pairing_full                 ... bench:   2,641,309 ns/iter (+/- 35,774)
test bls12_381::bench_pairing_g1_preparation       ... bench:      13,801 ns/iter (+/- 293)
test bls12_381::bench_pairing_g2_preparation       ... bench:     233,722 ns/iter (+/- 11,857)
test bls12_381::bench_pairing_miller_loop          ... bench:     618,548 ns/iter (+/- 26,680)
test bls12_381::ec::g1::bench_g1_add_assign        ... bench:       1,189 ns/iter (+/- 53)

The values are derived here, from which we can determine that the key value is a median, which is good! Unfortunately, the other value given is the range (max minus min), which is not particularly useful:

As printed, it looks like an error or standard deviation; either misinterpretation would mean that the uncertainty is overstated by at least a factor of two!
The range itself, in the context of a +/-, only makes sense if the underlying benchmark has a normal (Gaussian) distribution. But depending on the benchmark, that may not be the case. Say the output is 100 +/- 20: there is no way to distinguish between a (min, max) of (81, 101) versus (99, 119), which should be interpreted very differently by the programmer.
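To make the ambiguity concrete, here is a minimal sketch. It is not libtest's actual code: the `Summary` struct and the `summarize` and `format_like_bench` helpers are hypothetical names, the sample values are made up, and the formatting only imitates the shape of the output above. Two very different sets of timings collapse to the same printed line:

```rust
/// Hypothetical summary of per-iteration timings, in nanoseconds.
#[derive(Debug, PartialEq)]
struct Summary {
    min: u64,
    median: u64,
    max: u64,
}

/// Sort a copy of the samples and pick out min, median and max.
fn summarize(samples: &[u64]) -> Summary {
    assert!(!samples.is_empty(), "need at least one sample");
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    let n = sorted.len();
    let median = if n % 2 == 1 {
        sorted[n / 2]
    } else {
        (sorted[n / 2 - 1] + sorted[n / 2]) / 2
    };
    Summary { min: sorted[0], median, max: sorted[n - 1] }
}

/// Imitates the "median (+/- range)" shape of the output, where the
/// second number is max - min. Assumption: simplified formatting only.
fn format_like_bench(s: &Summary) -> String {
    format!("{} ns/iter (+/- {})", s.median, s.max - s.min)
}

fn main() {
    // One slow outlier versus one fast outlier (made-up data).
    let slow_outlier = summarize(&[99, 100, 100, 100, 119]);
    let fast_outlier = summarize(&[81, 100, 100, 100, 101]);

    // Both lines print identically, hiding which side the outlier is on.
    println!("{}", format_like_bench(&slow_outlier));
    println!("{}", format_like_bench(&fast_outlier));

    // Printing min and max directly removes the ambiguity.
    println!("{:?}", slow_outlier);
    println!("{:?}", fast_outlier);
}
```

With min and max shown next to the median, the reader can tell a run with an occasional slow iteration from one with an occasional fast iteration.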

I would personally like to gain access to the min and max values for the purpose of CI benchmarks (e.g. here), so I'd like to see those values exposed either by default, or accessible via a flag. Alternatively (or in addition), the range should be replaced with a standard deviation or standard error.

JustAPerson commented 6 years ago

I just ran into this and was also a bit unenthused with the +/- calculation. The standard deviation would be preferred. Would changing the semantics of this output be a breaking change?

saethlin commented 6 years ago

I've started doing a lot more benching and this is rather getting on my nerves. This is my suggestion: https://github.com/saethlin/rust/commit/20e3955261df504e4e7a626c0cb47bbab4bde708

Commit message, with my thoughts on what stats should be the default: Median is pretty good, we can keep that. If we want to describe a measure of spread, I think median absolute deviation is the easy choice here; when running a microbenchmark on a normal desktop, I expect considerable outliers. Median absolute deviation is resistant to them, unlike standard deviation or range. The min is probably the most valuable stat here; if we believe that all other variation in the benchmark is due to CPU jitter or warmup, the fastest run is the most reproducible measurement. All other measures are sensitive to current CPU load. (If anyone here has exercised the current version a lot, you may have noticed that your median times and especially the +/- number go way down when you close your web browser, music player, chat clients, etc.)
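For illustration, a rough sketch of how these statistics react to a single outlier. This is not the code from the linked commit; the function names and the sample data are made up for the example:

```rust
/// Median of a non-empty slice (averages the two middle values when the
/// length is even). Panics on NaN.
fn median(values: &[f64]) -> f64 {
    assert!(!values.is_empty(), "need at least one sample");
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).expect("NaN in samples"));
    let n = sorted.len();
    if n % 2 == 1 {
        sorted[n / 2]
    } else {
        (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0
    }
}

/// Median absolute deviation: the median of |x - median(x)|.
fn median_abs_dev(values: &[f64]) -> f64 {
    let med = median(values);
    let deviations: Vec<f64> = values.iter().map(|v| (v - med).abs()).collect();
    median(&deviations)
}

/// Sample standard deviation (n - 1 denominator); needs at least two samples.
fn std_dev(values: &[f64]) -> f64 {
    assert!(values.len() >= 2, "need at least two samples");
    let n = values.len() as f64;
    let mean = values.iter().sum::<f64>() / n;
    let variance = values.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / (n - 1.0);
    variance.sqrt()
}

/// Smallest sample, i.e. the fastest run.
fn min(values: &[f64]) -> f64 {
    values.iter().cloned().fold(f64::INFINITY, f64::min)
}

fn main() {
    // Made-up timings in ns: a quiet machine, and the same run with one
    // iteration disturbed by background load.
    let quiet = [100.0, 101.0, 102.0, 103.0, 104.0];
    let noisy = [100.0, 101.0, 102.0, 103.0, 500.0];

    for (label, samples) in [("quiet", &quiet), ("noisy", &noisy)] {
        println!(
            "{label}: min={} median={} mad={} std_dev={:.1}",
            min(samples),
            median(samples),
            median_abs_dev(samples),
            std_dev(samples),
        );
    }
}
```

In this toy data the min, median and median absolute deviation are identical for both runs, while the standard deviation (and the range) blow up because of the single slow iteration.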

I'll also add that if we want a measure of error, we should probably report the median absolute deviation divided by the square root of the number of trials (like standard deviation of the mean). I'm unclear on whether reporting spread or error is more valuable, especially since, if we're going to lean on the minimum as God's Honest Truth for how fast a piece of code is, the spread is one-sided. However, if people are logging this output for performance tracking, it would be quite valuable to quantify the interval over which one can expect subsequent trials to vary. If we want to communicate that information, should we report 2-sigma errors?
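A minimal sketch of that error estimate, under some assumptions: `mad_error` and `two_sigma_interval` are hypothetical names, the data is made up, and the interval simply takes the median plus or minus twice the error. It does not apply the usual 1.4826 scale factor that converts a median absolute deviation into a standard deviation estimate for normally distributed data.

```rust
/// Median of a non-empty slice (averages the two middle values when the
/// length is even). Panics on NaN.
fn median(values: &[f64]) -> f64 {
    assert!(!values.is_empty(), "need at least one sample");
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).expect("NaN in samples"));
    let n = sorted.len();
    if n % 2 == 1 {
        sorted[n / 2]
    } else {
        (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0
    }
}

/// Median absolute deviation: the median of |x - median(x)|.
fn median_abs_dev(values: &[f64]) -> f64 {
    let med = median(values);
    let deviations: Vec<f64> = values.iter().map(|v| (v - med).abs()).collect();
    median(&deviations)
}

/// Error estimate described above: MAD divided by sqrt(number of trials),
/// by analogy with the standard deviation of the mean.
fn mad_error(values: &[f64]) -> f64 {
    median_abs_dev(values) / (values.len() as f64).sqrt()
}

/// Assumption for illustration: a "2-sigma"-style interval taken as
/// median +/- 2 * mad_error, with no normal-consistency scaling.
fn two_sigma_interval(values: &[f64]) -> (f64, f64) {
    let med = median(values);
    let err = mad_error(values);
    (med - 2.0 * err, med + 2.0 * err)
}

fn main() {
    // Made-up timings in ns.
    let samples = [98.0, 100.0, 102.0, 104.0];
    let (lo, hi) = two_sigma_interval(&samples);
    println!(
        "median={} mad={} error={} interval=({}, {})",
        median(&samples),
        median_abs_dev(&samples),
        mad_error(&samples),
        lo,
        hi,
    );
}
```

Unlike the spread, this error shrinks as the number of trials grows, which is why the two answer different questions.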

steveklabnik commented 4 years ago

Triage: no change that I'm aware of.

rust-lang / rust

Expose min and max values in benchmarks #44358