str4d opened this issue 7 years ago
I just ran into this and was also a bit unenthused with the `+/-` calculation.
The standard deviation would be preferred.
Would changing the semantics of this output be a breaking change?
I've started doing a lot more benching and this is rather getting on my nerves. This is my suggestion: https://github.com/saethlin/rust/commit/20e3955261df504e4e7a626c0cb47bbab4bde708
Commit message, with my thoughts on what stats should be the default:

- Median is pretty good; we can keep that.
- If we want to describe a measure of spread, I think median absolute deviation is the easy choice here. Running a microbenchmark on a normal desktop, I expect considerable outliers, and the MAD is resistant to them in a way that the standard deviation or the range is not.
- The min is probably the most valuable stat here. If we believe that all other variation in the benchmark is due to CPU jitter or warmup, the fastest run is the most reproducible measurement; every other measure is sensitive to current CPU load. (If anyone here has exercised the current version a lot, you may have noticed that your median times, and especially the `+/-` number, go way down when you close your web browser, music player, chat clients, etc.)
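For reference, here is a minimal sketch of the two robust statistics proposed above (this is illustrative only, not the code from the linked commit; it assumes the per-iteration timings are available as `f64` nanoseconds):

```rust
/// Median of an already-sorted slice (mean of the two middle values for even lengths).
fn median(sorted: &[f64]) -> f64 {
    let n = sorted.len();
    if n % 2 == 1 {
        sorted[n / 2]
    } else {
        (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0
    }
}

/// Median absolute deviation: the median of |x - median(x)|.
/// A few large outliers (e.g. from CPU jitter) barely move this value,
/// unlike the standard deviation or the min/max range.
fn mad(samples: &[f64]) -> f64 {
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let m = median(&sorted);
    let mut devs: Vec<f64> = sorted.iter().map(|x| (x - m).abs()).collect();
    devs.sort_by(|a, b| a.partial_cmp(b).unwrap());
    median(&devs)
}

fn main() {
    // Fake timings (ns) with one outlier from a scheduling hiccup: the MAD
    // stays near the typical deviation, while the range balloons to ~380 ns.
    let times = [101.0, 99.0, 100.0, 102.0, 98.0, 100.0, 480.0];
    println!("median = {}, mad = {}", median_of(&times), mad(&times));
}

/// Convenience wrapper that sorts before taking the median.
fn median_of(samples: &[f64]) -> f64 {
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).unwrap());
    median(&sorted)
}
```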
I'll also add that if we want a measure of error, we should probably report the median absolute deviation divided by the square root of the number of trials (analogous to the standard deviation of the mean). I'm unclear on whether reporting spread or error is more valuable, especially since, if we're going to lean on the minimum as God's Honest Truth for how fast a piece of code is, the spread is one-sided. However, if people are logging this output for performance tracking, it would be quite valuable to quantify the interval over which subsequent trials can be expected to vary. If we want to communicate that information, should we report 2-sigma errors?
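Concretely, the error measure floated here would look something like the following sketch, which builds on the hypothetical `median`/`mad` helpers from the snippet above and borrows the 1/sqrt(n) scaling from the standard error of the mean:

```rust
/// Standard-error-style measure: MAD scaled by 1/sqrt(n), a rough estimate
/// of how much the reported median could move across fresh sets of n runs.
fn mad_error(samples: &[f64]) -> f64 {
    mad(samples) / (samples.len() as f64).sqrt()
}

/// A 2-sigma-style interval around the median, per the question above.
fn two_sigma_interval(samples: &[f64]) -> (f64, f64) {
    let m = median_of(samples);
    let half_width = 2.0 * mad_error(samples);
    (m - half_width, m + half_width)
}
```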
Triage: no change that I'm aware of.
Currently, `cargo bench` prints output along these lines:
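(The sample output from the original report isn't preserved in this copy; the block below is an illustration of the libtest benchmark output format under discussion, with made-up test names and numbers.)

```text
running 2 tests
test bench_decode ... bench:      12,345 ns/iter (+/- 1,234)
test bench_encode ... bench:       6,789 ns/iter (+/- 456)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured
```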
The values are derived here, from which we can determine that the key value is a median, which is good! Unfortunately, the other value given is a range, which is not particularly useful: the `+/-` only makes sense if the underlying benchmark has a normal (Gaussian) distribution, and depending on the benchmark, that may not be the case. Say the output is `100 +/- 20`: there is no way to distinguish between a `(min, max)` of `(81, 101)` versus `(99, 119)`, which should be interpreted very differently by the programmer.

I would personally like to gain access to the min and max values for the purpose of CI benchmarks (e.g. here), so I'd like to see those values exposed either by default, or accessible via a flag. Alternatively (or in addition), the range should be replaced with a standard deviation or standard error.
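For what it's worth, the quantities requested here are cheap to compute once the raw per-iteration samples are available; a minimal sketch in the same vein as the snippets above (a hypothetical helper, not an existing libtest API):

```rust
/// Min, max, and sample standard deviation of a set of timings: the values
/// the report above asks to have exposed alongside (or instead of) the range.
fn min_max_stddev(samples: &[f64]) -> (f64, f64, f64) {
    let min = samples.iter().cloned().fold(f64::INFINITY, f64::min);
    let max = samples.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    // Sample (n - 1) variance; assumes at least two samples.
    let var = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0);
    (min, max, var.sqrt())
}
```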