psf / pyperf

Toolkit to run Python benchmarks
http://pyperf.readthedocs.io/
MIT License

Use better measures than average and standard deviation #1

Closed · mahmoud closed this issue 8 years ago

mahmoud commented 8 years ago

Hey Victor! Great library you've got going here (not a surprise; I've been a fan since long before faulthandler was in the stdlib).

As I mentioned on Twitter, I think perf would benefit from using a more robust set of measures than average + standard deviation. I recommended median and median absolute deviation. I have a well-rounded Python2/3 compatible implementation in boltons.statsutils. You can lift the whole module, if you'd like, or just reimplement those functions.

In addition to making the statistics more robust, this also gets rid of the pesky problem of whether to use Bessel's correction. The Python 3 statistics module is rife with it, and in this domain it raises more questions than it answers.

I've written a bit about statistics for exactly this kind of case, so I'm obligated to link: Statistics for Software. Let me know if that helps and/or if I can help clarify further.
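For reference, the two measures only take a few lines of plain Python (a minimal sketch, not the boltons implementation):

def median(values):
    """Middle value of the sorted data (average of the two middle values if even)."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2.0

def median_abs_deviation(values):
    """Median of the absolute deviations from the median (MAD)."""
    m = median(values)
    return median(abs(v - m) for v in values)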

superbobry commented 8 years ago

Just for the record, criterion, a Haskell benchmarking library, also does mean and SD, but complements them with bootstrap confidence intervals. This approach is more computationally intensive than median and MAD, but I think it could be more useful in practice, because ranges carry more information than point estimates.
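For illustration, the basic percentile bootstrap fits in a few lines of Python (a sketch only; criterion's estimator is more refined than this):

import random
from statistics import mean

def bootstrap_ci(samples, stat=mean, n_resamples=10_000, alpha=0.05):
    # Resample with replacement, recompute the statistic each time,
    # and take the alpha/2 and 1 - alpha/2 percentiles as the interval.
    estimates = sorted(
        stat(random.choices(samples, k=len(samples)))
        for _ in range(n_resamples)
    )
    lo = estimates[int(n_resamples * alpha / 2)]
    hi = estimates[int(n_resamples * (1 - alpha / 2)) - 1]
    return lo, hi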

mahmoud commented 8 years ago

That's interesting. We looked at bootstrap methods for our performance analysis at work (and my benchmarks in general), but found them to be heavy-handed, not to mention misleading for empirical performance measurements. For instance, criterion performs linear regression analysis, but what does that mean for nonlinear cases?

With 99% of performance work, especially microbenchmarks, I'm interested in the exact measurements, and will handle the predictions myself, thank you.

Love criterion's KDEs though. Not sure if that could ever work in the console, but one can dream. :)

superbobry commented 8 years ago

The assumption criterion uses is that the time can be explained as a weighted combination of the # of benchmark runs plus other variables, e.g. GC stats. This is a simple model and, as with all models, it isn't perfect, but I think it is still pretty good (see also core_bench). We did some synthetic benchmarks on it a while ago in the context of CPython and haven't seen any misleading measurements.
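Roughly, the idea is to time batches of different sizes and then regress total time against the iteration count; a sketch with numpy (not criterion's or core_bench's actual code):

import timeit
import numpy as np

# Measure batches of increasing size, then fit total_time ~= slope * iterations + overhead.
iterations = np.array([2 ** k for k in range(4, 12)])
times = np.array([timeit.timeit("sum(range(100))", number=int(n)) for n in iterations])

slope, overhead = np.polyfit(iterations, times, 1)  # ordinary least-squares line
print(f"per-iteration estimate: {slope * 1e9:.1f} ns, fixed overhead: {overhead * 1e9:.1f} ns")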

Also, the regression part of criterion is unrelated to bootstrap. The latter simply allows you to construct estimates without assuming any specific model for the data.

Can you elaborate on your experience with bootstrap/regression methods for benchmarking? Specific examples leading to inaccurate estimates would be very helpful.

mahmoud commented 8 years ago

The primary question is one of audience. Bootstrapping to get confidence intervals is often misleading because the process and results are nontrivial to interpret. Criterion tries hard to work around this, but I still don't think it closes the gap nearly enough to be useful as a timeit stand-in.

Adding a second dimension, I have a philosophical question mark with regard to whether bootstrapping applies. It can be applied (the numbers go into the equations), but is it valuable to apply it? Looking at the origins of bootstrapping, it's a frequentist formulation designed to make threadbare samples more usable with other frequentist techniques, as the name suggests. I think there is an argument to be made that in many performance cases we don't have an imperfect sample. We often have the ability to perfectly sample, or even compute statistics on the entire population.

Bootstrapping and jackknifing are certainly interesting. I initially had a section on them in Statistics for Software. But the more I ran it past other developers, the more I concluded that they were not practical performance tools, and arguably added to the confusion around an already confusing topic. Better to stick to the nonparametric fundamentals.

vstinner commented 8 years ago

@mahmoud: "I recommended median and median absolute deviation."

I don't fully understand the difference between median and mean, nor which one is the "good" one: the one more representative of the performance users will see on their computers.

I plotted the distribution of one benchmark, "bench_dict.py if -p 250 -n 10" (2500 samples) on a very busy system without CPU isolation:

[image: if_busy]

I plotted a lot of files with various configurations: CPU isolation on/off, system idle/busy, different programs, etc. This shape is quite common in my tests: the curve is cut off at the left and has a more or less long tail on the right.

The red line is the median, the blue line is the mean. The mean line looks like it is the mode, whereas the median is to the left. I counted the number of samples:

mahmoud commented 8 years ago

Re: mean and median, the short answer is that the median is the better one. If there were an anomaly in testing and one test ran very fast or very slow, the mean would be the one that is skewed. A similar relationship exists between standard deviation and median absolute deviation (MAD).
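A toy example with made-up timings shows the effect:

from statistics import mean, median

samples = [52, 53, 53, 54, 55, 250]  # hypothetical timings in ms; one anomalous slow run
print(mean(samples))    # ~86.2 -- dragged upward by the single outlier
print(median(samples))  # 53.5  -- barely affected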

You can see this in your chart. The blue mean line is overrepresenting the slower runs, even though they are in the minority. The median gives equal representation, and is the more appropriate indicator of central value.

Re: curve shape, yes, that looks about right for a lot of performance measurements. As I mention (offhandedly) in the statistics post, the two most helpful distributions for modelling performance are the Poisson and Erlang. That said, nonparametric statistics like the minimum and median are the easiest to interpret and build on.

vstinner commented 8 years ago

myself:

The mean line looks like it is the mode, whereas the median is to the left.

Oops, it's just that the curve was explicitly centered on the mean in my code; that's why the mean fits so well on the curve :-D So ignore that sentence from my previous message... I had never really used numpy+scipy before, and I'm still learning how to plot things :-)

vstinner commented 8 years ago

@superbobry: "Just for the record, criterion, a Haskell benchmarking library, also does mean and SD, but complements them with bootstrap confidence intervals. This approach is more computationally intensive than median and MAD, but I think it could be more useful in practice, because ranges carry more information than point estimates."

There are many ways to display data: draw a histogram, display the mean, etc. But let me explain the basic use case: the perf module is written for developers who want to check if an optimization makes the code faster or slower, and so need to compare two benchmark results. Currently, most Python developers use the minimum, since it's the de facto standard from the timeit module. Using the minimum is simple to understand, and with enough samples it's somewhat reliable.
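The typical timeit pattern looks roughly like this (a sketch of common usage, not perf's code):

import timeit

# De facto practice: repeat the measurement and report only the fastest run.
runs = timeit.repeat("sum(range(1000))", repeat=5, number=10_000)
best = min(runs) / 10_000  # per-loop time of the fastest run
print(f"best of 5: {best * 1e9:.1f} ns per loop")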

Multiple ranges seem overcomplicated for developers. Most of them don't know anything about statistics and only want a single number. I'm trying to do better by providing two numbers: mean + standard deviation. The perf module is too young to know if users and developers are able to understand that.

vstinner commented 8 years ago

I had a discussion on the #pypy IRC channel to try to understand the difference between mean and median:

haypo: better for what purpose? median is robuster against outliers, is that a feature or a drawback?

I understood that there are two different and incompatible points of view:

If I had to choose between the two use cases, I would pick (2), because it's more important to compare results and get reliable and reproducible benchmarks.

If I understood correctly, the mean should be used for (1) and the median for (2).

vstinner commented 8 years ago

Re: mean and median, the short answer is that the median is the better one. If there were an anomaly in testing and one test ran very fast or very slow, the mean would be the one that is skewed. A similar relationship exists between standard deviation and median absolute deviation (MAD).

About the median absolute deviation (MAD): it's still a little bit hard for me to understand why MAD would be better than standard deviation.

If I use mean+std dev or median+std dev, the range usually contains between 75% and 90% of the samples.

If I use median+MAD, the range contains 50% of the samples. I guess that this comes straight from the definition of MAD.
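The coverage is easy to check with a small script; a rough sketch on skewed toy data (not my actual benchmark samples):

import random
from statistics import mean, median, stdev

def mad(values):
    m = median(values)
    return median(abs(v - m) for v in values)

def coverage(values, center, spread):
    # Fraction of samples falling inside [center - spread, center + spread].
    return sum(center - spread <= v <= center + spread for v in values) / len(values)

# Skewed toy data: a fast "floor" plus a long right tail, like a busy system.
random.seed(0)
samples = [10.0 + random.expovariate(2.0) for _ in range(2500)]

print(coverage(samples, mean(samples), stdev(samples)))    # well above 50% here
print(coverage(samples, median(samples), mad(samples)))    # ~50% by construction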

So the MAD range looks smaller than the std dev range. Why do you consider it better to have a smaller range?

Would you be OK with using the median to ignore outliers caused by the "system noise", but the standard deviation to "show" the system noise?

In the development version, I'm now using the standard deviation to warn users when the benchmark "looks" unstable.

Example of unstable benchmark because the number of loops is too low:

$ python3 -m perf.timeit --loops=10 pass
.........................
WARNING: the benchmark seems unstable, the standard deviation is high (11%)
Try to rerun the benchmark with more runs, samples and/or loops

ERROR: the benchmark may be very unstable, the shortest sample only took 310 ns
Try to rerun the benchmark with more loops or increase --min-time

Average: 36.9 ns +- 4.2 ns

vstinner commented 8 years ago

WARNING: the benchmark seems unstable, the standard deviation is high (11%)

Note: 11% comes from stdev(samples) / mean(samples). I'm not sure if this check is reliable for detecting "unstable" benchmarks.
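In other words, the check is a coefficient of variation; something along these lines (an illustrative sketch with an arbitrary 10% threshold, not perf's exact code):

from statistics import mean, stdev

def check_stability(samples, threshold=0.10):
    # Normalized standard deviation (coefficient of variation): stdev / mean.
    cv = stdev(samples) / mean(samples)
    if cv > threshold:
        print(f"WARNING: the benchmark seems unstable, "
              f"the standard deviation is high ({cv:.0%})")
    return cv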

vstinner commented 8 years ago

It looks like mean and median are almost equal when the benchmark is run with isolated CPUs: https://haypo.github.io/journey-to-stable-benchmark-system.html

Example:

Mean + std dev: 46.0 ms +- 0.7 ms
Median +- MAD: 45.9 ms +- 0.6 ms
Median +- std dev: 45.9 ms +- 0.7 ms

The difference is larger without isolated CPUs, i.e. with the "system noise". Example on a very busy system:

Mean + std dev: 1.17 us +- 0.35 us
Median +- MAD: 1.05 us +- 0.20 us
Median +- std dev: 1.05 us +- 0.35 us

mahmoud commented 8 years ago

Re: the two points of view: I agree that you want the second option, minimizing system noise for more comparability. Not all noise is made equal, so the median is the way to go.

Re: MAD, if you're trying to detect overall spread as an indicator of system noise, then I agree that standard deviation might do a better job there. Really though, some benchmarks may naturally have a 10%+ normalized standard deviation, and what you're trying to do is detect the outliers. Consider printing the min-max range with the proportional deviation.

Median: 100.0us
Min: 50us (-50.0%)
Max: 200.0us (+100.0%)
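That output only takes a few lines to produce; a sketch (a hypothetical helper, approximating the format above):

from statistics import median

def print_spread(samples, unit="us"):
    # Median plus min/max expressed as percentage deviation from the median.
    med = median(samples)
    lo, hi = min(samples), max(samples)
    print(f"Median: {med:.1f}{unit}")
    print(f"Min: {lo:.1f}{unit} ({(lo - med) / med:+.1%})")
    print(f"Max: {hi:.1f}{unit} ({(hi - med) / med:+.1%})")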

For boltons, I have a describe method inspired by Pandas, with the addition of MAD. It provides all the greatest hits:

>>> import boltons.statsutils
>>> stats = boltons.statsutils.Stats(range(1, 8))
>>> print(stats.describe(format='text'))
count:    7
mean:     4.0
std_dev:  2.0
mad:      2.0
min:      1
0.25:     2.5
0.5:      4
0.75:     5.5
max:      7

But I understand if that's too much output. With decent histogram binning, the min comes right out and the max is usually not hard to infer.

All said and done, I see the appeal of standard deviation for this case. It provides somewhat dampened noise detection.

mahmoud commented 8 years ago

Also, it bears repeating for the record that the minimum was not simply an arbitrary default choice made by timeit.

  1. Most practical performance scenarios that are worthy of microbenchmarking are positively skewed (tail is on the right)
  2. Up until PyPy and Python 3 started becoming viable, CPython 2 was the Python lingua franca. CPython 2 is remarkably consistent. It needs no warmup and had no hash randomization.

I'm not saying that we can't do better, but for a single number, I just wanted to recognize that the minimum has some strong points and a strong history. (Even Kevin concludes that he still uses the minimum.) Still, maybe the time for that has passed.

Personally, I can't think of a more pragmatic advancement than more runs in separate processes with a histogram. That's what makes perf great!

vstinner commented 8 years ago

CPython 2 is remarkably consistent. It needs no warmup and had no hash randomization.

There are other sources of randomness besides hash randomization, like address space layout randomization (ASLR), command line arguments, environment variables, etc. https://haypo.github.io/journey-to-stable-benchmark-average.html

mahmoud commented 8 years ago

Oh I've read the post! ASLR is real. But its performance impact is not made clear in that post. And regardless, compared to most big name runtimes, CPython is still remarkably consistent. Not perfectly consistent, but remarkably consistent nonetheless. But the world is bigger than CPython 2 now.

vstinner commented 8 years ago

(Oh, it looks like I lost my long comment. I had to rewrite it from scratch :-( )

I ran the bm_regex_v8.py benchmark on a very busy system. I ran random scripts to make my computer much slower, but not constantly: I stopped the scripts, then restarted them, etc. Result:

[image: regex_v8_busy]

I see two Gaussian curves:

In short, the benchmark is 1.6x slower when the system is busy.

Statistics:

Number of samples: 2500
Minimum 46.0 ms
Maximum 121 ms

Mean + std dev: 60.2 ms +- 13.5 ms
Median +- std dev: 51.4 ms +- 13.5 ms
Median +- MAD: 51.4 ms +- 5.0 ms

Skewness: 0.47

Mean+stdev range buckets: 71.8% (1794/2500)
Median+mad range buckets: 50.0% (1250/2500)
Median+stdev range buckets: 56.7% (1417/2500)

The blue median+MAD curve is the closest to the left Gaussian curve (performance when the system is idle). The red mean+std dev curve contains more samples than the two median curves. The green median+std dev curve is somehow a compromise between the other two.

Now the million-dollar question is: what is the "correct" curve?

This example was created to exaggerate the effect of the system noise and understand the purpose of mean/median and std dev/MAD.

State of my mind:

mahmoud commented 8 years ago

Once you get into bimodal/multimodal data, such as that which inconsistent load creates, the game gets massively complex. The "center" loses meaning, because you really have two centers. This is where the histogram is absolutely necessary and maybe not sufficient.

I still say median + standard deviation for providing a general two-number figure that can help with noise detection. The min-max range, interquartile range, and histogram would be my next steps.
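All of those are cheap to compute with the stdlib; a rough sketch (statistics.quantiles needs Python 3.8+):

from statistics import quantiles

def spread_summary(samples):
    q1, q2, q3 = quantiles(samples, n=4)  # quartiles; q2 is the median
    return {
        "median": q2,
        "iqr": q3 - q1,                         # interquartile range
        "range": (min(samples), max(samples)),  # min-max range
    }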

vstinner commented 8 years ago

Some other examples.

call_simple.

[image: call_simple]

On this one, the std dev seems inaccurate because it includes the "noise".

telco, a nice full Gaussian curve :-)

[image: telco]

timeit: "pass" instruction on Python 2.7, min=10 ns, max=11 ns. 1000 samples.

[image: timeit]

timeit 2: "pass" on Python 3 with CPU isolation. 98% of 2500 samples have the value 14.4 ns!

[image: timeit2]

Raw data:

14.4 ns: 2452 #############################################################################################
14.5 ns:    1 |
14.6 ns:    9 |
14.7 ns:    0 |
14.8 ns:    0 |
14.9 ns:    2 |
15.0 ns:    1 |
15.1 ns:    5 |
15.2 ns:   14 #
15.3 ns:    6 |
15.4 ns:    0 |
15.5 ns:    0 |
15.6 ns:    0 |
15.7 ns:    0 |
15.8 ns:    0 |
15.9 ns:   10 |

Total number of samples: 2500

Mean + std dev: 14.4 ns +- 0.1 ns
Median +- std dev: 14.4 ns +- 0.1 ns
Median +- MAD: 14.4 ns +- 0.0 ns

timeit 2, zoomed:

[image: timeit2_zoom]

Again, here the std dev doesn't seem accurate since it includes the "noise", the long right tail.

vstinner commented 8 years ago

FYI I started to work on a "stats" command for perf:

haypo@selma$ python3 -m perf stats timeit_pass_py2_isolcpus.json 
Number of samples: 2500
Minimum 9.87 ns
Maximum 12.5 ns

Mean + std dev: 10.3 ns +- 0.3 ns
Median +- std dev: 10.2 ns +- 0.3 ns
Median +- MAD: 10.2 ns +- 0.1 ns

Skewness: 4.78

Mean+stdev range buckets: 94.4% (2360/2500)
Median+mad range buckets: 50.0% (1250/2500)
Median+stdev range buckets: 94.4% (2360/2500)

The "range buckets" part should go away as soon as this issue is fixed.

We can easily extend this command since it will not be the default.

mahmoud commented 8 years ago

That's great! I may have a pull request for you in the near future in that case.

vstinner commented 8 years ago

I released perf 0.4, which uses boltons.statsutils (if available) in the stats and hist_scipy commands. The stats command computes the mean, median, standard deviation, median absolute deviation and skewness.

Just after the perf 0.4 release, I replaced mean with median in the development branch.

vstinner commented 8 years ago

I removed the mean from perf and finished replacing the mean with the median everywhere.

I plotted a lot of benchmarks, and I'm not fully convinced yet that the median absolute deviation (MAD) is "better" than the standard deviation (stdev). I kept the standard deviation and removed MAD from perf.

In my experience, stdev makes it easier to identify when a benchmark is unstable and better represents the actual benchmark results. IMO MAD hides too much information. The temptation to ignore/remove/hide outliers is high, but I'm not yet convinced that doing so is correct.

Anyway, this discussion was helpful. At least, I replaced mean with median :-)

I also replaced "Average: ... +- ..." with "Median +- std dev: ... +- ..." to be very explicit about what is displayed. That also makes it possible to change it later.

mahmoud commented 8 years ago

Yep! Couldn't have said it better myself. Glad this helped!

vstinner commented 8 years ago

Example where MAD seems overoptimistic. The benchmark was run on a very busy system; it was a deliberate test to check how perf describes such "random" samples with a lot of variation.

MAD is "+- 1.2 ms", whereas the std dev is "+- 112.8 ms".

$ python3 -m perf show call_simple_busy.json 
ERROR: the benchmark is very unstable, the standard deviation is very high (stdev/median: 239%)!
Try to rerun the benchmark with more runs, samples and/or loops

Median +- std dev: 47.2 ms +- 112.8 ms

$ python3 -m perf stats call_simple_busy.json 
Number of samples: 1000

Minimum: 45.6 ms (-3.4%)
Mean + std dev: 105 ms +- 113 ms
Median +- std dev: 47.2 ms +- 112.8 ms
Median +- MAD: 47.2 ms +- 1.2 ms
Maximum: 593 ms (+1155.8%)

Skewness: 1.97

$ python3 -m perf hist call_simple_busy.json 
43.8 ms: 716 ###############################################################################
65.7 ms:  14 ##
87.6 ms:  39 ####
110 ms:   9 #
131 ms:  29 ###
153 ms:   9 #
175 ms:  19 ##
197 ms:   5 #
219 ms:  19 ##
241 ms:  15 ##
263 ms:   7 #
285 ms:  10 #
307 ms:  18 ##
329 ms:  26 ###
351 ms:  12 #
372 ms:  12 #
394 ms:  12 #
416 ms:  10 #
438 ms:   8 #
460 ms:   2 |
482 ms:   4 |
504 ms:   1 |
526 ms:   1 |
548 ms:   0 |
570 ms:   2 |
591 ms:   1 |

mahmoud commented 8 years ago

I mean, that's fine. You're designing the stats to match a certain intuition, which is probably wise, given how people will use them. You could get a similar, if not better, nonparametric result with ranges or trimmed ranges.
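For example, a trimmed range just drops a fixed fraction of each tail before taking the extremes; a sketch (the 5% trim is an arbitrary example value):

def trimmed_range(samples, trim=0.05):
    # Discard the lowest and highest `trim` fraction of samples, then take min/max.
    s = sorted(samples)
    k = int(len(s) * trim)
    kept = s[k:len(s) - k] if k else s
    return kept[0], kept[-1]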

Standard deviation isn't the worst for this case. And standard deviation feels familiar. But how intuitive is it really?

[image: standard_deviation_diagram.svg]

In a normal distribution, the standard deviation is ~33% of the range. That's not really an intuitive calculation, and that's the one that's most commonly taught. It gets harder for other distributions.

So really the standard deviation here is being used as a semi-arbitrary indicator of variation that enables some comparison between runs. It's not the worst thing. Still, I'm going to look to the median and histogram the most. :)

Keep up the good work!