johnmyleswhite / Benchmarks.jl

A new benchmarking library for Julia

Statistical error? #4

Closed sbromberger closed 9 years ago

sbromberger commented 9 years ago

I ran a (long) benchmark and the lower bound of the 95% CI is less than the minimum runtime. How should I interpret that result? (Should the CI bounds be truncated?)

julia> @benchmark out_neighbors(g,2)
================ Benchmark Results ========================
   Average elapsed time: 15.99 ns
     95% CI for average: [12.98 ns, 18.99 ns]
   Minimum elapsed time: 15.96 ns
                GC time: 0.00%
       Memory allocated: 0 bytes
  Number of allocations: 0 allocations
      Number of samples: 6201
        R² of OLS model: 0.950
Time used for benchmark: 0.04s
            Precompiled: true
       Multiple samples: true
       Search performed: true
johnmyleswhite commented 9 years ago

This is how things are supposed to work. To see why, consider the classical CLT-derived confidence interval: CI = [mean(x) - 2 * sem(x), mean(x) + 2 * sem(x)].

There is no reason to think that mean(x) - 2 * sem(x) should be above minimum(x). If you impose that constraint, your inferences will be inaccurate, because the frequentist guarantees you depend upon don't refer to the minimum anywhere.

Here's a simple simulation to convince you that using the minimum would lead to wrong conclusions:

using Distributions
using Random
using Statistics
using StatsBase

const d = Normal(0, 1)
const n = 2

# Classical CLT-based interval: mean ± 1.96 standard errors of the mean.
function normal_ci(x)
    (
        mean(x) - 1.96 * sem(x),
        mean(x) + 1.96 * sem(x)
    )
end

# The same interval, but truncated so it never extends past the sample extremes.
function quantile_ci(x)
    (
        max(mean(x) - 1.96 * sem(x), minimum(x)),
        min(mean(x) + 1.96 * sem(x), maximum(x)),
    )
end

# Coverage estimate: how often does the interval contain the true mean?
function simulate(get_ci)
    success = 0
    x = Vector{Float64}(undef, n)
    for sim in 1:10_000
        rand!(d, x)
        lower, upper = get_ci(x)
        if lower <= mean(d) <= upper
            success += 1
        end
    end
    success / 10_000
end

simulate(normal_ci)
simulate(quantile_ci)

In this case, I get 70% coverage for the classical CI and 50% coverage for the CI that never goes below the minimum.

If you spend some time thinking about this, I'm pretty sure you'll convince yourself that you can't use the minimum (which is itself being estimated from data) to provide a reliable lower bound on the distribution's behavior.

sbromberger commented 9 years ago

I understand how the CI works, though I'm not sure what value the CI is providing in this case, aside from some insight into the standard deviation.

Perhaps it's that, since I've already got the real-world results of the benchmark, I don't need a (hypothetical) statistical analysis? What would I use it for? (I guess I'd rather have the min/avg/max/stddev of the actual benchmark runtimes. This is probably because I'm more used to analyzing ping output than delving into advanced stats. :) )

johnmyleswhite commented 9 years ago

What are the real-world results you're referring to?

The CIs measure your uncertainty about the results of the benchmarks. This is important if you want to compare two pieces of code. Comparing the means without using CIs is very likely to lead to inaccurate conclusions.

sbromberger commented 9 years ago

For example, the following timing information provided by the benchmark output is immediately useful to me (the memory allocation info is also useful but outside the scope of this discussion):

Average elapsed time: 15.99 ns: this tells me how long, on average, this function took to run on my machine.

Minimum elapsed time: 15.96 ns: this is good information to have because it represents a "best-case" scenario, especially if I'm using random input to the function.

Number of samples: 6201: this is good info because the higher the #, the more I can "trust" the summary results. (stddev would be great here too).

I anticipate using the benchmarks to calculate the effect of changing pieces of my LightGraphs code that are "fundamental" - that is, they're core functions that are used throughout the rest of my module. So if I change the way I calculate the out_edges of a given vertex (from list comprehension to map, for example), I'd be able to see what sort of effect that has on functions that use out_edges (like Dijkstra and, therefore, betweenness centrality).

If I see that the change results in a significant increase in the average run time of the derived functions, then I know it's probably a bad change. If I see that the average is pretty much the same but the minimum elapsed time has increased, I'd want to look at whether or not I can short-circuit somewhere.

johnmyleswhite commented 9 years ago

I think you're underestimating how hard it is to draw valid inferences from benchmark data. It's worth noting several interpretative issues with your definitions of quantities that are being reported.

Average elapsed time: 15.99 ns: this tells me how long, on average, this function took to run on my machine.

Unfortunately, this isn't a valid interpretation of this quantity. The sample average isn't the population average: it's our best attempt to estimate how long the function takes to run. Because it's merely a best-effort estimate, the sample average is often severely inaccurate. Having this number be off by 2x or 3x is very common. If you use means without CIs for your benchmarking, you should assume that as many as 50% of your small-scale optimizations are actually regressions that you've simply measured incorrectly.
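To give a sense of scale, here's a rough sketch of how far a small sample's mean can drift from the true mean when timings occasionally hit a slow path. The timing model and all of the numbers are invented for illustration; this isn't how Benchmarks.jl models timings:

using Distributions
using Random
using Statistics

Random.seed!(1)

# Hypothetical timing model: 95% of runs take ~16 ns, 5% hit a slow path near 1000 ns.
fast = Normal(16.0, 0.5)
slow = Normal(1000.0, 30.0)
true_mean = 0.95 * mean(fast) + 0.05 * mean(slow)

# Repeatedly "benchmark" with only 20 timings and record each sample mean.
sample_means = [begin
    x = [rand() < 0.95 ? rand(fast) : rand(slow) for _ in 1:20]
    mean(x)
end for _ in 1:10_000]

# Fraction of benchmarks whose sample mean is off from the true mean by more than 50%.
mean(abs.(sample_means .- true_mean) ./ true_mean .> 0.5)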

Minimum elapsed time: 15.96 ns: this is good information to have because it represents a "best-case" scenario, especially if I'm using random input to the function.

Again, the sample minimum isn't the population minimum. Imagine a world in which a background process runs for 10 seconds every 30 seconds. If you run your 10-second benchmark while this background process is operating, your minimum will be arbitrarily far removed from the true minimum -- which only occurs when the background process has stopped. We can try to account for this by reporting CIs, but the numbers we report are always going to be overestimates -- and might be overestimates by a factor of 100x or higher. Using these kinds of numbers without acknowledging how much uncertainty we have is very risky.
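The contamination scenario is easy to simulate. A rough sketch, with invented numbers and a flat 3x slowdown standing in for the background process:

using Random

Random.seed!(1)

true_times = 16.0 .+ abs.(randn(1000))   # clean timings; the true minimum is about 16 ns
contended  = true_times .* 3.0           # the same workload measured while a background process runs

# A benchmark that happens to run entirely inside the contaminated window
# reports a minimum roughly 3x larger than the real one.
minimum(contended), minimum(true_times)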

Number of samples: 6201: this is good info because the higher the #, the more I can "trust" the summary results. (stddev would be great here too).

For sure, having more samples is better than fewer samples. In general, standard deviations are useful as well, although their usefulness depends upon mathematical assumptions that are generally false for benchmark data since the CLT doesn't typically apply. (In statistical terminology, the sampling distribution isn't well approximated by a normal distribution.)

Another problem with using the number of samples directly: the actual number of samples is a substantial overestimate of how much you can trust your results, because the data points aren't independent of one another. (The example of a background process running during your benchmark shows why.) In a future pass, I'm going to start doing corrections for non-independence, which will make the CIs more credible.
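As a sketch of how big that overestimate can be, here's the common AR(1)-style effective-sample-size correction, n * (1 - ρ) / (1 + ρ) with ρ the lag-1 autocorrelation. This is only an illustration of the idea, not the correction Benchmarks.jl will use:

using StatsBase

function effective_sample_size(x::AbstractVector)
    ρ = autocor(x, [1])[1]      # lag-1 autocorrelation of the timings
    length(x) * (1 - ρ) / (1 + ρ)
end

# Timings that drift slowly, e.g. because of a background process:
x = 16.0 .+ cumsum(randn(6201)) ./ 100
effective_sample_size(x)        # far smaller than the nominal 6201 samples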

If I see that the change results in a significant increase in the average run time of the derived functions, then I know it's probably a bad change. If I see that the average is pretty much the same but the minimum elapsed time has increased, I'd want to look at whether or not I can short-circuit somewhere.

This is really the crux of the matter: what do we call a significant increase? My rule of thumb is this: you should never claim that you've seen a significant increase unless the upper bound of one implementation's CI is lower than the lower bound of the other implementation's CI. Comparing means without using CIs is likely to lead to the wrong conclusion -- in the worst case (and I've seen this in practice), you're wrong as often as 50% of the time.
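In code, that rule of thumb is just a comparison of interval endpoints. A minimal sketch, with hypothetical (lower, upper) CI bounds in nanoseconds from two runs:

# `a` is credibly faster than `b` only when a's upper bound sits below b's lower bound.
faster_than(a, b) = a[2] < b[1]

old_ci = (12.98, 18.99)   # hypothetical CI for the old implementation
new_ci = (22.10, 25.40)   # hypothetical CI for the new implementation

faster_than(old_ci, new_ci)   # true: the intervals don't overlap, so old is credibly faster
faster_than(new_ci, old_ci)   # false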

sbromberger commented 9 years ago

The sample average isn't the population average: it's our best attempt to estimate how long the function takes to run

Ah, I think this is the gap in my understanding: you're sampling the runtimes and averaging the subset, as opposed to averaging all of them?

The end result of this is that I really appreciate your explanations, and I'd like to suggest that you (or someone with the requisite knowledge of what the stats imply) put together a "Benchmarks.jl for Dummies" guide on how laymen should interpret the results to effect real-world change to their code. My initial (and probably current) understanding of the numbers is dangerously wrong, and I'd imagine that others who do not have a stats background (or who don't understand the approach you've taken here) might be similarly confused.

johnmyleswhite commented 9 years ago

Ah, I think this is the gap in my understanding: you're sampling the runtimes and averaging the subset, as opposed to averaging all of them?

No, the sampling is the typical philosophical conceit in data analysis, in which we pretend that our data set is like a random sample of the true population. For a benchmark, the true population is the results of our benchmark if run forever on a static system. We usually run the benchmark for 10s, which is, clearly, a lot shorter than forever. So we do have a sample, but, unlike a clean applied stats use case, our sample isn't random (and therefore doesn't generalize correctly).

The end result of this is that I really appreciate your explanations, and I'd like to suggest that you (or someone with the requisite knowledge of what the stats imply) put together a "Benchmarks.jl for Dummies" guide on how laymen should interpret the results to effect real-world change to their code. My initial (and probably current) understanding of the numbers is dangerously wrong, and I'd imagine that others who do not have a stats background (or who don't understand the approach you've taken here) might be similarly confused.

Are you going to JuliaCon? I feel like the easiest thing to do here would be to get somebody like you without a stats background up to speed on the measurement issues, then have them write up a doc in language that would make sense to them. I doubt I'm really able to explain things in language that doesn't try to emulate the jargon I'm used to.

Even better would be to choose a reporting UI that makes things clear no matter who's reading the results. Does the Criterion UI help? http://www.serpentine.com/criterion/

sbromberger commented 9 years ago

I'm out of the country at the moment, and on the west coast in any case, so no JuliaCon for me. I'm happy to be your guinea pig for the documentation, though, if you'd like to do a tele/videocon at some point.

As far as UI goes, perhaps that'd work, but what I really want for LightGraphs is to integrate this into CI (continuous integration) testing so that if we see a regression, the tests fail. No UI needed in that case :)
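Something like the following is what I have in mind, as a sketch only: the bounds are placeholders, and I'm not assuming any particular Benchmarks.jl API for pulling them out of a result.

using Test

baseline_upper_ns = 18.99   # upper CI bound from a known-good run (placeholder)
current_lower_ns  = 16.20   # lower CI bound from the current run (placeholder)

# Fail only when the current lower bound clears the baseline's upper bound,
# i.e. the intervals no longer overlap and the code is credibly slower.
@test current_lower_ns <= baseline_upper_ns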

johnmyleswhite commented 9 years ago

Let's chat after I'm back from JuliaCon then. Building automated infra to detect perf regressions is hard, but I'd like to make it easier.

sbromberger commented 9 years ago

hey @johnmyleswhite - if you'd still like to discuss this, let me know. I'm happy to help provide a layman's view if you think it'd help.

johnmyleswhite commented 9 years ago

I'm a little overwhelmed this week as I head to France on Sunday. Let's cycle back in late July.

sbromberger commented 9 years ago

No worries - safe travels!

sbromberger commented 9 years ago

Just wanted to ping now that we're headed into August - if this is something you'd still like to work on, I'm available. If you've got other priorities, I understand.

johnmyleswhite commented 9 years ago

I think I'm going to pass on doing this since I'd rather focus on finishing the code itself.

sbromberger commented 9 years ago

Yup, no worries. I'm looking forward to (ab)using this in LightGraphs when it's ready :)