Closed sbromberger closed 9 years ago
This is how things are supposed to work. To see why, consider the classical CLT-derived confidence interval: CI = [mean(x) - 2 * sem(x), mean(x) + 2 * sem(x)]. There is no reason to think that mean(x) - 2 * sem(x) should be above minimum(x). If you impose that constraint, your inferences will be inaccurate, because the frequentist guarantees you depend upon don't refer to the minimum anywhere.
Here's a simple simulation to convince you that using the minimum would lead to wrong conclusions:
using Distributions
using Random
using StatsBase

const d = Normal(0, 1)
const n = 2

# Standard CLT-based 95% confidence interval for the mean.
function normal_ci(x)
    (
        mean(x) - 1.96 * sem(x),
        mean(x) + 1.96 * sem(x),
    )
end

# The same interval, but truncated to [minimum(x), maximum(x)].
function quantile_ci(x)
    (
        max(mean(x) - 1.96 * sem(x), minimum(x)),
        min(mean(x) + 1.96 * sem(x), maximum(x)),
    )
end

# Fraction of simulations in which the interval covers the true mean.
function simulate(get_ci)
    success = 0
    x = Vector{Float64}(undef, n)
    for sim in 1:10_000
        rand!(d, x)
        lower, upper = get_ci(x)
        if lower <= mean(d) <= upper
            success += 1
        end
    end
    success / 10_000
end

simulate(normal_ci)
simulate(quantile_ci)
In this case, I get 70% coverage for the classical CI and 50% coverage for the CI that never goes below the minimum.
If you spend some time thinking about this, I'm pretty sure you'll convince yourself that you can't use the minimum (which is itself being estimated from data) to provide a reliable lower bound on the distribution's behavior.
I understand how the CI works, though I'm not sure what value the CI is providing in this case, aside from some insight into the standard deviation.
Perhaps it's because I've got the real-world results for the benchmark that I don't need a (hypothetical) statistical analysis? What would I use it for? (I guess I'd rather have min/avg/max/stddev of the actual benchmark runtimes. This is probably because I'm more used to analyzing ping
output than delving into advanced stats. :) )
What are the real-world results you're referring to?
The CI's measure your uncertainty in the results of the benchmarks. This is important if you want to compare two pieces of code. Comparing the means without using CI's is very likely to lead to inaccurate conclusions.
For example, the following timing information provided by the benchmark output is immediately useful to me (the memory allocation info is also useful but outside the scope of this discussion):
Average elapsed time: 15.99 ns: this tells me how long, on average, this function took to run on my machine.
Minimum elapsed time: 15.96 ns: this is good information to have because it represents a "best-case" scenario, especially if I'm using random input to the function.
Number of samples: 6201: this is good info because the higher the number, the more I can "trust" the summary results. (stddev would be great here too.)
I anticipate using the benchmarks to calculate the effect of changing pieces of my LightGraphs code that are "fundamental" - that is, they're core functions that are used throughout the rest of my module. So if I change the way I calculate the out_edges of a given vertex (from list comprehension to map, for example), I'd be able to see what sort of effect that has on functions that use out_edges (like Dijkstra and, therefore, betweenness centrality).
If I see that the change results in a significant increase in the average run time of the derived functions, then I know it's probably a bad change. If I see that the average is pretty much the same but the minimum elapsed time has increased, I'd want to look at whether or not I can short-circuit somewhere.
I think you're underestimating how hard it is to draw valid inferences from benchmark data. It's worth noting several interpretive issues with your definitions of the quantities being reported.
Average elapsed time: 15.99 ns: this tells me how long, on average, this function took to run on my machine.
Unfortunately this isn't a valid interpretation of this quantity. The sample average isn't the population average: it's our best attempt to estimate how long the function takes to run. Because it's merely a best-effort estimate, the sample average is often severely inaccurate. Having this number be incorrect by 2x or 3x is very common. If you use means without CI's for your benchmarking, you should assume that as many as 50% of your small-scale optimizations are actually regressions that you've simply measured incorrectly.
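To make this concrete, here's a Python sketch of the same point. The timing distribution below is invented for illustration (a mildly heavy-tailed baseline plus occasional slow outliers, loosely mimicking benchmark noise); it is not measured from any real benchmark. With only a handful of samples, the sample mean regularly lands far from the true mean:

```python
import random
import statistics

random.seed(0)

# Hypothetical timing distribution: a heavy-tailed baseline plus a 5% chance
# of a large outlier (e.g. a GC pause or background process interfering).
def sample_timing():
    t = random.lognormvariate(0, 0.5)
    if random.random() < 0.05:
        t += random.uniform(5, 10)
    return t

# Approximate the "true" mean with a very large sample.
true_mean = statistics.mean(sample_timing() for _ in range(200_000))

# Count how often a 5-sample mean is off by more than 50%.
n, trials = 5, 2_000
off_by_half = sum(
    abs(statistics.mean(sample_timing() for _ in range(n)) - true_mean)
    / true_mean > 0.5
    for _ in range(trials)
)

print(f"true mean ~ {true_mean:.2f}")
print(f"fraction of {n}-sample means off by >50%: {off_by_half / trials:.0%}")
```

A single outlier in a tiny sample drags the mean to roughly double the true value, which is why small-sample means are so untrustworthy without an accompanying interval.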
Minimum elapsed time: 15.96 ns: this is good information to have because it represents a "best-case" scenario, especially if I'm using random input to the function.
Again, the sample minimum isn't the population minimum. Imagine a world in which a background process runs for 10 seconds every 30 seconds. If you run your 10 second long benchmark while this background process is operating, your minimum will be arbitrarily far removed from the true minimum -- which only occurs when the background process has stopped. We can try to account for this by reporting CI's, but the numbers we report are always going to be overestimates -- and might be overestimates by a factor of 100x or higher. Using these kinds of numbers without acknowledging how much uncertainty we have is very risky.
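A small Python sketch of that scenario (the timing numbers and the background-process model are made up for illustration): if every measurement in the benchmark window is inflated by interference, the observed minimum can sit arbitrarily far above the true minimum:

```python
import random

random.seed(1)

# Hypothetical model: the function truly takes 10 time units, but while a
# background process is running, every measurement is inflated by 20-50 units.
TRUE_MIN = 10.0

def timed_run(background_busy):
    noise = random.uniform(0, 1)  # ordinary measurement jitter
    interference = random.uniform(20, 50) if background_busy else 0.0
    return TRUE_MIN + noise + interference

# The benchmark window happens to overlap the background process entirely.
samples = [timed_run(background_busy=True) for _ in range(1_000)]
observed_min = min(samples)

print(f"true minimum:     {TRUE_MIN}")
print(f"observed minimum: {observed_min:.1f}")
```

Every sample here is at least 3x the true minimum, so no amount of extra sampling inside that window brings the observed minimum closer to the truth.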
Number of samples: 6201: this is good info because the higher the #, the more I can "trust" the summary results. (stddev would be great here too).
For sure, having more samples is better than fewer samples. In general, standard deviations are useful as well, although their usefulness depends upon mathematical assumptions that are generally false for benchmark data since the CLT doesn't typically apply. (In statistical terminology, the sampling distribution isn't well approximated by a normal distribution.)
Another problem with using samples directly: the actual number of samples is a substantial overestimate of how much you can trust your results, because the data points aren't independent of one another. (The example of a background process running during your benchmark shows why.) In a future pass, I'm going to start doing corrections for non-independence, which will make the CI's more credible.
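To illustrate the non-independence problem, here's a Python sketch using a toy AR(1) process as a stand-in for correlated benchmark noise (the correlation strength of 0.9 is arbitrary). For AR(1), a standard approximation puts the effective sample size at roughly n * (1 - rho) / (1 + rho), far below the nominal count:

```python
import random
import statistics

random.seed(2)

# AR(1) process: each sample is correlated with the previous one, as when a
# background process affects many consecutive benchmark runs.
def ar1_samples(n, rho=0.9):
    x, out = 0.0, []
    for _ in range(n):
        x = rho * x + random.gauss(0, 1)
        out.append(x)
    return out

n = 10_000
xs = ar1_samples(n)

# Estimate the lag-1 autocorrelation.
m = statistics.fmean(xs)
num = sum((xs[i] - m) * (xs[i + 1] - m) for i in range(n - 1))
den = sum((x - m) ** 2 for x in xs)
rho_hat = num / den

# Effective sample size under the AR(1) approximation.
n_eff = n * (1 - rho_hat) / (1 + rho_hat)
print(f"nominal samples: {n}, estimated rho: {rho_hat:.2f}, "
      f"effective samples: {n_eff:.0f}")
```

With rho near 0.9, ten thousand nominal samples carry the information of only a few hundred independent ones, so a CI computed as if they were independent is far too narrow.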
If I see that the change results in a significant increase in the average run time of the derived functions, then I know it's probably a bad change. If I see that the average is pretty much the same but the minimum elapsed time has increased, I'd want to look at whether or not I can short-circuit somewhere.
This is really the crux of the matter: what do we call a significant increase? My rule of thumb is this: you should never claim that you've seen a significant increase unless the upper bound of one implementation's CI is lower than the lower bound of the other implementation's CI. Comparing means without using CI's is likely to lead to the wrong conclusion -- in the worst case (and I've seen this in practice), you're wrong as often as 50% of the time.
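As an illustration of that rule of thumb, here's a Python sketch (the timing numbers are invented; mean_ci is the same CLT-based interval used earlier in the thread):

```python
import math
import random
import statistics

random.seed(3)

def mean_ci(samples):
    """95% CLT confidence interval for the mean."""
    m = statistics.fmean(samples)
    sem = statistics.stdev(samples) / math.sqrt(len(samples))
    return m - 1.96 * sem, m + 1.96 * sem

# Two hypothetical implementations with overlapping timing distributions.
old_times = [random.gauss(16.0, 0.5) for _ in range(100)]
new_times = [random.gauss(15.9, 0.5) for _ in range(100)]

old_lo, old_hi = mean_ci(old_times)
new_lo, new_hi = mean_ci(new_times)

# Only claim a difference when the intervals do not overlap at all.
if new_hi < old_lo:
    print("new is significantly faster")
elif old_hi < new_lo:
    print("new is significantly slower")
else:
    print("no clear difference -- the CIs overlap")
```

Comparing only the two sample means here would declare a winner every time, even though the underlying distributions are nearly identical; the interval comparison refuses to call it until the evidence is unambiguous.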
The sample average isn't the population average: it's our best attempt to estimate how long the function takes to run
Ah, I think this is the gap in my understanding: you're sampling the runtimes and averaging the subset, as opposed to averaging all of them?
The end result of this is that I really appreciate your explanations, and I'd like to suggest that you (or someone with the requisite knowledge of what the stats imply) put together a "Benchmarks.jl for Dummies" guide on how laymen should interpret the results to effect real-world change to their code. My initial (and probably current) understanding of the numbers is dangerously wrong, and I'd imagine that others who do not have a stats background (or who don't understand the approach you've taken here) might be similarly confused.
Ah, I think this is the gap in my understanding: you're sampling the runtimes and averaging the subset, as opposed to averaging all of them?
No, the sampling is the typical philosophical conceit in data analysis, in which we pretend that our data set is like a random sample of the true population. For a benchmark, the true population is the results of our benchmark if run forever on a static system. We usually run the benchmark for 10s, which is, clearly, a lot shorter than forever. So we do have a sample, but, unlike a clean applied stats use case, our sample isn't random (and therefore doesn't generalize correctly).
The end result of this is that I really appreciate your explanations, and I'd like to suggest that you (or someone with the requisite knowledge of what the stats imply) put together a "Benchmarks.jl for Dummies" guide on how laymen should interpret the results to effect real-world change to their code. My initial (and probably current) understanding of the numbers is dangerously wrong, and I'd imagine that others who do not have a stats background (or who don't understand the approach you've taken here) might be similarly confused.
Are you going to JuliaCon? I feel like the easiest thing to do here would be to get somebody like you without a stats background up to speed on the measurement issues, then have them write up a doc in language that would make sense to them. I doubt I'm really able to explain things in language that doesn't try to emulate the jargon I'm used to.
Even better would be to choose a reporting UI that makes things clear no matter who's reading the results. Does the Criterion UI help? http://www.serpentine.com/criterion/
I'm out of country at the moment and on the west coast in any case, so no JuliaCon for me. I'm happy to be your guinea pig for the documentation, though, if you'd like to do a tele/videocon at some point.
As far as UI goes, perhaps that'd work, but what I really want for LightGraphs is to integrate this into CI testing so that if we see a regression, the tests fail. No UI needed in that case :)
Let's chat after I'm back from JuliaCon then. Building automated infra to detect perf regressions is hard, but I'd like to make it easier.
hey @johnmyleswhite - if you'd still like to discuss this, let me know. I'm happy to help provide a layman's view if you think it'd help.
I'm a little overwhelmed this week as I head to France on Sunday. Let's cycle back in late July.
No worries - safe travels!
Just wanted to ping now that we're headed into August - if this is something you'd still like to work on, I'm available. If you've got other priorities, I understand.
I think I'm going to pass on doing this since I'd rather focus on finishing the code itself.
Yup, no worries. I'm looking forward to (ab)using this in LightGraphs when it's ready :)
I ran a (long) benchmark and the lower bound of the 95% CI is less than the minimum runtime. How should I interpret that result? (Should the CI bounds be truncated?)