sharkdp / hyperfine

A command-line benchmarking tool
Apache License 2.0

Confidence intervals, p-values #523

Open · jamiejennings opened this issue 2 years ago

jamiejennings commented 2 years ago

First, nice job on Hyperfine!

I use it in academic research, where I'll be computing p-values to describe the confidence with which the sample data suggests that one program is faster than another. (That is: how confident are we that the mean speed of the fastest program is actually different from the mean speed of the other programs?)

Would you be open to a pull request that adds such a calculation to Hyperfine?

On a related note, I am not sure how to interpret "1.40 ± 0.03 times faster" in Hyperfine's output. From the source code, it looks like a calculation based on one standard deviation. Is this meant to say, e.g., that if the same invocation of Hyperfine were done 100 times, we would expect 68% of those outcomes to say that program A is between 1.37 and 1.43 times as fast as program B (where 68% corresponds to one sigma)? Is it a confidence interval?

Thanks in advance for any clarification you can provide!

sharkdp commented 2 years ago

> I use it in academic research, where I'll be computing p-values to describe the confidence with which the sample data suggests that one program is faster than another. (That is: how confident are we that the mean speed of the fastest program is actually different from the mean speed of the other programs?)

Please note that we have some functionality in that direction in the scripts folder (https://github.com/sharkdp/hyperfine/tree/master/scripts): both the advanced_statistics and the welch_ttest scripts could be interesting for you.
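For illustration, here is a minimal sketch (not the repository script itself) of the kind of test welch_ttest performs, assuming you exported the run times with hyperfine's --export-json and have scipy installed:

```python
# Sketch: Welch's t-test on two benchmark results. Assumes results were
# exported with, e.g.:  hyperfine 'prog_a' 'prog_b' --export-json results.json
import json

from scipy import stats  # third-party: pip install scipy

with open("results.json") as f:
    results = json.load(f)["results"]

a = results[0]["times"]  # individual run times of the first command
b = results[1]["times"]  # individual run times of the second command

# Welch's t-test does not assume equal variances in the two samples.
t, p = stats.ttest_ind(a, b, equal_var=False)
print(f"t = {t:.3f}, p = {p:.4g}")
if p < 0.05:
    print("Mean runtimes differ significantly at the 5% level.")
else:
    print("No significant difference detected at the 5% level.")
```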

> Would you be open to a pull request that adds such a calculation to Hyperfine?

Maybe we could work on those Python scripts first and then think about integrating this as core functionality in hyperfine? What do you think?

> On a related note, I am not sure how to interpret "1.40 ± 0.03 times faster" in Hyperfine's output.

Please see #443

> From the source code, it looks like a calculation based on one standard deviation. Is this meant to say, e.g., that if the same invocation of Hyperfine were done 100 times, we would expect 68% of those outcomes to say that program A is between 1.37 and 1.43 times as fast as program B (where 68% corresponds to one sigma)?

Not quite. It means that if you execute your program a large number of times, 68% of the individual runtimes would fall within this interval (assuming a normal distribution of the runtimes, which is typically not the case). It is not the standard error; that is lower by a factor of sqrt(N_benchmarks).
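To make the distinction concrete, here is a small sketch in plain Python with made-up run times, contrasting the sample standard deviation (what the ± above is based on) with the standard error of the mean:

```python
import math
import statistics

# Hypothetical run times (seconds) from a single benchmark.
times = [1.02, 0.98, 1.05, 1.01, 0.99, 1.03, 0.97, 1.04, 1.00, 1.02]

n = len(times)
mean = statistics.mean(times)
stdev = statistics.stdev(times)   # spread of the individual runs
stderr = stdev / math.sqrt(n)     # uncertainty of the mean itself

print(f"mean   = {mean:.4f} s")
print(f"stdev  = {stdev:.4f} s  (~68% of runs fall in mean ± stdev, if normal)")
print(f"stderr = {stderr:.4f} s  (lower than stdev by sqrt(n) = {math.sqrt(n):.2f})")
```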

jberdine commented 2 years ago

A motivation to include intervals and p-values in the core would be to enable performing runs until a desired p-value is reached, as an alternative to the current calculation of the number of runs.
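For concreteness, a rough sketch of what such a stopping rule could look like (hypothetical helper functions, not hyperfine code; note that naively re-checking the p-value after every run inflates the false-positive rate, so a real implementation would need a sequential-testing correction):

```python
import subprocess
import time

from scipy import stats  # third-party: pip install scipy


def run_once(cmd: str) -> float:
    """Wall-clock time of one invocation; a stand-in for hyperfine's own timing."""
    start = time.perf_counter()
    subprocess.run(cmd, shell=True, check=True, capture_output=True)
    return time.perf_counter() - start


def benchmark_until(cmd_a: str, cmd_b: str, alpha: float = 0.01,
                    min_runs: int = 10, max_runs: int = 200):
    """Sample both commands until Welch's t-test reaches alpha or max_runs is hit."""
    a, b, p = [], [], float("nan")
    for i in range(max_runs):
        a.append(run_once(cmd_a))
        b.append(run_once(cmd_b))
        if i + 1 >= min_runs:
            _, p = stats.ttest_ind(a, b, equal_var=False)
            if p < alpha:
                break
    return a, b, p
```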

sharkdp commented 2 years ago

@jamiejennings any update on this?