jfbastien opened this issue 7 years ago

It would be great for the geomean to have a confidence interval. Also, the calculation of the geomean seems off?
The value shown is the geometric mean of the average runs/sec of the individual tests. An earlier version computed the geometric mean of the times of the individual tests, which was obviously off; you had probably seen that.
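For concreteness, here's a minimal sketch of that summary computation; it is not the benchmark's actual code, and the shape of `results` (test name mapped to an array of runs/sec samples) is assumed purely for illustration.

```js
// Geometric mean of a list of positive values, computed in log space
// to avoid overflow/underflow when multiplying many factors.
function geometricMean(values) {
  const sumOfLogs = values.reduce((sum, v) => sum + Math.log(v), 0);
  return Math.exp(sumOfLogs / values.length);
}

// Summary score: geometric mean of each test's average runs/sec,
// not of its raw times. `results` maps test name -> runs/sec samples
// (hypothetical shape, for illustration only).
function summaryScore(results) {
  const averages = Object.values(results).map(
    samples => samples.reduce((a, b) => a + b, 0) / samples.length
  );
  return geometricMean(averages);
}
```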
We'll look into adding a confidence interval for the geometric mean.
The confidence intervals we compute for the individual tests are actually the relative standard error of the mean of the individual runtimes. This would make perfect sense if the individual samples were statistically independent. However, since the samples are iterations running in the same isolate, they share feedback, optimized code, and GC state, so they are not independent. In particular, the first iterations are always slower than the later ones because feedback and optimized code are still missing, so the reported error is systematically too big.
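As a sketch of the statistic described here (not the benchmark's actual implementation), the relative standard error of the mean would look roughly like this. It is only a valid error estimate when the samples are independent, which, as noted above, in-isolate iterations are not.

```js
// Relative standard error of the mean of a test's runtimes:
// stddev / sqrt(n), expressed relative to the mean
// (e.g. 0.03 means roughly +/-3%).
function relativeStandardError(samples) {
  const n = samples.length;
  const mean = samples.reduce((a, b) => a + b, 0) / n;
  // Sample variance (n - 1 in the denominator).
  const variance =
    samples.reduce((sum, x) => sum + (x - mean) ** 2, 0) / (n - 1);
  return Math.sqrt(variance / n) / mean;
}
```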
One way to fix this is to separate out the warm-up phase and measure only the steady state of the system. However, this assumes that the steady state of the VM is always the same, which is probably not the case.
Another way is to re-run the complete benchmark as independently as possible (in a new iframe/realm, or even better a new process) and compute confidence intervals only from the results of complete runs. This is what the ARES-6 runner does.
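A rough sketch of the process-isolation variant in Node (not the ARES-6 runner itself; the `./run-benchmark.js` script and its output format are hypothetical): run the whole benchmark several times in fresh processes, then derive the confidence interval from those independent per-run scores.

```js
const { execFileSync } = require('child_process');

// Run the complete benchmark `times` times, each in a fresh Node process,
// so every run starts with a cold isolate (no shared feedback, code, or GC state).
function runIsolated(times, script = './run-benchmark.js' /* hypothetical */) {
  const scores = [];
  for (let i = 0; i < times; i++) {
    const out = execFileSync(process.execPath, [script], { encoding: 'utf8' });
    scores.push(parseFloat(out)); // assumes the script prints its geomean score
  }
  return scores;
}

// 95% confidence interval of the mean, assuming roughly normal per-run scores.
function confidenceInterval95(scores) {
  const n = scores.length;
  const mean = scores.reduce((a, b) => a + b, 0) / n;
  const variance = scores.reduce((s, x) => s + (x - mean) ** 2, 0) / (n - 1);
  const halfWidth = 1.96 * Math.sqrt(variance / n);
  return { mean, low: mean - halfWidth, high: mean + halfWidth };
}
```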
My team is obviously biased towards liking the ARES-6 approach, in part for the reasons that you state.
I guess we could remove the first N samples and do the average/stddev solely on the remaining samples.
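Something like the following sketch, where the warm-up cutoff is an arbitrary illustrative value rather than anything tuned:

```js
// Drop the first few samples as warm-up, then compute mean/stddev over the rest.
const WARMUP_SAMPLES = 5; // illustrative cutoff, not a tuned value

function steadyStateStats(samples) {
  const steady = samples.slice(WARMUP_SAMPLES);
  const n = steady.length;
  const mean = steady.reduce((a, b) => a + b, 0) / n;
  const stddev = Math.sqrt(
    steady.reduce((sum, x) => sum + (x - mean) ** 2, 0) / (n - 1)
  );
  return { mean, stddev };
}
```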
An iframe/realm is not really an option here, since Node doesn't support realms, and this benchmark is primarily interesting for Node. There's only the vm module, but it does a lot of magic with the global object, which makes it unsuitable for representative performance measurement.
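As a small illustration of the global-object issue (this only scratches the surface of what vm does): code run through `vm.runInContext` sees a contextified sandbox object as its global, so global state diverges from the host process.

```js
const vm = require('vm');

// Contextify a plain object; inside the context it acts as the global object.
const sandbox = vm.createContext({ console });

vm.runInContext('globalThis.x = 42;', sandbox);

console.log(sandbox.x);           // 42: the write landed on the sandbox object
console.log(typeof globalThis.x); // 'undefined': the real global is untouched
```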
A couple of general comments, but no strong opinions:
If anything, I'd think this benchmark would only really care about the first run, because that's what the tools hit, no? We do multiple runs to get statistical significance, so maybe in a Node context you actually want to run the tests from the command line and do the aggregation there, whereas in a web or browser-shell context the ARES-6 approach of creating a new iframe/realm is appropriate?
The first run is almost completely irrelevant in practice. The common case is that you have webpack driving the build via loaders (e.g. Babel, TypeScript, PostCSS, Prepack), and those run on hundreds to thousands of files. For a realistic web application the build often takes tens of seconds to several minutes. The same goes for tools like prettier, which usually run on several files at once. All of this typically runs in a single Node process, in the same isolate.
@bmeurer ah, that's interesting! Shouldn't the benchmark try to capture this type of call graph? I'm guessing you think what you've come up with is highly correlated with that call graph? If so, then your setup makes sense.
I've experimented with different setups, including just benchmarking an actual webpack build with all the loaders needed for that project. But this usually didn't work out well, partly because of I/O and the dependency on Node, which made it impossible to run in the browser/shell. The current setup seems to match reality quite well, while mocking out any kind of I/O and providing representative versions for the browser/shell.
I'm not saying this is the only possible solution, and I'm certainly open to changes here in the future.