dinosaure closed this 10 months ago
I think the PR is fine, can't spot anything wrong with it. Benchmark isolation and stability are topics to be solved outside of bechamel.
Thanks for your review, I will cut a release as soon as I can :+1:. And yes, isolation is the main issue with bechamel (but it is outside its scope).
Did some more tests at home. Similarly, I get values outside the 95% CI when repeating the benchmark, even with the exact same binary and following most of the settings at https://llvm.org/docs/Benchmarking.html. This can be seen even with the built-in 'fact' benchmark: sometimes 'factorial functional 50' and 'factorial imperative 100' have nearly identical values, and sometimes one surpasses the other, but not always the same one. Their KDEs look different too (i.e. the calculation appears to be correct; the values are genuinely different from run to run, clustered around the central value). Stabilizing the GC or not doesn't seem to have an effect.
I tried using 'core_bench' with similar results (using +time -ci-absolute), so whatever the problem is, it affects not only bechamel.
The good news is that I can now easily get this data from bechamel and work on reducing noise in my measurement environment, or try to track down the source of the non-determinism! Perhaps by rerunning the benchmark several times and merging the results from those whole-program reruns; otherwise even a 99% CI doesn't seem to help, as there is always a small margin by which the next measurement escapes.
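To make the "merge results from whole-program reruns" idea concrete, here is a minimal sketch, not something bechamel provides. It assumes each rerun has already been reduced to a single estimate (e.g. ns/run) and computes a percentile-bootstrap CI over those per-rerun estimates. The function name `bootstrap_ci` and the numbers are made up for illustration.

```python
import random
import statistics


def bootstrap_ci(estimates, confidence=0.95, resamples=5000, seed=0):
    """Percentile-bootstrap CI for the mean of per-rerun estimates.

    `estimates` holds one number per whole-program rerun, so the interval
    reflects run-to-run variation rather than within-run variation only.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(estimates)
    # Resample the reruns with replacement and record the mean each time.
    means = sorted(
        statistics.fmean(rng.choices(estimates, k=n)) for _ in range(resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    lo = means[int(alpha * resamples)]
    hi = means[min(resamples - 1, int((1.0 - alpha) * resamples))]
    return lo, hi


# Hypothetical ns/run estimates, one per rerun (made-up values).
reruns = [101.2, 99.8, 103.5, 100.4, 98.9, 102.7, 100.1, 101.9]
low, high = bootstrap_ci(reruns)
print(f"mean={statistics.fmean(reruns):.2f} ns, 95% CI=({low:.2f}, {high:.2f})")
```

With only a handful of reruns the interval is rough, but it is at least built from the variation between runs, which is the part a single-run CI cannot see.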
Tested this on https://github.com/xapi-project/stdext with the following change:
And then
opam pin add bechamel.0.3.0 https://github.com/mirage/bechamel.git\#fix-confidence-ols
and
dune exec --profile=release bench/bench_encodings.exe
However, running it again produces values that are outside the 95% CI of the previous run:
Now, that may not be the fault of bechamel: my system may genuinely introduce variance in one run that doesn't exist in the other. I'll take a closer look at that (well, at least the numbers are plausibly close, and some even overlap). I'll probably extract the raw numbers and do some side-by-side violin plots to figure out what is going on.
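Before plotting, a quick side-by-side check along these lines could show which benchmarks actually disagree between two runs. This is only a sketch: it assumes the estimate and CI bounds have been extracted by hand into dictionaries, and the benchmark names and numbers below are hypothetical, not real output.

```python
def intervals_overlap(a, b):
    """True if the closed intervals a=(lo, hi) and b=(lo, hi) share a point."""
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    return a_lo <= b_hi and b_lo <= a_hi


def compare_runs(run1, run2):
    """Compare two runs given as {name: (estimate, ci_low, ci_high)}.

    Returns {name: (escapes, disjoint)} where `escapes` means the second
    run's estimate lies outside the first run's CI, and `disjoint` means
    the two CIs do not overlap at all.
    """
    report = {}
    for name in sorted(run1.keys() & run2.keys()):
        _, lo1, hi1 = run1[name]
        est2, lo2, hi2 = run2[name]
        escapes = not (lo1 <= est2 <= hi1)
        disjoint = not intervals_overlap((lo1, hi1), (lo2, hi2))
        report[name] = (escapes, disjoint)
    return report


# Hypothetical numbers in ns/run: (estimate, ci_low, ci_high).
first = {"bench A": (100.0, 98.0, 102.0), "bench B": (50.0, 49.5, 50.5)}
second = {"bench A": (102.5, 101.0, 104.0), "bench B": (52.0, 51.6, 52.4)}
report = compare_runs(first, second)
for name, (escapes, disjoint) in report.items():
    print(f"{name}: escapes previous CI={escapes}, CIs disjoint={disjoint}")
```

An estimate that escapes the previous CI while the two CIs still overlap (like the made-up "bench A") is a milder disagreement than fully disjoint intervals (like "bench B").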