mirage / bechamel

Agnostic benchmark in OCaml (proof-of-concept)
MIT License

Fix the bootstrap calculation for the OLS analysis (see #42) #45

Closed · dinosaure closed this 10 months ago

edwintorok commented 10 months ago

Tested this on https://github.com/xapi-project/stdext with the following change:

diff --git a/lib/xapi-stdext-encodings/bench/bechamel_simple_cli.ml b/lib/xapi-stdext-encodings/bench/bechamel_simple_cli.ml
index d4e58e5..424f2f6 100644
--- a/lib/xapi-stdext-encodings/bench/bechamel_simple_cli.ml
+++ b/lib/xapi-stdext-encodings/bench/bechamel_simple_cli.ml
@@ -10,7 +10,7 @@ let benchmark tests =

 let analyze raw_results =
   let ols =
-    Analyze.ols ~r_square:true ~bootstrap:0  ~predictors:[|Measure.run|]
+    Analyze.ols ~r_square:true ~bootstrap:100  ~predictors:[|Measure.run|]
   in
   let results =
     List.map (fun instance -> Analyze.all ols instance raw_results) instances in
@@ -33,6 +33,7 @@ let cli tests =
   let () =
   Hashtbl.find results (Measure.label Instance.monotonic_clock)
   |> Hashtbl.iter @@ fun name result ->
+  Format.eprintf "result: %a@." Analyze.OLS.pp result;
   try
       (* this relies on extracting input size from test name,

And then ran opam pin add bechamel.0.3.0 https://github.com/mirage/bechamel.git#fix-confidence-ols and dune exec --profile=release bench/bench_encodings.exe:

Running benchmarks
result: { monotonic-clock per run = 8599.726949 (confidence: 8784.176427 to 8425.369278);
          r² = Some 0.950197 }
Encodings.validate/UTF8_XML:10000 = 1109.0 MiB/s
result: { monotonic-clock per run = 31.643024 (confidence: 32.521986 to 30.855090);
          r² = Some 0.974618 }
Encodings.validate/UTF8_XML:10 = 301.4 MiB/s
result: { monotonic-clock per run = 817.815610 (confidence: 834.109902 to 799.807154);
          r² = Some 0.969459 }
Encodings.validate/UTF8_XML:1000 = 1166.1 MiB/s
╭─────────────────────────────────────┬───────────────────────────┬───────────────────────────┬───────────────────────────╮
│name                                 │  major-allocated          │  minor-allocated          │  monotonic-clock          │
├─────────────────────────────────────┼───────────────────────────┼───────────────────────────┼───────────────────────────┤
│  Encodings.validate/UTF8_XML:10     │             0.0000 mjw/run│             0.0036 mnw/run│             31.6430 ns/run│
│  Encodings.validate/UTF8_XML:1000   │             0.0000 mjw/run│             0.0288 mnw/run│            817.8156 ns/run│
│  Encodings.validate/UTF8_XML:10000  │             0.0000 mjw/run│             0.1815 mnw/run│           8599.7269 ns/run│
╰─────────────────────────────────────┴───────────────────────────┴───────────────────────────┴───────────────────────────╯
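For context on what ~bootstrap:100 buys here: the confidence bounds printed above come from bootstrapping the OLS estimate. Below is a minimal, self-contained sketch of the idea only, not bechamel's actual implementation; the through-the-origin fit, the resample count, and the 2.5/97.5 percentiles are assumptions for illustration.

(* Minimal percentile-bootstrap sketch for a one-predictor OLS fit through
   the origin (time ~ beta * runs): an illustration of the idea behind
   ~bootstrap:100, not bechamel's implementation. *)

(* Least-squares slope through the origin: beta = sum(x*y) / sum(x*x). *)
let ols_slope samples =
  let sxy = List.fold_left (fun acc (x, y) -> acc +. (x *. y)) 0. samples in
  let sxx = List.fold_left (fun acc (x, _) -> acc +. (x *. x)) 0. samples in
  sxy /. sxx

(* Resample the measurements with replacement. *)
let resample samples =
  let arr = Array.of_list samples in
  let n = Array.length arr in
  List.init n (fun _ -> arr.(Random.int n))

(* Bootstrap the slope [trials] times and take the 2.5th and 97.5th
   percentiles as the confidence bounds. *)
let bootstrap_ci ?(trials = 100) samples =
  let slopes = Array.init trials (fun _ -> ols_slope (resample samples)) in
  Array.sort compare slopes;
  let pct p = slopes.(int_of_float (p *. float_of_int (trials - 1))) in
  (ols_slope samples, pct 0.025, pct 0.975)

let () =
  Random.self_init ();
  (* Fake (runs, nanoseconds) measurements around 30 ns/run. *)
  let samples =
    List.init 200 (fun i ->
        let runs = float_of_int (i + 1) in
        (runs, (30. +. Random.float 2. -. 1.) *. runs))
  in
  let est, lo, hi = bootstrap_ci samples in
  Printf.printf "ns per run = %f (confidence: %f to %f)\n" est lo hi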

However, running it again produces values outside the 95% CI of the previous run:

Running benchmarks
result: { monotonic-clock per run = 8065.231289 (confidence: 8300.807405 to 7878.781447);
          r² = Some 0.957846 }
Encodings.validate/UTF8_XML:10000 = 1182.5 MiB/s
result: { monotonic-clock per run = 29.007161 (confidence: 30.207457 to 28.080228);
          r² = Some 0.961252 }
Encodings.validate/UTF8_XML:10 = 328.8 MiB/s
result: { monotonic-clock per run = 799.910095 (confidence: 817.364304 to 776.586109);
          r² = Some 0.962951 }
Encodings.validate/UTF8_XML:1000 = 1192.2 MiB/s
╭─────────────────────────────────────┬───────────────────────────┬───────────────────────────┬───────────────────────────╮
│name                                 │  major-allocated          │  minor-allocated          │  monotonic-clock          │
├─────────────────────────────────────┼───────────────────────────┼───────────────────────────┼───────────────────────────┤
│  Encodings.validate/UTF8_XML:10     │             0.0000 mjw/run│             0.0035 mnw/run│             29.0072 ns/run│
│  Encodings.validate/UTF8_XML:1000   │             0.0000 mjw/run│             0.0288 mnw/run│            799.9101 ns/run│
│  Encodings.validate/UTF8_XML:10000  │             0.0000 mjw/run│             0.1746 mnw/run│           8065.2313 ns/run│
╰─────────────────────────────────────┴───────────────────────────┴───────────────────────────┴───────────────────────────╯

Now, that may not be bechamel's fault: my system may genuinely introduce variance in one run that doesn't exist in the other, so I'll take a closer look at that (at least the numbers are plausibly close, and some intervals even overlap). I'll probably extract the raw numbers and do some side-by-side violin plots to figure out what is going on.
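(For reference, a quick way to sanity-check the two runs against each other is to test whether each new estimate falls inside the previous run's interval, and whether the intervals overlap at all. A small sketch, with the bounds hard-coded from the monotonic-clock results printed above; the record type and helpers are just for illustration.)

(* Does the second run's point estimate fall inside the first run's 95% CI,
   and do the two intervals overlap at all? Numbers copied from the
   "monotonic-clock per run" results of the two runs above. *)
type ci = { estimate : float; lo : float; hi : float }

let inside a x = a.lo <= x && x <= a.hi
let overlaps a b = a.lo <= b.hi && b.lo <= a.hi

let () =
  let runs =
    [ ("UTF8_XML:10000",
       { estimate = 8599.726949; lo = 8425.369278; hi = 8784.176427 },
       { estimate = 8065.231289; lo = 7878.781447; hi = 8300.807405 });
      ("UTF8_XML:10",
       { estimate = 31.643024; lo = 30.855090; hi = 32.521986 },
       { estimate = 29.007161; lo = 28.080228; hi = 30.207457 });
      ("UTF8_XML:1000",
       { estimate = 817.815610; lo = 799.807154; hi = 834.109902 },
       { estimate = 799.910095; lo = 776.586109; hi = 817.364304 }) ]
  in
  List.iter
    (fun (name, first, second) ->
      Printf.printf "%s: second estimate inside first CI = %b, intervals overlap = %b\n"
        name (inside first second.estimate) (overlaps first second))
    runs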

dinosaure commented 10 months ago

> I think the PR is fine, can't spot anything wrong with it. Benchmark isolation and stability are topics to be solved outside of bechamel.

Thanks for your review, I will cut a release as soon as I can :+1:. And yes, isolation is the main issue with bechamel (but outside its scope).

edwintorok commented 10 months ago

I did some more tests at home; similarly, I get values outside the 95% CI when repeating the benchmark, even with the exact same binary and following most of the settings at https://llvm.org/docs/Benchmarking.html. This can be seen even with the built-in 'fact' benchmark: sometimes 'factorial functional 50' and 'factorial imperative 100' have nearly identical values, and sometimes one surpasses the other, but not always the same one, and their KDEs look different too (i.e. the calculation appears to be correct; the values are genuinely different from run to run, clustered around the central value). Stabilizing the GC or not doesn't seem to have an effect. I tried 'core_bench' with similar results (using +time -ci-absolute), so whatever the problem is, it affects not only bechamel.

The good news is that I can now easily get this data from bechamel and work on reducing noise in my measurement environment, or try to track down the source of the non-determinism! (Perhaps by rerunning the benchmark several times and merging results across those whole-program reruns; otherwise even a 99% CI doesn't seem to help, as there is always a small margin by which the next measurement escapes.)
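A rough sketch of that last idea, pooling the per-run estimates from several independent whole-program reruns and reporting their spread instead of trusting a single run's bootstrap interval (the estimates are fed in by hand here; in practice they would be parsed from each rerun's output):

(* Treat each whole-program rerun as one sample of the per-run estimate and
   report mean +/- standard deviation across reruns. Illustration only. *)
let mean xs = List.fold_left ( +. ) 0. xs /. float_of_int (List.length xs)

let stddev xs =
  let m = mean xs in
  let var =
    List.fold_left (fun acc x -> acc +. ((x -. m) ** 2.)) 0. xs
    /. float_of_int (List.length xs - 1)
  in
  sqrt var

let () =
  (* e.g. the two UTF8_XML:10000 estimates from the runs above, plus however
     many further reruns one collects. *)
  let estimates = [ 8599.726949; 8065.231289 ] in
  Printf.printf "across %d reruns: %.2f ns/run +/- %.2f\n"
    (List.length estimates) (mean estimates) (stddev estimates)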