ocaml-bench / sandmark

A benchmark suite for the OCaml compiler

Noise in Sandmark #198

Open kayceesrk opened 3 years ago

kayceesrk commented 3 years ago

Following the discussion in https://github.com/ocaml/ocaml/pull/9934, I set out to quantify the noise in Sandmark macro benchmark runs. Before asking complex questions about loop alignment and microarchitectural optimisations, as was done in https://github.com/ocaml/ocaml/pull/10039, I wanted to measure the noise between multiple runs of the same code. It is important to note that we currently run only a single iteration of each variant.

The benchmarking was done on the IITM "turing" machine, an Intel Xeon Gold 5120 CPU, with isolated cores, the CPU governor set to performance, hyper-threading disabled, turbo boost disabled, and interrupts and rcu_callbacks directed to non-isolated cores, but with ASLR on [1]. The results of two runs of the latest commit from https://github.com/stedolan/ocaml/tree/sweep-optimisation are here:

[image: results of the two runs]

The outlier is worrisome, but even ignoring it there are differences of up to 2% in both directions. Moving forward, we should consider the following:

  1. Arrive at a measure of statistical significance on a given machine. What is the minimum difference beyond which a result is statistically significant? This will vary based on the benchmark and the metric (running time, maxRSS).
  2. Run multiple iterations. Sandmark already has an ITER variable which runs each experiment multiple times. The notebooks need to be updated so that the mean (and standard deviation) are computed first, and the graphs need to be updated to include error bars (a sketch follows after the footnote below). The downside is that benchmarking will take significantly longer. We should choose a representative subset of the macro benchmarks for quick studies and reserve the full macro benchmark run for the final result. Can we run the sequential macro benchmarks in parallel on different isolated cores, and what would be the impact of this on the individual benchmark runs?

[1] https://github.com/ocaml-bench/ocaml_bench_scripts#notes-on-hardware-and-os-settings-for-linux-benchmarking
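
To make both items concrete, here is a minimal sketch of the notebook-side aggregation in Python. The results file, the column names, and the 2-sigma threshold are illustrative assumptions, not the actual Sandmark schema:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical per-iteration results: one row per (name, iteration) with a
# 'time_secs' column. The real Sandmark output format may differ.
df = pd.read_json("results.jsonl", lines=True)

stats = df.groupby("name")["time_secs"].agg(["mean", "std"])

# Item 1: a crude per-benchmark significance threshold -- treat a difference
# between two variants as meaningful only if it exceeds twice the observed
# run-to-run standard deviation on this machine.
stats["min_significant_diff"] = 2.0 * stats["std"]

# Item 2: plot the mean with one-standard-deviation error bars.
stats = stats.sort_values("mean")
fig, ax = plt.subplots(figsize=(8, 10))
ax.barh(stats.index, stats["mean"], xerr=stats["std"], capsize=3)
ax.set_xlabel("running time (s), mean ± std over ITER iterations")
plt.tight_layout()
plt.show()
```

A fixed 2-sigma cutoff is only a stand-in; a proper per-benchmark significance test would itself need enough iterations to estimate the variance reliably.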

kayceesrk commented 3 years ago

Now with ASLR turned off:

[image: results with ASLR off]

The noise is still around 2%.

gasche commented 3 years ago

It looks like most of the benchmarks are not actually very noisy (the observed noise is well below 1%), while a smaller group is noisier. This suggests that you could track, per benchmark, how many iterations to run and what the expected noise level is, running more iterations of the unstable benchmarks and fewer of the stable ones to keep the total running time in check. In particular, most of the long-running benchmarks are not noisy in this test, so maybe they could be run a single time; knucleotide is the only noisy benchmark taking >=10s.
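
As a purely hypothetical sketch of such a policy (none of these names or thresholds come from Sandmark), the iteration count could be derived from previously observed relative noise, capped by a per-benchmark time budget:

```python
# Hypothetical policy: stable benchmarks run once; noisier ones get more
# iterations, capped so long-running benchmarks don't dominate total time.
def iterations_for(rel_noise: float, runtime_secs: float,
                   budget_secs: float = 120.0) -> int:
    if rel_noise < 0.01:               # well below 1% noise: one run suffices
        return 1
    wanted = 5 if rel_noise < 0.02 else 10
    affordable = max(1, int(budget_secs / runtime_secs))
    return min(wanted, affordable)

# A stable long-running benchmark runs once; a noisy short one runs ten times.
assert iterations_for(0.002, 60.0) == 1
assert iterations_for(0.05, 2.0) == 10
```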

Regarding error bars: instead of error bars specific to a run (which would require several iterations per run), you could store for each benchmark its typical "noise range" (the largest difference observed in past noise-detecting runs) and display that as error bars on all future runs. This gives good visual feedback when looking at benchmark graphs, without requiring several iterations.
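
A minimal sketch of this idea, with an assumed storage format (a JSON mapping from benchmark name to its largest observed spread; none of these names exist in Sandmark):

```python
import json
import matplotlib.pyplot as plt

# Hypothetical store: {benchmark_name: largest observed spread in seconds},
# refreshed whenever a dedicated noise-detection run is performed.
with open("noise_ranges.json") as f:
    noise_range = json.load(f)

def plot_single_run(results):
    """results: hypothetical {benchmark_name: time_secs} from one iteration."""
    names = sorted(results)
    times = [results[n] for n in names]
    errs = [noise_range.get(n, 0.0) for n in names]
    fig, ax = plt.subplots()
    ax.barh(names, times, xerr=errs, capsize=3)
    ax.set_xlabel("running time (s); error bars show historical noise range")
    plt.tight_layout()
    plt.show()
```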

shakthimaan commented 2 years ago

Noise has been reported for the soli benchmark for 5.1.0+trunk. Reference: https://github.com/ocaml/ocaml/pull/11102#issuecomment-1191269903

kayceesrk commented 2 years ago

> Noise has been reported for the soli benchmark for 5.1.0+trunk. Reference: https://github.com/ocaml/ocaml/pull/11102#issuecomment-1191269903

As mentioned in the linked comment, soli's running time is too short. Either we make it run longer or we remove the macro benchmark tag; it was a mistake to have tagged it as a macro benchmark in the first place.

See https://github.com/ocaml-bench/sandmark/issues/348. I believe I've fixed a number of these to run longer.