Dynamic number of benchmark repetitions

ggwpez commented 1 year ago

Problem

The benchmark pallet command is invoked with --steps and --repeat arguments which decide how often a benchmark should be repeated.
This can cause issues for very short benchmarks since the time-frame on the machine is too short to get constent performance.
It especially applies to VMs which can have CPU steal and other kind of short-lived disturbances, which do not impair the average performance but destroy the bench results.

Proposed Solution

Instead of running each bench with a fixed number of repetitions, we could look to criterion for more statistical driven results.
For example running the benchmark until we get consistent average results and error otherwise with "inconsistent hardware detected".

Ideally we could also get nicer prints to directly compare with past results :smile:

Benchmarking DAG 1k/1k: Collecting 100 samples in estimated 5.0000 s (133M iter)
DAG 1k/1k               time:   [37.627 ns 37.723 ns 37.834 ns]
                        change: [-29.326% -28.527% -27.771%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild
  7 (7.00%) high severe

bkchr commented 1 year ago

For example running the benchmark until we get consistent average results and error otherwise with "inconsistent hardware detected".

When we have long running benchmarks, we will have problems :P

What I just thought about for this issue, could we maybe not use criterion? Or would that be too crazy?

koute commented 1 year ago

What I just thought about for this issue, could we maybe not use criterion? Or would that be too crazy?

Unfortunately criterion doesn't seem to expose APIs which are low level enough for us to use as-is, but considering it is relatively small (~10k lines of code, most of which we shouldn't need anyway) and liberally licensed we could just copy the relevant code verbatim into a separate helper crate and use that; should be relatively straightforward to do from what I can see.

ggwpez commented 1 year ago

What I just thought about for this issue, could we maybe not use criterion? Or would that be too crazy?

Criterion requires nightly, but we could maybe work around that. Anyway for a first version I would just hack it together instead of doing a larger refactor to integrate Criterion.

paritytech / polkadot-sdk

Dynamic number of benchmark repetitions #379

Problem

Proposed Solution