Paired Benchmarks - Githubissues

nvzqz / divan

Fast and simple benchmarking for Rust projects

https://nikolaivazquez.com/blog/divan/

Apache License 2.0

849 stars 24 forks source link

Paired Benchmarks #18

Open nvzqz opened 8 months ago

nvzqz commented 8 months ago

Paired benchmarking spreads measurement noise across benchmarks. It is used in Tango.

The general algorithm for measuring performance using paired benchmarking is as follows:

Prepare the same input data for both the baseline and candidate algorithms.

Execute the baseline algorithm and measure its execution time.

Execute the candidate algorithm and measure its execution time.

Record the difference in runtime between the baseline and candidate algorithms.

These steps constitute a single run, which serves as a distinct benchmark observation. Multiple runs are then performed, with each subsequent run employing a new payload and a randomized order in which the baseline and candidate functions are executed. Randomization is necessary to account for different CPU states and cache effects.

The advantages of using paired benchmarking include:

The difference metric is less susceptible to ephemeral biases in the system, as common-mode biases are eliminated.

Differences in means typically have lower variance since both algorithms are tested on the same input.

It becomes easier to identify and eliminate outliers that are not caused by algorithm changes, thereby enhancing test sensitivity.

@bazhenov, I would love to collaborate on how this approach would look like in Divan. 🙂

mert-kurttutan commented 7 months ago

Hi, this feature can be really fruitful for a project of mine. But, I am also interested comparing more than 2 versions of of the same function. Just putting some ideas out there, is there possibility of generalizing paired benchmarks to N-way benchmarks, where N is the number of versions of the function to compare?

bazhenov commented 7 months ago

is there possibility of generalizing paired benchmarks to N-way benchmarks

It is possible to build N-1 pairs measurements if it is solves your problem. Which leads me to the following:

But, I am also interested comparing more than 2 versions of of the same function.

It's an interesting use case. Can you elaborate more on the task your are trying to solve?

mert-kurttutan commented 7 months ago

My case: 1 rust version (= my implementation) + 2 C versions (with different algorithms) of a performance critical function. I am also measuring the scaling behaviour of the performance based on the input size.

In the case of N-1 pair it seems OK, but it also means running the inner versions twice as much. Let N_0, N_1, ..., N_n be different implementations of the same function, here the inner versions would be N_1,... N_{n-1}

nvzqz commented 7 months ago

I'm convinced that currently the best way to implement this today would be to abuse async/await to get coroutines so that we can swap out the user's stack between benchmarks. This could allow for Bencher::bench to not require a 'static lifetime, whereas boxing and storing Box<dyn Fn() -> impl Future> would require a 'static bound on the closure. I think this is feasible but unfortunately requires rewriting core components spanning from the Divan entry point through the Benchers passed to each benchmark.

But, I am also interested comparing more than 2 versions of of the same function.

I don't see why we would be limited to 2 benchmarks. We could theoretically interleave an arbitrary number of benchmark runs.

bazhenov commented 5 months ago

Sorry for the late reply. After several months of experience with different approaches in tango, I think I have more or less a complete approach to paired testing.

The way it works is the following. Each tango benchmark executable is a mixed object (binary/shared library). The binary part contains the tango benchmark runner, and the shared library part (FFI) contains the benchmark registry. Therefore runner can load benchmarks from self as well as any other tango binary. This way several benchmarks could be run simultaneously.

This IMO solves a lot of different problems:

this approach allows us to test the difference in performance not only between different algorithms in the same code but also between different versions of the same algorithm. It is very useful for performance regression testing, which is, I believe, by far the most prevalent version of performance testing
it is relatively easy to integrate into CI pipelines because benchmarks can be built independently from different VCS branches
testing the effect of non-code factors on the performance: compiler/linker options, different compiler versions, etc.
inter-language tests – binaries are not required to be written in the same language as long as they implement the same SPI in a compatible way
it allows to use of the same binary to perform both classical (pointwise) and paired testing. For classical tests, it's enough to have a single binary

Untitled Diagram drawio