vitessio / arewefastyet

Automated Benchmarking System for Vitess
https://benchmark.vitess.io
Apache License 2.0

Implement Statistical Analysis for all Benchmarks #515

Closed frouioui closed 6 months ago

frouioui commented 7 months ago

Description

To improve the reliability of our benchmarks and the trust we have in them, we should implement proper statistical analysis of our different benchmarking results. This method will allow us to aggregate the results of multiple benchmark runs for the same commit and config to provide more accurate results and comparisons.

To highlight the lack of reliability of our benchmarks, here is a screenshot of 4 results (one per row) for the exact same benchmark (same Vitess commit, version of arewefastyet, workload, flags, etc.). We can see a notable ~1.5% variation in total_qps between the first and the second benchmark. image

Current implementation

Currently arewefastyet benchmarks a commit only once per workload (TPC-C, OLTP, etc.). This run usually lasts about 23 minutes, including the setup and teardown of the host server. The result of this run is then stored in the database and used for comparisons with other commits/tags.

When a benchmark starts we prepare the host machine, which takes about 6 minutes: we clean up everything (disk, processes, Vitess cluster, data, etc.), we install and build the binaries (Vitess + arewefastyet), and we set up the Vitess cluster. The benchmarking workload then runs for approximately 12 minutes (preparation, warm up, benchmark). Finally, we clean up the host server by repeating some of the steps from the preparation phase (clean up of disk, processes, Vitess cluster, etc.); this step takes about 5 minutes to complete.
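
For reference, a rough sketch of the timing budget above; the step names and durations come from the description, the code itself is purely illustrative:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Approximate phases of a single benchmark run, as described above.
	steps := []struct {
		name     string
		duration time.Duration
	}{
		{"prepare host (clean up, install, build, cluster setup)", 6 * time.Minute},
		{"workload (preparation, warm up, benchmark)", 12 * time.Minute},
		{"clean up host", 5 * time.Minute},
	}

	var total time.Duration
	for _, s := range steps {
		total += s.duration
		fmt.Printf("%-55s %v\n", s.name, s.duration)
	}
	// Roughly 23 minutes per run, dominated by setup and teardown.
	fmt.Printf("total per run: %v\n", total)
}
```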

To compare two benchmarks we currently fetch all the results for a given configuration (commit + workload + reason for the benchmark) and average them, then do the same for the benchmark we want to compare against. Once we have our two averaged results, we calculate the % difference between the first and the second, which gives the results visible on https://benchmark.vitess.io/macro?ltag=18.0.2&rtag=main.
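
A minimal sketch of this comparison logic; the Result type, field names, and sample numbers are hypothetical, not arewefastyet's actual code:

```go
package main

import "fmt"

// Result holds one stored benchmark result (hypothetical type).
type Result struct {
	TotalQPS float64
}

// average aggregates every stored result for one (commit, workload, reason)
// configuration into a single number.
func average(results []Result) float64 {
	if len(results) == 0 {
		return 0
	}
	var sum float64
	for _, r := range results {
		sum += r.TotalQPS
	}
	return sum / float64(len(results))
}

// percentDiff is the relative difference between two averaged results.
func percentDiff(left, right float64) float64 {
	if left == 0 {
		return 0
	}
	return (right - left) / left * 100
}

func main() {
	left := average([]Result{{TotalQPS: 7150}, {TotalQPS: 7042}})  // e.g. ltag=18.0.2
	right := average([]Result{{TotalQPS: 7101}, {TotalQPS: 7189}}) // e.g. rtag=main
	fmt.Printf("%.2f%% difference in total_qps\n", percentDiff(left, right))
}
```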

Desired implementation

In order to do proper statistical analysis we must run the same benchmark for the same commit and workload multiple times. Running it 10 times would provide a good enough p-value. However, running the same benchmark 10 times would take ~23 min * 10 ≈ 3h50m, which is not even worth considering given that on a normal day we can run more than 20 different benchmarks.

While the time taken to run a benchmark is the main challenge here, to adequately compare and use the results of a given benchmark (across the 10 runs) we will also have to modify the comparison and fetch logic of arewefastyet. The comparison implementation detailed in the previous section works today, but it will not allow us to perform statistical inference. We can use the benchmath package, maintained by the Go team in golang.org/x/perf (https://pkg.go.dev/golang.org/x/perf/benchmath), to compare the results of two sets of benchmark runs.
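
A minimal sketch of what a benchmath-based comparison could look like, assuming we collect one total_qps observation per run; the sample values below are made up:

```go
package main

import (
	"fmt"

	"golang.org/x/perf/benchmath"
)

func main() {
	// Ten total_qps observations per configuration (illustrative values).
	leftRuns := []float64{7010, 7055, 6998, 7102, 7043, 7061, 6989, 7075, 7030, 7049}
	rightRuns := []float64{7120, 7180, 7095, 7210, 7154, 7133, 7167, 7101, 7188, 7142}

	left := benchmath.NewSample(leftRuns, &benchmath.DefaultThresholds)
	right := benchmath.NewSample(rightRuns, &benchmath.DefaultThresholds)

	// AssumeNothing uses a non-parametric test (Mann-Whitney U), which is a
	// safe default for benchmark measurements.
	dist := benchmath.AssumeNothing
	leftSummary := dist.Summary(left, 0.95)
	rightSummary := dist.Summary(right, 0.95)
	cmp := dist.Compare(left, right)

	fmt.Printf("left median: %.0f qps, right median: %.0f qps\n",
		leftSummary.Center, rightSummary.Center)
	fmt.Printf("p-value: %.4f (significant at alpha=%.2f: %v)\n",
		cmp.P, cmp.Alpha, cmp.P < cmp.Alpha)
}
```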

Solution 1: Optimize our Ansible playbooks and avoid cleaning when not required

Last month I started working on removing several useless steps from our Ansible playbooks and on adding a system that skips certain steps of the setup and cleaning when not required. For instance, not re-installing and re-building Vitess binaries if the previous benchmark was using the same configuration.

Here is the list of PRs:

I ended up reverting these changes in https://github.com/vitessio/arewefastyet/pull/512 as they were hurting the stability of our benchmarks, and because the time gained from those optimizations was minor and would not have let us run the benchmark enough times to get a good p-value within a reasonable amount of time.

| Step | Time before | Time after |
|---|---|---|
| Prepare | 6 min | 4 min |
| Clean up | 5 min | 5 min |

Solution 2: Re-use the same Vitess cluster

As observed in my experimentation with Solution 1, setting up and tearing down the Vitess cluster is what takes the longest. We first set up etcd, then vtctld, then 4 vttablets + their 4 underlying MySQL instances, then 3 vtgates, and finally we apply the VSchema and other settings. This whole process takes a while as we must wait on different sockets and processes to initialize.

I think we could create the Vitess cluster only once, for the first of the 10 benchmarks, and then skip the clean up and setup phases until the end of the 10th benchmark. The only things we will have to clean up between each run are the data in the tables (which I think is handled by our workloads' schemas, with DROP TABLE at the beginning), and we would probably also need to stop all connections and prune the cache.
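
A rough sketch of what the per-run reset could look like when the cluster is reused; the DSN, database name, table name, and cleanup statements are assumptions for illustration, not what arewefastyet does today:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

// resetBetweenRuns drops the per-run data so the next run starts from a clean
// slate without tearing down the whole Vitess cluster.
func resetBetweenRuns(db *sql.DB) error {
	// The workload's schema typically starts with DROP TABLE statements, so
	// this may already be covered; shown here for illustration only.
	stmts := []string{
		"DROP TABLE IF EXISTS sbtest1", // hypothetical workload table
	}
	for _, stmt := range stmts {
		if _, err := db.Exec(stmt); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// Illustrative DSN pointing at vtgate's MySQL protocol port.
	db, err := sql.Open("mysql", "root@tcp(127.0.0.1:15306)/commerce")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	if err := resetBetweenRuns(db); err != nil {
		log.Fatal(err)
	}
}
```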