vitessio / arewefastyet

Automated Benchmarking System for Vitess
https://benchmark.vitess.io
Apache License 2.0

Implementation of statistical analysis for all macro-benchmarks #517

Closed · frouioui closed this 3 months ago

frouioui commented 3 months ago

Problem

This Pull Request is the first (big) step toward fixing a long-standing issue in arewefastyet: reliability. Since the project was created, we ran lengthy benchmarks (25-35 minutes) and ran them only once per config (SHA + workload + benchmark reason), leaving us with potentially impaired results and no way to distinguish noise from actual changes. It was common for arewefastyet to observe a ±1.5% variation in results for the exact same benchmark, which is not acceptable in the long run. More information on the problem is in https://github.com/vitessio/arewefastyet/issues/515.

Proposed change

This Pull Request changes the entire benchmarking process, from when we spawn new benchmarks in the queue to when we render results on the UI. Along the way I did some minor refactoring and removed old and/or unused pieces of code, such as: the notification system (it was broken and will be re-implemented later), some old tests, HTTP argument names, etc.

The main change in this Pull Request is that we now run the exact same benchmark MaximumBenchmarkWithSameConfig times (a constant currently set to 10), and we use statistical analysis to observe and compare all of our results.

Queue

When the CRON handlers spawn new benchmarks, they are added to the queue. Before adding a benchmark, we multiply it by MaximumBenchmarkWithSameConfig minus the number of identical benchmarks that are already being executed, already in the queue, or already stored in the database as finished.
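Below is a minimal, hypothetical sketch of that multiplication, not the actual arewefastyet code; the `benchmarkConfig` type, the `copiesToEnqueue` helper, and the `alreadyKnown` count are assumptions made for illustration.

```go
// maximumBenchmarkWithSameConfig mirrors the constant mentioned above.
const maximumBenchmarkWithSameConfig = 10

// benchmarkConfig is a hypothetical stand-in for a benchmark's identity
// (SHA + workload + benchmark reason).
type benchmarkConfig struct {
	SHA      string
	Workload string
	Reason   string
}

// copiesToEnqueue returns how many copies of cfg should be added to the
// queue so that the total number of identical benchmarks reaches
// maximumBenchmarkWithSameConfig. alreadyKnown is an assumed count of
// identical benchmarks that are currently executing, already queued, or
// stored in the database as finished.
func copiesToEnqueue(cfg benchmarkConfig, alreadyKnown int) int {
	missing := maximumBenchmarkWithSameConfig - alreadyKnown
	if missing < 0 {
		return 0
	}
	return missing
}
```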

Execution

After these new benchmarks are added to the queue, they are consumed by the execution component. We now pass two new labels to Ansible, KeyLastIsSame and KeyNextIsSame, which let us optimize how we prepare and clean up the host machine. KeyLastIsSame is set when the previous benchmark is the same as the current one, in which case we do not have to prepare the host again, and KeyNextIsSame lets us skip the clean-up at the end of the benchmark when the next benchmark is the same. The time spent on sysbench also changed from 600 seconds to 60 seconds, allowing us to re-run a benchmark in about 2-3 minutes.
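Here is a rough sketch of how neighboring queue entries can drive those two labels. It is illustrative only: the `execution` type, the `sameConfig` helper, and the label key strings are assumptions, and the real wiring into the Ansible variables is different.

```go
// Hypothetical label keys matching the names mentioned above.
const (
	keyLastIsSame = "last_is_same"
	keyNextIsSame = "next_is_same"
)

// execution is a hypothetical queued benchmark with its Ansible labels.
type execution struct {
	SHA, Workload, Reason string
	Labels                map[string]bool
}

// sameConfig reports whether two executions share the same config.
func sameConfig(a, b execution) bool {
	return a.SHA == b.SHA && a.Workload == b.Workload && a.Reason == b.Reason
}

// setSameConfigLabels marks each queued execution with whether its neighbors
// share the same config, so the runner can skip host preparation (previous
// benchmark is the same) or host clean-up (next benchmark is the same).
func setSameConfigLabels(execs []execution) {
	for i := range execs {
		if execs[i].Labels == nil {
			execs[i].Labels = map[string]bool{}
		}
		if i > 0 && sameConfig(execs[i-1], execs[i]) {
			execs[i].Labels[keyLastIsSame] = true
		}
		if i+1 < len(execs) && sameConfig(execs[i], execs[i+1]) {
			execs[i].Labels[keyNextIsSame] = true
		}
	}
}
```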

Not much else changed in the execution flow.

Results

The API has changed quite a lot. We used to fetch all the results for a config and compute the median over everything we found. Instead, we now get all the results for a given benchmark and perform simple operations to derive statistics from them: the median, the high, the low, and a 95% confidence interval. If we are just looking at the results of one benchmark, we return those stats. If we want to compare two benchmarks, we take the two samples and their stats and perform a statistical comparison using the Mann-Whitney U test. From this we get a delta in %, a P value, and an Alpha value, which tell us whether we can make a statistical inference from the two samples. We rely on some of the functions used by benchmath for this whole process.
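For reference, here is a small, self-contained sketch of how golang.org/x/perf/benchmath can produce these statistics. It is not the exact code in this Pull Request; the sample values are made up, and the delta is computed from the two summary centers for illustration.

```go
package main

import (
	"fmt"

	"golang.org/x/perf/benchmath"
)

func main() {
	// Two hypothetical samples of the same metric (e.g. queries per second)
	// for an old and a new commit, 10 runs each.
	oldRuns := []float64{1510, 1495, 1502, 1488, 1520, 1499, 1505, 1493, 1511, 1500}
	newRuns := []float64{1532, 1540, 1528, 1545, 1536, 1529, 1550, 1538, 1541, 1533}

	thresholds := &benchmath.DefaultThresholds
	oldSample := benchmath.NewSample(oldRuns, thresholds)
	newSample := benchmath.NewSample(newRuns, thresholds)

	// Under AssumeNothing, Summary reports the median as the center along
	// with a confidence interval, here at 95%.
	oldSum := benchmath.AssumeNothing.Summary(oldSample, 0.95)
	newSum := benchmath.AssumeNothing.Summary(newSample, 0.95)

	// Compare runs a Mann-Whitney U test and reports the P value together
	// with the Alpha threshold it was tested against.
	cmp := benchmath.AssumeNothing.Compare(oldSample, newSample)

	// Delta in % between the two centers, as displayed on the website.
	delta := (newSum.Center - oldSum.Center) / oldSum.Center * 100

	fmt.Printf("old: %.1f [%.1f, %.1f]\n", oldSum.Center, oldSum.Lo, oldSum.Hi)
	fmt.Printf("new: %.1f [%.1f, %.1f]\n", newSum.Center, newSum.Lo, newSum.Hi)
	fmt.Printf("delta: %+.2f%%  p=%.4f  alpha=%.2f\n", delta, cmp.P, cmp.Alpha)
}
```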

Website

The website has been changed to reflect all the new measurements and changes done to the API. Here is what the new comparison table looks like when comparing v19.0.1 and a recent commit on Vitess' main:

[Screenshot: new comparison table comparing v19.0.1 against a recent commit on Vitess' main]

Related Issue

Resolves https://github.com/vitessio/arewefastyet/issues/515