Latency vs Throughput benchmark

Hello!

I'm creating this issue to discuss what it would take to reproduce what I did in the benchmark blog post.

Currently, every benchmark in this repository runs a single k6 load test with a given scenario. Unfortunately, that's not enough to generate the latency vs throughput graphs I produced: you need multiple benchmark runs, each with a fixed arrival rate. The tricky part is that each gateway has a different maximum throughput. Automating it would therefore require a loop that does something like this:

# pseudo code

# Warmup for a few seconds
k6_run_fixed_arrival_rate(100)

# actual benchmark
arrival_rate = 100
step = 100
while True:
    results = k6_run_fixed_arrival_rate(arrival_rate)
    measured_rate = results['metrics']['iterations']['values']['rate']
    if abs(measured_rate - arrival_rate) < 3:
        # The gateway kept up with the target rate: push it higher.
        arrival_rate += step
    else:
        # The gateway fell behind: halve the step and back off.
        step = step // 2
        if step <= 5:
            break
        arrival_rate -= step

That's roughly what I did. Manually 😢. The more difficult part is that I adjusted preAllocatedVUs between benchmark runs, but that only matters when the gateway is close to being CPU-bound. So for an automated benchmark, I would just set it to a value high enough for all gateways and leave it at that. High enough means you should never see the throughput measured by k6 fail to match the fixed arrival rate for a gateway that wasn't CPU-bound (max CPU < CPU_LIMIT).
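For reference, here is a minimal runnable sketch of the loop described above. It is only an illustration: `find_max_rate`, `run_at_rate`, and `fake_gateway` are hypothetical names, and the k6 invocation is replaced by a callable so the search logic can be tried without running any benchmark. In a real setup `run_at_rate` would launch k6 with a fixed arrival rate and return the measured iteration rate.

```python
from typing import Callable, List, Tuple


def find_max_rate(
    run_at_rate: Callable[[int], float],
    start: int = 100,
    step: int = 100,
    tolerance: float = 3.0,
    min_step: int = 5,
) -> List[Tuple[int, float]]:
    """Raise the target arrival rate until the gateway stops keeping up.

    run_at_rate(rate) performs one benchmark run and returns the measured
    rate. Every (target, measured) pair is returned, one per run, so the
    same data can later feed a latency vs throughput graph.
    """
    runs: List[Tuple[int, float]] = []
    rate = start
    while True:
        measured = run_at_rate(rate)
        runs.append((rate, measured))
        if abs(measured - rate) < tolerance:
            # Gateway kept up: try a higher rate.
            rate += step
        else:
            # Gateway fell behind: halve the step and back off.
            step //= 2
            if step <= min_step:
                break
            rate -= step
    return runs


def fake_gateway(max_rate: float) -> Callable[[int], float]:
    """Hypothetical stand-in for k6: throughput saturates at max_rate."""
    return lambda rate: min(float(rate), max_rate)


if __name__ == "__main__":
    for target, measured in find_max_rate(fake_gateway(730.0)):
        print(f"target={target:5d}/s  measured={measured:7.1f}/s")
```

Passing the runner in as a function keeps the search independent of how k6 is started, so the warmup run and the preAllocatedVUs setting can live entirely in the runner.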

The last part is something that could be done with some bash. What's particularly unclear to me is how you should generate the results table/graph. In my benchmarks, I wrote the docker stats and k6 results to JSON and processed them in a Python notebook, available here. I didn't clean it up, but if you're familiar with Python it shouldn't be too hard, I hope. :)
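As a rough idea of what the processing step could look like (this is not the notebook itself), the sketch below turns a list of k6 result dictionaries into latency vs throughput rows and a CSV string. It assumes each result has the `metrics` → name → `values` shape used in the pseudocode above, and that latency is read from k6's built-in `http_req_duration` trend under the `p(95)` key. The function names are hypothetical.

```python
import csv
import io
from typing import Dict, List


def latency_throughput_rows(summaries: List[dict]) -> List[Dict[str, float]]:
    """Extract one (throughput, p95 latency) row per benchmark run."""
    rows = []
    for summary in summaries:
        metrics = summary["metrics"]
        rows.append({
            # Measured iterations per second for this run.
            "throughput_rps": metrics["iterations"]["values"]["rate"],
            # 95th percentile request duration in milliseconds (assumed key).
            "p95_ms": metrics["http_req_duration"]["values"]["p(95)"],
        })
    # Sort by throughput so the rows can be plotted directly as a curve.
    return sorted(rows, key=lambda row: row["throughput_rps"])


def to_csv(rows: List[Dict[str, float]]) -> str:
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["throughput_rps", "p95_ms"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

The CSV can then be loaded into a notebook or a plotting tool, with one file per gateway.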

If you're considering adding this benchmark, I would also suggest adding a 10ms subgraph delay like I did in the end. It's more representative of a real-world case. In the current benchmarks, such a delay isn't noticeable (unless you add huge delays) because most of the latency comes from CPU contention in the first place. Adding a few dozen ms of network delay can't meaningfully change latencies that are already on the order of seconds.

In your place, I would also consider removing Grafana, Prometheus, and cAdvisor, and instead only use docker stats or the Docker API to retrieve the resource consumption information. I would expect it to be the most precise measurement you can easily get, and it would avoid spending resources on monitoring. This is not a small change, so well. 🤷
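To illustrate the docker stats route, here is a small sketch that computes the peak CPU usage per container from JSON samples. It assumes the samples were collected with something like `docker stats --no-stream --format '{{json .}}'`, which prints one JSON object per container with fields such as `Name` and `CPUPerc` (for example `"187.50%"`). The function name and the way samples are gathered are assumptions, not what the blog post used.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable


def max_cpu_by_container(lines: Iterable[str]) -> Dict[str, float]:
    """Return the highest CPU percentage seen for each container name."""
    peaks: Dict[str, float] = defaultdict(float)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        sample = json.loads(line)
        cpu_text = sample["CPUPerc"]
        if not cpu_text.endswith("%"):
            # Skip placeholder values reported for containers without stats.
            continue
        cpu = float(cpu_text.rstrip("%"))
        peaks[sample["Name"]] = max(peaks[sample["Name"]], cpu)
    return dict(peaks)
```

The peak value can then be compared against the CPU limit to decide whether a given run was CPU-bound.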

Regarding using the host network for Docker, I did see an impact for Grafbase when there was no network delay. I don't remember whether it made a difference for the others. So I would tend to recommend it, but I'm not sure whether this works on macOS/Windows.

the-guild-org / gateways-benchmark

Latency vs Throughput benchmark #370