the-guild-org / gateways-benchmark


Latency vs Throughput benchmark #370

Open Finistere opened 4 months ago

Finistere commented 4 months ago

Hello!

I'm creating this issue to discuss how this repository could reproduce what I did in the benchmark blog post.

Currently, all benchmarks in this repository work by running a single K6 load test with a given scenario. Unfortunately, that's not enough to generate the latency vs throughput graphs I produced: those need multiple benchmark runs, each with a fixed arrival rate. The tricky part is that each gateway has a different maximum throughput, so automating it would require a loop that does something like this:

# pseudo code

# Warmup for a few seconds
k6_run_fixed_arrival_rate(100)

# actual benchmark
arrival_rate = 100
step = 100
while True:
    results = k6_run_fixed_arrival_rate(arrival_rate)
    measured_rate = results['metrics']['iterations']['values']['rate']
    if abs(measured_rate - arrival_rate) < 3:
        # the gateway kept up: try a higher rate
        arrival_rate += step
    else:
        # the gateway fell behind: halve the step and back off
        step = step // 2
        if step <= 5:
            break
        arrival_rate -= step

That's roughly what I did. Manually 😢. The more difficult part is that I adjusted preAllocatedVUs between benchmark runs, but that only matters when the gateway is close to being CPU-bound. So for an automated benchmark, I would just set it to something high enough for all gateways and leave it at that. High enough means you should never see a run where the throughput measured by K6 falls short of the fixed arrival rate while the gateway wasn't CPU-bound (max CPU < CPU_LIMIT).
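For reference, here is a minimal runnable sketch of that loop in Python. It is not the script I used (I did it by hand): `sweep_arrival_rates`, `fake_gateway` and the `max_runs` safety cap are made-up names for illustration, and the `run` callable stands in for whatever actually launches K6 and returns the measured iteration rate. The step is halved with integer division so the arrival rate stays a whole number.

```python
from typing import Callable, List, Tuple


def sweep_arrival_rates(
    run: Callable[[int], float],
    start: int = 100,
    step: int = 100,
    tolerance: float = 3.0,
    min_step: int = 5,
    max_runs: int = 200,
) -> List[Tuple[int, float]]:
    """Return (target_rate, measured_rate) for every run, in order."""
    history: List[Tuple[int, float]] = []
    rate = start
    for _ in range(max_runs):  # safety cap, an addition to the loop above
        measured = run(rate)
        history.append((rate, measured))
        if abs(measured - rate) < tolerance:
            rate += step  # the gateway kept up: try a higher rate
        else:
            step //= 2  # the gateway fell behind: refine the search
            if step <= min_step:
                break
            rate -= step
    return history


def fake_gateway(rate: int) -> float:
    """Hypothetical stand-in for a K6 run: saturates at 450 iterations/s."""
    return float(min(rate, 450))


history = sweep_arrival_rates(fake_gateway)
# Rates at which the measured throughput matched the target.
sustained = [target for target, measured in history if abs(measured - target) < 3.0]
print(max(sustained))  # highest arrival rate the fake gateway sustained
```

Every entry in `history` is one point on the latency vs throughput graph, so in a real setup `run` would also save the K6 and docker stats output of that run.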

The last part is something that could be done with some bash. What's particularly unclear to me is how you should generate the results table/graph. In my benchmarks, I wrote the results from docker stats and k6 as JSON and processed them in a Python notebook, available here. I didn't clean it up, but if you're familiar with Python it shouldn't be too hard to follow, I hope. :)
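To give an idea of the processing step, here is a small sketch that turns one run into one table row. It is not the notebook code. It assumes the K6 summary has the `metrics -> <name> -> values` layout used in the pseudo code above, that `http_req_duration` carries a `p(95)` value (in ms), and that CPU samples are percentages in the same unit as the CPU limit. `summarize_run` and the `suspect_vus` flag are hypothetical names; the flag marks runs that missed the target rate without being CPU-bound, which is the sign that preAllocatedVUs was too low.

```python
from typing import Any, Dict, List


def summarize_run(
    target_rate: int,
    k6_summary: Dict[str, Any],
    cpu_percent_samples: List[float],
    cpu_limit: float,
    tolerance: float = 3.0,
) -> Dict[str, Any]:
    """Condense one benchmark run into a single row of the results table."""
    metrics = k6_summary["metrics"]
    measured = metrics["iterations"]["values"]["rate"]
    p95_ms = metrics["http_req_duration"]["values"]["p(95)"]
    max_cpu = max(cpu_percent_samples)
    kept_up = abs(measured - target_rate) < tolerance
    return {
        "target_rate": target_rate,
        "measured_rate": measured,
        "p95_ms": p95_ms,
        "max_cpu": max_cpu,
        "kept_up": kept_up,
        # Missed the target while not CPU-bound: preAllocatedVUs is suspect.
        "suspect_vus": (not kept_up) and max_cpu < cpu_limit,
    }


# Made-up numbers, only to show the shape of the input and output.
example_summary = {
    "metrics": {
        "iterations": {"values": {"rate": 412.7}},
        "http_req_duration": {"values": {"p(95)": 1830.0}},
    }
}
row = summarize_run(500, example_summary, [120.0, 180.5], cpu_limit=400.0)
print(row)
```

A list of such rows, one per arrival rate, is enough to plot latency against measured throughput for each gateway.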

If you're considering adding this benchmark, I would also suggest adding a 10ms subgraph delay, like I did in the end; it's more representative of a real-world case. In the current benchmarks, such a delay wouldn't be noticeable (unless you add huge delays) because most of the latency comes from CPU contention in the first place. Adding a few dozen ms of network delay can't meaningfully impact latencies on the order of seconds.

In your place, I would also consider removing Grafana, Prometheus, and cAdvisor and instead only use docker stats or the Docker API to retrieve the resource consumption information. I would expect it to be the most precise measurement you can easily get, and it would avoid spending resources on the monitoring stack itself. This is not a small change, though. 🤷
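As an illustration of how little is needed on the docker stats side, here is a sketch that parses its JSON output. It assumes the samples were collected with `docker stats --no-stream --format "{{json .}}"`, one JSON object per line, and it only relies on the `Name` and `CPUPerc` fields. `parse_percent`, `cpu_by_container` and the sample values are made up for the example.

```python
import json
from typing import Dict, List


def parse_percent(text: str) -> float:
    """Convert a docker stats percentage such as '187.52%' to a float."""
    return float(text.strip().rstrip("%"))


def cpu_by_container(lines: List[str]) -> Dict[str, List[float]]:
    """Group CPU percentage samples by container name."""
    samples: Dict[str, List[float]] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between samples
        record = json.loads(line)
        samples.setdefault(record["Name"], []).append(
            parse_percent(record["CPUPerc"])
        )
    return samples


# Made-up samples in the shape docker stats prints with the JSON format.
sample_lines = [
    '{"Name": "gateway", "CPUPerc": "187.52%", "MemUsage": "61MiB / 7.7GiB"}',
    '{"Name": "gateway", "CPUPerc": "201.10%", "MemUsage": "63MiB / 7.7GiB"}',
    '{"Name": "subgraph", "CPUPerc": "35.00%", "MemUsage": "40MiB / 7.7GiB"}',
]
cpu = cpu_by_container(sample_lines)
print(max(cpu["gateway"]))  # peak CPU of the gateway during the run
```

The peak value per run is what gets compared against the CPU limit to decide whether the gateway was CPU-bound.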

Regarding using the host network for docker, I did see an impact for Grafbase when there was no network delay. I don't remember if it made a difference for the others. So I would tend to recommend it, but I'm not sure whether this works on macOS/Windows.

Finistere commented 4 months ago

Thinking a bit more about it, I could automate what I did in Python with a docker image and some bash. It would at least give you a base to start from: a K6 metrics file and a docker stats file written for each run. But I'm not sure how helpful it would be for you to have something that different from the rest, especially if you're not familiar with Python. Unfortunately, JS is not my strength. ^^"
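Roughly, the per-run plumbing could look like the sketch below. It only builds the command lines and output paths; nothing is executed. The `RATE` and `SUMMARY_PATH` environment variables are assumptions: the K6 script would have to read them through `__ENV` and write its own summary file. The file naming and the function names are hypothetical too.

```python
from pathlib import Path
from typing import List


def k6_command(script: str, rate: int, out_dir: Path) -> List[str]:
    """Build the k6 invocation for one fixed-arrival-rate run."""
    summary = out_dir / f"k6_{rate}.json"
    return [
        "k6", "run",
        "-e", f"RATE={rate}",             # assumed to be read by the script
        "-e", f"SUMMARY_PATH={summary}",  # assumed to be read by the script
        script,
    ]


def docker_stats_command(container: str) -> List[str]:
    """Build a one-shot docker stats call that prints a single JSON line."""
    return ["docker", "stats", "--no-stream", "--format", "{{json .}}", container]


def stats_path(rate: int, out_dir: Path) -> Path:
    """Where the docker stats samples of one run would be appended."""
    return out_dir / f"docker_stats_{rate}.jsonl"


out = Path("results")
print(k6_command("benchmark.js", 300, out))
print(docker_stats_command("gateway"))
print(stats_path(300, out))
```

Each list can be handed to `subprocess.run` as-is, with the docker stats call repeated at a fixed interval while K6 is running.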