mtrudel / bandit

Bandit is a pure Elixir HTTP server for Plug & WebSock applications

Bandit collapsing under load #412

Open atavistock opened 2 weeks ago

atavistock commented 2 weeks ago

I've been working on performance benchmarks for Phoenix over at https://www.techempower.com/benchmarks and we've tuned a lot of things (with Bandit helping a lot!). Unfortunately I think I've discovered a load-related bug in Bandit (or ThousandIsland) in the process.

We have an extremely small controller function that simply sends plain text: https://github.com/TechEmpower/FrameworkBenchmarks/blob/7aca29cd3f433f3a9e48fa8436bd44fe670358e1/frameworks/Elixir/phoenix/lib/hello_web/controllers/page_controller.ex#L81 . Notably, the plug chain is also intentionally as minimal as it can be, so this is as close as we can get to the simplest controller action that could be triggered.

The automated tests were showing no successful connections. After some research, it looks like the server stopped accepting requests after the warmup phase. That's 512 concurrent workers with 10 threads each; the stats look great until it just stops accepting requests.

So I set up some local tests and just used Apache Bench with 160 concurrent workers sending a total of 10,000 requests (ab -c 160 -n 10000 http://localhost:4000/plaintext). It looks like while Cowboy is slow and gets even slower under load, it eventually returns every one of those requests (with a P95 of 14 seconds). By contrast, Bandit just blazes through 6000 requests but never makes it to 7000, as it seems to start dropping every incoming connection and just freezes for 20-30 seconds.

It's fair to say this is probably not reflective of most real-world use cases, but it's not actually unreasonable for a high-load site that has a status or ping endpoint for client presence. In any case... Let me know if there's any more information that would help or if there's anything I can assist with.

mtrudel commented 2 weeks ago

Thanks for this! The 'timing out for 30s' sounds an awful lot like a TIME_WAIT issue. What OSs are you on in both of those scenarios, and what clients are running in the automated test?

atavistock commented 2 weeks ago

Sure.

My local tests are using Apache Bench on a MacBook Pro M2 running macOS Sonoma 14.5

I'm not 100% confident about the TechEmpower setup. To the best of my knowledge, they run the tests twice, once on a Dell R440 Xeon Gold and once in some cloud provider, but the tests actually run in a Docker container using Ubuntu 24.04. The tests themselves look bespoke, using Python, Lua, and https://github.com/wg/wrk

atavistock commented 2 weeks ago

I was working with @josevalim on this and he offered to lend a hand if there's anything we could do.

mtrudel commented 1 day ago

Can confirm that local tests on macOS with ab hang in a TIME_WAIT state (repro: set up a local Bandit, run ab -c 160 -n 10000 http://localhost:4000/plaintext locally and watch the output of netstat -n). This is as expected with ab; it's a deeply flawed tool that has a ton of straight-up-broken aspects. In this case, since it disables keepalive by default, Bandit is forced to close the connection on every request, which very quickly consumes the ~16k or so valid {src_port, dest_port} pairs that need to linger in TIME_WAIT until they expire.
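
As a rough back-of-the-envelope sketch of why a no-keepalive run stalls, the arithmetic can be written out as below. The numbers are assumptions rather than measurements from this thread: 16384 ports corresponds to the "~16k" figure (and to an ephemeral range of 49152-65535), and the 30 second TIME_WAIT is chosen to line up with the 20-30 second freezes reported above. The function names are hypothetical, and the model ignores how long each connection stays open before it is closed.

```python
from typing import Optional


def max_new_connections_per_second(ephemeral_ports: int, time_wait_seconds: float) -> float:
    """Highest steady rate of fresh connections to one destination port
    that never runs out of source ports, if each port is unusable for
    time_wait_seconds after its connection closes."""
    return ephemeral_ports / time_wait_seconds


def seconds_until_exhaustion(
    ephemeral_ports: int, connections_per_second: float, time_wait_seconds: float
) -> Optional[float]:
    """Time until every port is stuck in TIME_WAIT, or None if the rate
    is low enough that ports are recycled before the pool is used up."""
    if connections_per_second * time_wait_seconds <= ephemeral_ports:
        return None
    return ephemeral_ports / connections_per_second


# Assumed values, see the note above.
PORTS = 65535 - 49152 + 1   # 16384 ephemeral ports
TIME_WAIT = 30.0            # seconds a closed connection lingers

print(max_new_connections_per_second(PORTS, TIME_WAIT))
print(seconds_until_exhaustion(PORTS, 5000, TIME_WAIT))
```

Under these assumptions, anything much above roughly 550 new connections per second to the same destination port will eventually exhaust the pool, which a fast server on localhost reaches almost immediately when every request opens a new connection.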

mtrudel commented 1 day ago

I should also note that Cowboy is not immune to the TIME_WAIT issue either (nor is anyone else; it's inherent in TCP). What you were seeing above with long timeouts in Cowboy was probably just a fortunately timed set of requests where you managed to get in after the TIME_WAIT window expired. The fundamental issue underneath is still there.

I did, however, find an issue in how Bandit handles keepalives for HTTP/1.0 connections, which would cause ab to hang on Bandit when the -k option was provided. I've fixed this in main and will be shipping it out in the next point release (coming in the next few days). If the TechEmpower benchmarks also use HTTP/1.0 clients, that would explain what they're seeing as well.
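
For context on why HTTP/1.0 needs separate handling, here is a minimal sketch of the persistence rule from the HTTP/1.1 specification (RFC 9112, section 9.3), ignoring the proxy special case. It is not Bandit's code, and `is_persistent` is a hypothetical name used only for illustration. The key difference is the default: HTTP/1.1 connections persist unless told to close, while HTTP/1.0 connections close unless the client explicitly asks for keep-alive, which is what an HTTP/1.0 client such as ab with -k does.

```python
from typing import Optional


def is_persistent(http_version: str, connection_header: Optional[str]) -> bool:
    """Decide whether a connection should stay open after the response."""
    # The Connection header is a comma-separated, case-insensitive token list.
    tokens = {
        token.strip().lower()
        for token in (connection_header or "").split(",")
        if token.strip()
    }
    if "close" in tokens:
        return False          # an explicit close always wins
    if http_version == "HTTP/1.1":
        return True           # persistent by default
    if http_version == "HTTP/1.0":
        return "keep-alive" in tokens   # opt-in only
    return False              # unknown version: be conservative


print(is_persistent("HTTP/1.0", None))          # plain HTTP/1.0 request
print(is_persistent("HTTP/1.0", "Keep-Alive"))  # HTTP/1.0 with keep-alive requested
print(is_persistent("HTTP/1.1", None))          # HTTP/1.1 default
```

A server that gets the HTTP/1.0 branch wrong in either direction leaves one side waiting: either the client expects the socket to stay open and it is closed, or the client waits for a close that never comes.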

mtrudel commented 1 day ago

Fix for this issue is on main