atavistock opened 2 weeks ago
Thanks for this! The 'timing out for 30s' sounds an awful lot like a `TIME_WAIT` issue. What OSs are you on in both of those scenarios, and what clients are running in the automated test?
Sure.
My local tests are using Apache Bench on a MacBook Pro M2 running macOS Sonoma 14.5.
I'm not 100% confident about the TechEmpower setup. To the best of my knowledge, they run the tests twice, once on a Dell R440 Xeon Gold and once on some cloud provider, but the tests themselves run in a Docker container using Ubuntu 24.04. The tests look bespoke, using Python, Lua, and https://github.com/wg/wrk
I was working with @josevalim on this and he offered to lend a hand if there's anything we could do.
Can confirm that local tests on macOS with `ab` hang in a `TIME_WAIT` state (repro: set up a local Bandit, run `ab -c 160 -n 10000 http://localhost:4000/plaintext` locally, and watch the output of `netstat -n`). This is as expected with `ab`; it's a deeply flawed tool that has a ton of straight-up-broken aspects, including in this case the fact that it disables keepalive by default, so Bandit is forced to close the connection on every request, which very quickly consumes the ~16k or so valid `{src_port, dest_port}` pairs that need to linger in `TIME_WAIT` until they expire.
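To put rough numbers on that (back-of-envelope, assuming stock macOS defaults rather than anything measured): the default ephemeral port range is 49152-65535, and `TIME_WAIT` lasts 2 × MSL, with `net.inet.tcp.msl` defaulting to 15 seconds. In Elixir terms:

```elixir
# Back-of-envelope only; both numbers are assumed macOS defaults, not measurements.
ephemeral_ports = 65_535 - 49_152 + 1  # default ephemeral range => 16_384 ports
time_wait_seconds = 2 * 15             # TIME_WAIT = 2 * MSL; net.inet.tcp.msl defaults to 15s

# With keepalive off, every request burns one {src_port, dest_port} pair for the
# full TIME_WAIT window, so sustainable connection churn tops out around:
IO.puts(div(ephemeral_ports, time_wait_seconds)) # => 546 new connections/second
```

That also lines up with the 20-30 second freezes in the original report: once the pairs are exhausted, nothing frees up until the `TIME_WAIT` window rolls over.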
I should also note that Cowboy is not immune to the `TIME_WAIT` issue either (nor is anyone else; it's inherent in TCP). What you were seeing above with long timeouts in Cowboy was probably just a fortunately timed set of requests where you managed to get in after the `TIME_WAIT` window expired. The fundamental issue underneath is still there.
I did, however, find an issue in how Bandit handles keepalives for HTTP/1.0 connections, which would cause `ab` to hang on Bandit when the `-k` option was provided. I've fixed this in main and will be shipping it out in the next point release (coming in the next few days). If the TechEmpower benchmarks also use HTTP/1.0 clients, that would explain what they're seeing as well.
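For context on why `-k` matters: the two protocol versions have opposite keepalive defaults. A minimal sketch of the decision logic involved (my own illustration of the HTTP semantics, not Bandit's actual code):

```elixir
# Illustration of HTTP keepalive defaults; not Bandit's implementation.
# HTTP/1.1 connections persist unless the client sends "Connection: close";
# HTTP/1.0 connections close unless the client sends "Connection: keep-alive"
# (which is exactly what ab's -k flag adds).
defmodule KeepaliveSketch do
  def keep_alive?({1, 1}, headers), do: connection_header(headers) != "close"
  def keep_alive?({1, 0}, headers), do: connection_header(headers) == "keep-alive"

  defp connection_header(headers) do
    case List.keyfind(headers, "connection", 0) do
      {_, value} -> String.downcase(value)
      nil -> nil
    end
  end
end
```

A bug on either branch shows up very differently depending on whether the client opts in, which is why plain `ab` and `ab -k` behaved so differently here.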
Fix for this issue is on `main`.
I've been working on performance benchmarks for Phoenix over at https://www.techempower.com/benchmarks and we've tuned things up quite a bit (with Bandit helping a lot!). Unfortunately, I think I've discovered a load-related bug in Bandit (or ThousandIsland) in the process.
We have an extremely small controller function that simply sends plain text: https://github.com/TechEmpower/FrameworkBenchmarks/blob/7aca29cd3f433f3a9e48fa8436bd44fe670358e1/frameworks/Elixir/phoenix/lib/hello_web/controllers/page_controller.ex#L81 . Notably, the plug chain is also intentionally as minimal as it can be, so this is about as close to the simplest controller action that could be triggered.
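For reference, the action is roughly this shape (a paraphrase, not the exact benchmark code; see the link above for the real thing):

```elixir
defmodule HelloWeb.PageController do
  use HelloWeb, :controller

  # Returns a fixed plaintext body; no templates, no extra plugs in the chain.
  def plaintext(conn, _params) do
    text(conn, "Hello, World!")
  end
end
```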
The automated tests were showing no successful connections. With some research, it looks like the server stopped accepting requests after the warmup phase. That's 512 concurrent workers with 10 threads each; the stats look great until it just stops accepting requests.
So I set up some local tests and just used Apache Bench with 160 concurrent workers sending a total of 10,000 requests (`ab -c 160 -n 10000 http://localhost:4000/plaintext`). It looks like while Cowboy is slow, and gets even slower under load, it eventually returns every one of those requests (with a P95 of 14 seconds). By contrast, Bandit just blazes through 6,000 requests but never makes it to 7,000, as it seems to start dropping every incoming connection and just freezes for 20-30 seconds.

It's fair to say this is probably not reflective of most real-world use cases, but it's not actually unreasonable for a high-load site that has a status or ping endpoint for client presence. In any case, let me know if there's any more information that would help or if there's anything I can assist with.