Open amitsaxena opened 3 months ago
Perhaps related: https://github.com/puma/puma/issues/2998#issuecomment-1863828162
Can we trust ab? https://httpd.apache.org/docs/2.4/programs/ab.html states under "Bugs":
It does not implement HTTP/1.x fully; only accepts some 'expected' forms of responses.
Puma has been benchmarked extensively using the wrk fork by @ioquatix (https://github.com/ioquatix/wrk); see https://github.com/puma/puma/blob/master/docs/testing_benchmarks_local_files.md for some docs about that.
I tried this myself just now:
$ WEB_CONCURRENCY=2 PUMA_MAX_THREADS=1 b e puma --log-requests hello.ru
[43957] Puma starting in cluster mode...
[43957] * Puma version: 6.4.2 (ruby 3.2.4-p170) ("The Eagle of Durango")
[43957] * Min threads: 0
[43957] * Max threads: 1
[43957] * Environment: development
[43957] * Master PID: 43957
[43957] * Workers: 2
[43957] * Restarts: (✔) hot (✖) phased
[43957] * Preloading application
[43957] * Listening on http://0.0.0.0:9292
[43957] Use Ctrl-C to stop
[43957] - Worker 0 (PID: 43960) booted in 0.0s, phase: 0
[43957] - Worker 1 (PID: 43961) booted in 0.0s, phase: 0
$ ./wrk --duration 10 --connections 2 --threads 1 --latency http://127.0.0.1:9292
Running 10s test @ http://127.0.0.1:9292
1 threads and 2 connections
connection 0: 241766 requests completed
connection 1: 241620 requests completed
Thread Stats Avg Stdev Max +/- Stdev
Latency 42.15us 14.34us 584.00us 96.02%
Req/Sec 48.09k 4.77k 57.63k 96.04%
Latency Distribution
50% 40.00us
75% 46.00us
90% 48.00us
99% 108.00us
483386 requests in 10.10s, 35.04MB read
Requests/sec: 47854.12
Transfer/sec: 3.47MB
I think you're just experiencing the nature of keepalive? Try more Puma threads and/or workers?
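For what it's worth, a sketch of the kind of config that suggestion implies is below; it's only illustrative (the config/puma.rb layout and the env var defaults are assumptions, not the reporter's actual setup):

```ruby
# Illustrative config/puma.rb sketch for tuning workers/threads via env vars.
# WEB_CONCURRENCY and PUMA_MAX_THREADS mirror the variables used in the
# commands in this thread; the defaults are made up for the example.
workers Integer(ENV.fetch("WEB_CONCURRENCY", 2))

max_threads = Integer(ENV.fetch("PUMA_MAX_THREADS", 5))
threads max_threads, max_threads

preload_app!
port ENV.fetch("PORT", 9292)
```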
Using https://github.com/rakyll/hey I see the same thing as you:
$ hey -n 5000 -c 100 http://127.0.0.1:9292
Summary:
Total: 3.5634 secs
Slowest: 1.7245 secs
Fastest: 0.0000 secs
Average: 0.0288 secs
Requests/sec: 1403.1650
Total data: 55000 bytes
Size/request: 11 bytes
Response time histogram:
0.000 [1] |
0.172 [4791] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
0.345 [114] |■
0.517 [7] |
0.690 [12] |
0.862 [10] |
1.035 [8] |
1.207 [11] |
1.380 [21] |
1.552 [11] |
1.725 [14] |
Latency distribution:
10% in 0.0000 secs
25% in 0.0001 secs
50% in 0.0001 secs
75% in 0.0002 secs
90% in 0.0028 secs
95% in 0.0411 secs
99% in 1.0752 secs
Details (average, fastest, slowest):
DNS+dialup: 0.0000 secs, 0.0000 secs, 1.7245 secs
DNS-lookup: 0.0000 secs, 0.0000 secs, 0.0000 secs
req write: 0.0000 secs, 0.0000 secs, 0.0012 secs
resp wait: 0.0288 secs, 0.0000 secs, 1.7245 secs
resp read: 0.0000 secs, 0.0000 secs, 0.0019 secs
Status code distribution:
[200] 5000 responses
$ hey -n 5000 -c 100 -disable-keepalive http://127.0.0.1:9292
Summary:
Total: 0.1860 secs
Slowest: 0.0084 secs
Fastest: 0.0017 secs
Average: 0.0037 secs
Requests/sec: 26887.0389
Total data: 55000 bytes
Size/request: 11 bytes
Response time histogram:
0.002 [1] |
0.002 [17] |
0.003 [816] |■■■■■■■■■■■
0.004 [2902] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
0.004 [792] |■■■■■■■■■■■
0.005 [80] |■
0.006 [67] |■
0.006 [151] |■■
0.007 [64] |■
0.008 [105] |■
0.008 [5] |
Latency distribution:
10% in 0.0030 secs
25% in 0.0032 secs
50% in 0.0034 secs
75% in 0.0038 secs
90% in 0.0043 secs
95% in 0.0060 secs
99% in 0.0073 secs
Details (average, fastest, slowest):
DNS+dialup: 0.0003 secs, 0.0017 secs, 0.0084 secs
DNS-lookup: 0.0000 secs, 0.0000 secs, 0.0000 secs
req write: 0.0000 secs, 0.0000 secs, 0.0015 secs
resp wait: 0.0033 secs, 0.0012 secs, 0.0050 secs
resp read: 0.0000 secs, 0.0000 secs, 0.0020 secs
Status code distribution:
[200] 5000 responses
If I run Puma with more threads (WEB_CONCURRENCY=2 b e puma --threads 16:16 hello.ru), Requests/sec matches the numbers we get with -disable-keepalive and 1 thread per Puma worker:
$ hey -n 5000 -c 100 http://127.0.0.1:9292
Summary:
Total: 0.3199 secs
Slowest: 0.3117 secs
Fastest: 0.0000 secs
Average: 0.0031 secs
Requests/sec: 15631.2688
Total data: 55000 bytes
Size/request: 11 bytes
Response time histogram:
0.000 [1] |
0.031 [4911] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
0.062 [10] |
0.094 [10] |
0.125 [11] |
0.156 [8] |
0.187 [10] |
0.218 [9] |
0.249 [15] |
0.281 [9] |
0.312 [6] |
Latency distribution:
10% in 0.0001 secs
25% in 0.0001 secs
50% in 0.0001 secs
75% in 0.0001 secs
90% in 0.0001 secs
95% in 0.0001 secs
99% in 0.1502 secs
Details (average, fastest, slowest):
DNS+dialup: 0.0000 secs, 0.0000 secs, 0.3117 secs
DNS-lookup: 0.0000 secs, 0.0000 secs, 0.0000 secs
req write: 0.0000 secs, 0.0000 secs, 0.0009 secs
resp wait: 0.0030 secs, 0.0000 secs, 0.3103 secs
resp read: 0.0000 secs, 0.0000 secs, 0.0002 secs
Status code distribution:
[200] 5000 responses
I think you're just experiencing the nature of keepalive?
@dentarg I am not sure if I understand. Can you please elaborate? 🙏
Can we trust ab? https://httpd.apache.org/docs/2.4/programs/ab.html states under "Bugs"
The apache bench tests were only used to reproduce the problem locally. Our AWS containers were benchmarked using https://loader.io/ . So it's not necessarily dependent on apache bench, I think.
Try more Puma threads and/or workers?
Not sure if that's a solution; to me it sounds like throwing money at the problem. What we are doing here is benchmarking similar hardware on Heroku against AWS, and we are seeing drastically worse latencies on AWS with identical stacks and the same Puma configuration. So tuning Puma threads and workers isn't the objective here; the stark difference in latencies between identical hardware and config is the matter of concern. Effectively we are getting a very poor price-to-performance ratio on AWS due to this issue.
But you have completely different software in front of Puma with Heroku vs AWS?
You mean the app which is being benchmarked? It's the same app, with exactly the same Puma config and environment variables.
If you meant the load testing tool, then both were benchmarked using loader.io and also using apache bench, and the problem is easily reproducible on each run.
Each set of tests was conducted using identical tooling and on exactly the same app. The only variable is Heroku vs AWS here.
And the keep-alive issue was identified after recording all the traffic on AWS using tcpdump and looking at request headers for different kinds of traffic. To zero in on the issue we started hitting the Fargate container directly from within the AWS VPC, via the load balancer, using the public IP, etc., and eventually found this difference. When coming via the load balancer we were seeing high latencies (due to keep-alive) compared to when hitting the container IP directly using apache bench. After this we were easily able to reproduce the issue by directly hitting the public IP of the Fargate container with and without the keep-alive option using apache bench.
Another data point: when looking at the number of connections in cat /proc/net/tcp during load tests, I found 312 entries in the high-latency case and >13K entries for a low-latency test. Not sure if it helps, but throwing it in here in case it gives some indication.
Also, at no point during the tests were we hitting 100% CPU or memory, so resources aren't the problem here.
If you aren't maxing out resources, you have room for more Puma threads? :)
I can understand that you expect similar performance or better with keep-alive. It makes sense. Someone needs to check why that isn't happening with Puma and one thread. I may look if/when I have some time.
The reason why you don't have the problem with Heroku looks to be that keep-alive isn't used between the Heroku router and your app (Puma): https://devcenter.heroku.com/articles/http-routing#http-versions-supported
If you aren't maxing out resources, you have room for more Puma threads? :)
Haha! I hear you :) The benchmarks were conducted on a simple health check which just returns the current timestamp, so as not to introduce variables like the database, cache, application code, etc. So not using 100% CPU and memory in the benchmarks doesn't necessarily mean that we can increase the number of threads; actual production traffic memory consumption looks different. We can go with bigger containers to solve it temporarily, but IMHO the base problem stays even when increasing the number of threads.
Someone needs to check why that isn't happening with Puma and one thread
I don't think it's a one-thread-only problem. I can reproduce it with 2 and 4 threads on the sample as well; please look at the apache bench results below. I think what we are doing here is generating a much larger amount of parallelization, and that reduces the average latency. If we go too high on the resource pool (Puma threads), the massive parallelization will reduce the average latency to comparable levels, but that doesn't necessarily mean the problem disappears at a high thread count. If I go higher on concurrency with the increased number of threads, I can easily reproduce the problem again.
2 threads without keep-alive:
$ ab -n 5000 -c 100 http://0.0.0.0:9292/
This is ApacheBench, Version 2.3 <$Revision: 1913912 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 0.0.0.0 (be patient)
Completed 500 requests
Completed 1000 requests
Completed 1500 requests
Completed 2000 requests
Completed 2500 requests
Completed 3000 requests
Completed 3500 requests
Completed 4000 requests
Completed 4500 requests
Completed 5000 requests
Finished 5000 requests
Server Software:
Server Hostname: 0.0.0.0
Server Port: 9292
Document Path: /
Document Length: 11 bytes
Concurrency Level: 100
Time taken for tests: 0.199 seconds
Complete requests: 5000
Failed requests: 0
Total transferred: 380000 bytes
HTML transferred: 55000 bytes
Requests per second: 25151.28 [#/sec] (mean)
Time per request: 3.976 [ms] (mean)
Time per request: 0.040 [ms] (mean, across all concurrent requests)
Transfer rate: 1866.70 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.8 0 4
Processing: 1 4 0.5 4 9
Waiting: 1 4 0.5 4 9
Total: 2 4 1.0 4 12
Percentage of the requests served within a certain time (ms)
50% 4
66% 4
75% 4
80% 4
90% 4
95% 7
98% 7
99% 8
100% 12 (longest request)
2 threads with keep-alive:
$ ab -k -n 5000 -c 100 http://0.0.0.0:9292/
This is ApacheBench, Version 2.3 <$Revision: 1913912 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 0.0.0.0 (be patient)
Completed 500 requests
Completed 1000 requests
Completed 1500 requests
Completed 2000 requests
Completed 2500 requests
Completed 3000 requests
Completed 3500 requests
Completed 4000 requests
Completed 4500 requests
Completed 5000 requests
Finished 5000 requests
Server Software:
Server Hostname: 0.0.0.0
Server Port: 9292
Document Path: /
Document Length: 11 bytes
Concurrency Level: 100
Time taken for tests: 5.176 seconds
Complete requests: 5000
Failed requests: 0
Keep-Alive requests: 4910
Total transferred: 497840 bytes
HTML transferred: 55000 bytes
Requests per second: 966.06 [#/sec] (mean)
Time per request: 103.514 [ms] (mean)
Time per request: 1.035 [ms] (mean, across all concurrent requests)
Transfer rate: 93.93 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.5 0 7
Processing: 0 53 421.6 0 5066
Waiting: 0 53 421.6 0 5066
Total: 0 53 421.6 0 5066
Percentage of the requests served within a certain time (ms)
50% 0
66% 0
75% 0
80% 0
90% 0
95% 0
98% 102
99% 2612
100% 5066 (longest request)
More than 25x the latency with keep-alive.
4 threads without keep-alive:
$ ab -n 5000 -c 100 http://0.0.0.0:9292/
This is ApacheBench, Version 2.3 <$Revision: 1913912 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 0.0.0.0 (be patient)
Completed 500 requests
Completed 1000 requests
Completed 1500 requests
Completed 2000 requests
Completed 2500 requests
Completed 3000 requests
Completed 3500 requests
Completed 4000 requests
Completed 4500 requests
Completed 5000 requests
Finished 5000 requests
Server Software:
Server Hostname: 0.0.0.0
Server Port: 9292
Document Path: /
Document Length: 11 bytes
Concurrency Level: 100
Time taken for tests: 0.203 seconds
Complete requests: 5000
Failed requests: 0
Total transferred: 380000 bytes
HTML transferred: 55000 bytes
Requests per second: 24622.78 [#/sec] (mean)
Time per request: 4.061 [ms] (mean)
Time per request: 0.041 [ms] (mean, across all concurrent requests)
Transfer rate: 1827.47 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.8 0 5
Processing: 2 4 0.5 4 7
Waiting: 1 4 0.5 4 7
Total: 3 4 1.0 4 11
Percentage of the requests served within a certain time (ms)
50% 4
66% 4
75% 4
80% 4
90% 4
95% 6
98% 7
99% 9
100% 11 (longest request)
4 threads with keep-alive:
$ ab -k -n 5000 -c 100 http://0.0.0.0:9292/
This is ApacheBench, Version 2.3 <$Revision: 1913912 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 0.0.0.0 (be patient)
Completed 500 requests
Completed 1000 requests
Completed 1500 requests
Completed 2000 requests
Completed 2500 requests
Completed 3000 requests
Completed 3500 requests
Completed 4000 requests
Completed 4500 requests
Completed 5000 requests
Finished 5000 requests
Server Software:
Server Hostname: 0.0.0.0
Server Port: 9292
Document Path: /
Document Length: 11 bytes
Concurrency Level: 100
Time taken for tests: 2.617 seconds
Complete requests: 5000
Failed requests: 0
Keep-Alive requests: 4909
Total transferred: 497816 bytes
HTML transferred: 55000 bytes
Requests per second: 1910.25 [#/sec] (mean)
Time per request: 52.349 [ms] (mean)
Time per request: 0.523 [ms] (mean, across all concurrent requests)
Transfer rate: 185.73 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.5 0 4
Processing: 0 27 207.4 0 2525
Waiting: 0 27 207.4 0 2525
Total: 0 27 207.4 0 2525
Percentage of the requests served within a certain time (ms)
50% 0
66% 0
75% 0
80% 0
90% 0
95% 0
98% 81
99% 1329
100% 2525 (longest request)
More than 13x the latency with keep-alive.
Results with 16 Puma threads and 2 workers. The difference in latencies is still clearly visible (4ms vs 14ms) IMHO. In the tests below you can also clearly see the latency increase as I keep going up on concurrency. I couldn't go beyond 140 because apache bench crashes then.
$ ab -n 5000 -c 100 http://0.0.0.0:9292/
This is ApacheBench, Version 2.3 <$Revision: 1913912 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 0.0.0.0 (be patient)
Completed 500 requests
Completed 1000 requests
Completed 1500 requests
Completed 2000 requests
Completed 2500 requests
Completed 3000 requests
Completed 3500 requests
Completed 4000 requests
Completed 4500 requests
Completed 5000 requests
Finished 5000 requests
Server Software:
Server Hostname: 0.0.0.0
Server Port: 9292
Document Path: /
Document Length: 11 bytes
Concurrency Level: 100
Time taken for tests: 0.202 seconds
Complete requests: 5000
Failed requests: 0
Total transferred: 380000 bytes
HTML transferred: 55000 bytes
Requests per second: 24722.98 [#/sec] (mean)
Time per request: 4.045 [ms] (mean)
Time per request: 0.040 [ms] (mean, across all concurrent requests)
Transfer rate: 1834.91 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.8 0 5
Processing: 3 4 0.5 4 7
Waiting: 1 4 0.5 4 7
Total: 3 4 1.1 4 10
Percentage of the requests served within a certain time (ms)
50% 4
66% 4
75% 4
80% 4
90% 5
95% 7
98% 7
99% 9
100% 10 (longest request)
$ ab -k -n 5000 -c 100 http://0.0.0.0:9292/
This is ApacheBench, Version 2.3 <$Revision: 1913912 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 0.0.0.0 (be patient)
Completed 500 requests
Completed 1000 requests
Completed 1500 requests
Completed 2000 requests
Completed 2500 requests
Completed 3000 requests
Completed 3500 requests
Completed 4000 requests
Completed 4500 requests
Completed 5000 requests
Finished 5000 requests
Server Software:
Server Hostname: 0.0.0.0
Server Port: 9292
Document Path: /
Document Length: 11 bytes
Concurrency Level: 100
Time taken for tests: 0.720 seconds
Complete requests: 5000
Failed requests: 0
Keep-Alive requests: 4903
Total transferred: 497672 bytes
HTML transferred: 55000 bytes
Requests per second: 6944.43 [#/sec] (mean)
Time per request: 14.400 [ms] (mean)
Time per request: 0.144 [ms] (mean, across all concurrent requests)
Transfer rate: 675.01 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.3 0 4
Processing: 0 7 47.8 0 629
Waiting: 0 7 47.8 0 629
Total: 0 7 47.8 0 629
Percentage of the requests served within a certain time (ms)
50% 0
66% 0
75% 0
80% 1
90% 1
95% 1
98% 68
99% 298
100% 629 (longest request)
Higher concurrency tests:
$ ab -k -n 5000 -c 110 http://0.0.0.0:9292/
This is ApacheBench, Version 2.3 <$Revision: 1913912 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 0.0.0.0 (be patient)
Completed 500 requests
Completed 1000 requests
Completed 1500 requests
Completed 2000 requests
Completed 2500 requests
Completed 3000 requests
Completed 3500 requests
Completed 4000 requests
Completed 4500 requests
Completed 5000 requests
Finished 5000 requests
Server Software:
Server Hostname: 0.0.0.0
Server Port: 9292
Document Path: /
Document Length: 11 bytes
Concurrency Level: 110
Time taken for tests: 0.754 seconds
Complete requests: 5000
Failed requests: 0
Keep-Alive requests: 4896
Total transferred: 497504 bytes
HTML transferred: 55000 bytes
Requests per second: 6630.44 [#/sec] (mean)
Time per request: 16.590 [ms] (mean)
Time per request: 0.151 [ms] (mean, across all concurrent requests)
Transfer rate: 644.27 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.3 0 3
Processing: 0 8 56.6 0 657
Waiting: 0 8 56.6 0 657
Total: 0 9 56.6 0 657
Percentage of the requests served within a certain time (ms)
50% 0
66% 0
75% 1
80% 1
90% 1
95% 1
98% 86
99% 327
100% 657 (longest request)
$ ab -k -n 5000 -c 120 http://0.0.0.0:9292/
This is ApacheBench, Version 2.3 <$Revision: 1913912 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 0.0.0.0 (be patient)
Completed 500 requests
Completed 1000 requests
Completed 1500 requests
Completed 2000 requests
Completed 2500 requests
Completed 3000 requests
Completed 3500 requests
Completed 4000 requests
Completed 4500 requests
Completed 5000 requests
Finished 5000 requests
Server Software:
Server Hostname: 0.0.0.0
Server Port: 9292
Document Path: /
Document Length: 11 bytes
Concurrency Level: 120
Time taken for tests: 0.787 seconds
Complete requests: 5000
Failed requests: 0
Keep-Alive requests: 4860
Total transferred: 496640 bytes
HTML transferred: 55000 bytes
Requests per second: 6357.27 [#/sec] (mean)
Time per request: 18.876 [ms] (mean)
Time per request: 0.157 [ms] (mean, across all concurrent requests)
Transfer rate: 616.66 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.5 0 5
Processing: 0 10 64.5 0 691
Waiting: 0 10 64.5 0 691
Total: 0 10 64.5 0 691
Percentage of the requests served within a certain time (ms)
50% 0
66% 0
75% 1
80% 1
90% 1
95% 7
98% 95
99% 460
100% 691 (longest request)
$ ab -k -n 5000 -c 130 http://0.0.0.0:9292/
This is ApacheBench, Version 2.3 <$Revision: 1913912 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 0.0.0.0 (be patient)
Completed 500 requests
Completed 1000 requests
Completed 1500 requests
Completed 2000 requests
Completed 2500 requests
Completed 3000 requests
Completed 3500 requests
Completed 4000 requests
Completed 4500 requests
Completed 5000 requests
Finished 5000 requests
Server Software:
Server Hostname: 0.0.0.0
Server Port: 9292
Document Path: /
Document Length: 11 bytes
Concurrency Level: 130
Time taken for tests: 0.939 seconds
Complete requests: 5000
Failed requests: 0
Keep-Alive requests: 4909
Total transferred: 497816 bytes
HTML transferred: 55000 bytes
Requests per second: 5325.01 [#/sec] (mean)
Time per request: 24.413 [ms] (mean)
Time per request: 0.188 [ms] (mean, across all concurrent requests)
Transfer rate: 517.75 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.5 0 4
Processing: 0 12 75.0 0 843
Waiting: 0 12 75.0 0 843
Total: 0 12 75.0 0 843
Percentage of the requests served within a certain time (ms)
50% 0
66% 0
75% 1
80% 1
90% 1
95% 1
98% 152
99% 508
100% 843 (longest request)
$ ab -k -n 5000 -c 140 http://0.0.0.0:9292/
This is ApacheBench, Version 2.3 <$Revision: 1913912 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 0.0.0.0 (be patient)
Completed 500 requests
Completed 1000 requests
Completed 1500 requests
Completed 2000 requests
Completed 2500 requests
apr_socket_recv: Connection reset by peer (54)
Total of 2660 requests completed
If I add max_fast_inline 0 to the Puma config, hey -n 5000 -c 100 http://127.0.0.1:9292 gives me Requests/sec: 26195.1310 for 2 puma workers with 1 thread each.
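For anyone who wants to try that locally, a sketch of what such a config could look like (the 2 workers / 1 thread layout matches the test above; everything else is assumed):

```ruby
# Sketch of a config/puma.rb for the test above (layout assumed).
workers 2
threads 1, 1
preload_app!

# Don't serve follow-up keep-alive requests "inline"; always hand the client
# back to the reactor so requests are subject to normal ordering.
max_fast_inline 0
```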
$ hey -n 5000 -c 100 http://127.0.0.1:9292
Summary:
Total: 0.1909 secs
Slowest: 0.0107 secs
Fastest: 0.0017 secs
Average: 0.0038 secs
Requests/sec: 26195.1310
Total data: 55000 bytes
Size/request: 11 bytes
Response time histogram:
0.002 [1] |
0.003 [25] |
0.004 [2389] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
0.004 [2044] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
0.005 [189] |■■■
0.006 [127] |■■
0.007 [99] |■■
0.008 [59] |■
0.009 [50] |■
0.010 [11] |
0.011 [6] |
Latency distribution:
10% in 0.0031 secs
25% in 0.0033 secs
50% in 0.0035 secs
75% in 0.0037 secs
90% in 0.0045 secs
95% in 0.0060 secs
99% in 0.0083 secs
Details (average, fastest, slowest):
DNS+dialup: 0.0001 secs, 0.0017 secs, 0.0107 secs
DNS-lookup: 0.0000 secs, 0.0000 secs, 0.0000 secs
req write: 0.0000 secs, 0.0000 secs, 0.0011 secs
resp wait: 0.0035 secs, 0.0007 secs, 0.0068 secs
resp read: 0.0000 secs, 0.0000 secs, 0.0020 secs
Status code distribution:
[200] 5000 responses
The reason why you don't have the problem with Heroku looks to be that keep-alive isn't used between the Heroku router and your app (Puma): https://devcenter.heroku.com/articles/http-routing#http-versions-supported
Also thanks for this! 🙏 ⭐
This clearly validates my hypothesis as that's the only difference between the 2 environments - AWS and Heroku. ALB uses keep-alive whereas Heroku router doesn't.
The router will assume that the client wants to close the connection on each request (no keep-alive).
So theoretically speaking, if Heroku started using keep-alive in their router as an enhancement, all apps using Puma would perform dramatically worse than they do today :)
And thanks a lot for the engagement @dentarg and for debugging this with me! Really appreciate it! ❤️
When I checked this last (several years ago), Puma's implementation of keep-alive wasn't particularly fair. Basically, if you had 2 workers, and 3 connections (all keep-alive), one of the connections wouldn't get serviced.
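(As a rough way to observe per-connection counts without a modified wrk, a small Ruby sketch like the one below could be used against a local Puma assumed to be listening on 127.0.0.1:9292; the wrk runs that follow are the actual measurements.)

```ruby
# Rough sketch: open a few keep-alive client sessions against a local Puma
# and count how many requests each completes in a fixed window, to get a
# feel for per-connection fairness. Net::HTTP reuses the TCP connection
# within a session (and transparently reconnects if the server closes it).
require "net/http"

CONNECTIONS = 3
DURATION = 5 # seconds

counts = Array.new(CONNECTIONS, 0)
threads = CONNECTIONS.times.map do |i|
  Thread.new do
    Net::HTTP.start("127.0.0.1", 9292) do |http|
      deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + DURATION
      while Process.clock_gettime(Process::CLOCK_MONOTONIC) < deadline
        http.get("/")        # same session, so keep-alive where possible
        counts[i] += 1
      end
    end
  end
end
threads.each(&:join)
counts.each_with_index { |c, i| puts "connection #{i}: #{c} requests completed" }
```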
You can see the behaviour today using a modified version of wrk, running with puma -t1 for demonstration purposes:
> ./wrk -c 1 -t 1 -d 1 http://localhost:9292/hello
Running 1s test @ http://localhost:9292/hello
1 threads and 1 connections
closing connection after 40442 requests
connection 0: 40442 requests completed
Thread Stats Avg Stdev Max +/- Stdev
Latency 25.60us 12.42us 404.00us 97.63%
Req/Sec 37.02k 12.26k 41.54k 90.91%
40442 requests in 1.10s, 1.97MB read
Requests/sec: 36753.96
Transfer/sec: 1.79MB
As soon as you do this with 2 connections, the keep-alive recycling logic kicks in:
> ./wrk -c 2 -t 1 -d 1 http://localhost:9292/hello
... snip ...
closing connection after 16720 requests
closing connection after 16730 requests
closing connection after 16730 requests
closing connection after 16740 requests
closing connection after 16740 requests
closing connection after 16750 requests
closing connection after 16750 requests
closing connection after 16760 requests
closing connection after 16760 requests
closing connection after 16770 requests
closing connection after 16770 requests
closing connection after 16780 requests
closing connection after 16780 requests
closing connection after 16790 requests
closing connection after 16790 requests
closing connection after 16800 requests
closing connection after 16800 requests
closing connection after 16810 requests
closing connection after 16810 requests
closing connection after 16820 requests
closing connection after 16820 requests
closing connection after 16830 requests
closing connection after 16830 requests
closing connection after 16840 requests
closing connection after 16840 requests
closing connection after 16850 requests
closing connection after 16850 requests
closing connection after 16851 requests
connection 0: 16851 requests completed
connection 1: 16850 requests completed
Thread Stats Avg Stdev Max +/- Stdev
Latency 98.50us 116.35us 0.89ms 81.10%
Req/Sec 30.85k 10.19k 35.31k 90.91%
33701 requests in 1.10s, 1.70MB read
Requests/sec: 30619.65
Transfer/sec: 1.54MB
That means if you have more connections than worker threads, you will start running into issues. As you can see above, the latency wasn't just 2x worse (for 2x the contention); it was 10x worse, likely due to all the reconnecting going on.
ALB uses keep-alive whereas Heroku router doesn't.
If you are running Puma behind a load balancer, the load balancer should ideally be configured with one keep-alive connection per worker thread. This is the absolute best arrangement; additional connections will not help and will in fact make things worse.
If I add max_fast_inline 0 to the Puma config, hey -n 5000 -c 100 http://127.0.0.1:9292 gives me Requests/sec: 26195.1310 for 2 puma workers with 1 thread each.
queue_requests false in the config achieves the same (the docs say it disables HTTP keep-alive)
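As a sketch, that alternative would look like this in the config (with the documented caveat that it disables HTTP keep-alive entirely):

```ruby
# Sketch: disable Puma's request buffering/queueing in the reactor.
# Per the Puma docs referenced above, this also disables HTTP keep-alive.
queue_requests false
```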
I think https://github.com/puma/puma/issues/2311 is the same issue as this
Not sure if this is helpful, but there are really only two scenarios you should care about, considering the design of Puma:
1/ Browser-facing server: every request should be queued and served in order.
2/ Load-balancer-facing server: every worker should have one connection to the load balancer.
Connections should not be closed in either case. In (1) you should be internally queueing things, so closing the connection should not be necessary for "fast connection handling", and (2) should never overload the number of workers, so it should never need queueing (the LB should be taking care of that). Every other configuration will, IMHO, be worse for performance/latency/throughput.
Using keep-alive we end up here: in Client#reset there is this line:
https://github.com/puma/puma/blob/6db7c28b069710419cc0a78dbc93c351b222a8e4/lib/puma/client.rb#L185
If I lower this value to 0.002, I get similar throughput as with -disable-keepalive (benchmarked with hey -n 5000 -c 100 http://127.0.0.1:9292):
https://github.com/puma/puma/blob/6db7c28b069710419cc0a78dbc93c351b222a8e4/lib/puma/const.rb#L108
FAST_TRACK_KA_TIMEOUT was introduced in b9388a6074a816660be0d6dbb49ee802f13507d5
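For reproducing that experiment without patching the gem, one option is to override the internal constant from the top of the Puma config; this is purely an experimental sketch (not a supported Puma API), and overriding an internal constant can break between Puma releases:

```ruby
# Experimental sketch only: shrink the fast-track keep-alive wait used in
# Client#reset to the 0.002 value tried above. FAST_TRACK_KA_TIMEOUT is an
# internal Puma constant, so any Puma upgrade may invalidate this.
require "puma/const"

Puma::Const.send(:remove_const, :FAST_TRACK_KA_TIMEOUT)
Puma::Const.const_set(:FAST_TRACK_KA_TIMEOUT, 0.002)
```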
If you are running Puma behind a load balancer, the load balancer should ideally be configured with one keep-alive connection per worker thread. This is the absolute best arrangement; additional connections will not help and will in fact make things worse.
Is that even possible to do in an AWS environment with an application load balancer? I couldn't figure out any setting for the number of keep-alive connections.
should never overload the number of workers so should never need queueing (the LB should be taking care of that)
Also not sure about this in the AWS context. The ALB will just forward requests downstream based on the routing algorithm. If queuing is disabled on Puma, then the requests reaching the containers will most likely error out?
I think https://github.com/puma/puma/issues/2311 is the same issue as this
Not sure. That one specifically talks about single-threaded mode, but we can see in the tests that this doesn't only affect single-thread configurations.
@dentarg I'll investigate your comments around tweaking settings later in the day and get back to you after conducting some tests.
You can see the behaviour today using a modified version of wrk
@ioquatix Is this https://github.com/ioquatix/wrk, or do you have more modifications that make it print closing connection after ...? I built your fork and it doesn't print that for me.
Ah, sorry, I have a habit of just modifying the wrk code to add extra output. Let me push a branch with that particular log.
Okay, I added the log in this branch: wrk (connection-close-log). LMK if you have trouble getting the same results.
If this is a useful feature, maybe we can capture it and report the number of connection closes in the final statistics.
Is that even possible to do in an AWS environment with an application load balancer? I couldn't figure out any setting for the number of keep-alive connections.
@amitsaxena As far as I know, ALB doesn't have an interface to configure the number of keep-alive connections.
I want to share a few ideas to try to tune the AWS infra further:
I'm not sure if Puma allows tuning the max keep-alive timeout. ALB has a client_keep_alive setting (https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/lb#client_keep_alive), and it might make sense to use a similar value (or two close values) for both Puma and ALB.
Try using least_outstanding_requests
See if you could enable ALB access logs. I'm interested in the various request|target|response_processing_time values. They might show us what exactly is causing the additional latency.
I tried different configurations and below are my observations:
max_fast_inline 0 results in slightly higher latencies than without keep-alive for me, but mostly works fine. The code says this about the setting:
The number of requests to attempt inline before sending a client back to the reactor to be subject to normal ordering.
I don't understand well enough what the downsides can be if we set it to 0.
queue_requests false works fine for a few short tests, but if I keep running tests for a longer duration it either crashes the test or gives me much higher latencies than before. Please see some examples below.
My understanding of this setting was that no queuing will happen in Puma if it is disabled. Does anyone know what happens if all Puma threads are occupied and a request reaches Puma? Will it error out because it cannot be queued? My guess is that it's a possible scenario, as the load balancer will keep sending requests to Puma based on the routing algorithm; it doesn't really know that Puma cannot queue, or when to stop sending requests and buffer them at the load balancer layer. Or are there other nuances involved here?
If I lower this value to 0.002, I get similar throughput as with -disable-keepalive
Interesting observation! I don't know what the potential downsides of setting this value too low might be.
Try using least_outstanding_requests
@MehdiZonjy That's what we used during the load tests.
See if you could enable ALB access logs. I'm interested in the various request|target|response_processing_time values. They might show us what exactly is causing the additional latency
We did look at them and the bulk of the time was being spent on the Fargate instance:
request_processing_time is close to 0 seconds. It is the time for the load balancer to receive the request and forward it to the target.
target_processing_time was greater than 4 seconds. It is the time taken by the task to fully process the request.
response_processing_time is close to 0 seconds. It is the time for the load balancer to get the response from the target and relay it back to the client.
So we are clearly able to establish that the bulk of the time is being spent on the Fargate container. We also used DataDog to record application performance metrics, and both the logs and the APM data clearly show that the average time spent processing the request is in the 2ms range, which matches what we get without keep-alive. All of the above makes us believe that the bulk of the time here is spent queuing in Puma before the request gets serviced, and the scenario is easily reproducible with keep-alive when hitting the Fargate container directly and skipping the ALB.
So IMHO, all Ruby apps on AWS using the application load balancer are performing much worse than they are capable of because of this bug. If you are running a Ruby app on AWS with a similar setup, it can be worth benchmarking your containers with and without keep-alive to see the difference, and sharing the data with us to validate this finding even further. 🙏
tune alb and webserver max timeouts
I am not certain this is going to help, since the problem here is that Puma with keep-alive isn't performant enough at high concurrency, so timeouts don't matter much I think.
@schneems please post your write-up from Slack here
@schneems please post your write-up from Slack here
It was posted at https://github.com/puma/puma/issues/3487#issuecomment-2356401488
@amitsaxena are you able to try out https://github.com/puma/puma/pull/3506?
Sorry for the delay @dentarg. It has been quite busy at work and on the personal front recently.
I can confirm that when re-running the old benchmarks on this puma branch, I am seeing significantly better (~30%) latencies with keep-alive enabled locally. Below are the results.
I can possibly do more comprehensive load tests later on AWS once a newer version of puma with this fix is out as it will be easier to get it past CI/CD pipelines and into our ECS+Fargate setup.
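(For anyone else wanting to test the unreleased fix, a Gemfile sketch along these lines should work; the branch name below is a placeholder, not the actual branch behind that PR.)

```ruby
# Gemfile sketch for trying an unreleased Puma branch; "some-branch" is a
# hypothetical placeholder for whatever branch backs the PR linked above.
source "https://rubygems.org"

gem "puma", git: "https://github.com/puma/puma.git", branch: "some-branch"
gem "rack" # the benchmarked app is a plain Rack app
```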
Describe the bug While benchmarking our app against a similar Heroku and AWS stack during an AWS migration, we encountered > 20X latencies on AWS compared to what we were getting on Heroku. Our AWS setup consists of an Application Load Balancer + ECS + Fargate. We have ensured there's no CPU or memory contention on the Fargate instance, and still Puma responds to requests in a painfully slow manner when hit with higher concurrency. After several days of debugging we have come to the conclusion that the load balancer keep-alive setting is causing the significant differences in latency. We were eventually able to reproduce the issue by hitting the public IP of the container directly with (HTTP/1.1) and without (HTTP/1.0) keep-alive. Below are a couple of load tests showing the difference in latencies:
Puma config:
Below are the values of the environment variables on the Fargate container:
To Reproduce I was able to reproduce the issue on a simple rack app specified in the bug template.
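For context, a minimal hello.ru along the lines of the one in the bug template might look like this (a sketch; the 11-byte "Hello World" body matches the Document Length reported by ab, but the exact file may differ):

```ruby
# hello.ru — minimal Rack app used for the benchmarks (sketch).
# Run with e.g.: WEB_CONCURRENCY=2 bundle exec puma hello.ru
run ->(env) { [200, { "content-type" => "text/plain" }, ["Hello World"]] }
```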
Below are the apache bench results.
Without keep-alive:
With keep-alive:
You can clearly see that the average latency is drastically higher with HTTP/1.1, i.e. with the keep-alive option.
Expected behavior Comparable latencies with or without keep-alive, or in fact better latencies with the keep-alive option.
Desktop (please complete the following information):