the-benchmarker / web-frameworks

Which is the fastest web framework?
MIT License

Use all computing units available #802

Closed. waghanza closed this issue 3 years ago.

waghanza commented 5 years ago

Hi,

This PR aims to implement (or check) full CPU usage; it closes #69

This feature COULD be:

Regards,

waghanza commented 5 years ago

I'm adding SO_REUSEPORT for the Ruby frameworks: https://github.com/puma/puma/pull/1712

dom96 commented 5 years ago

Jester through httpbeast uses SO_REUSEPORT implicitly as long as threads are enabled via --threads:on.

ohler55 commented 5 years ago

Agoo-C uses SO_REUSEPORT as well (src/agoo/bind.c:236)

waghanza commented 5 years ago

@johngcn @kataras are Go-based frameworks running with SO_REUSEPORT?

kataras commented 5 years ago

Not by default, but they can. Examples:

gqcn commented 5 years ago

GF currently uses net/http as its underlying HTTP server, and it's a pity that net/http only supports SO_REUSEADDR, not SO_REUSEPORT.

waghanza commented 5 years ago

@johngcn I thought SO_REUSEPORT was included in the stdlib

https://go-review.googlesource.com/c/go/+/37039/

Thanks @kataras for the tip. I think it's better to compare frameworks here with this feature enabled in all implementations.

gqcn commented 5 years ago

@waghanza Thank you for the tips. I see there are some workarounds for implementing SO_REUSEPORT in Go using syscall on some architectures. I'll do some tests in GF using SO_REUSEPORT.
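
For reference, a minimal sketch of one such workaround, assuming Go 1.11+ and the golang.org/x/sys/unix package (an illustrative example, not GF's shipped code): set SO_REUSEPORT through a net.ListenConfig Control callback, then hand the listener to net/http. With this, several copies of a process can bind the same port and the kernel spreads incoming connections across them (Linux 3.9+).

package main

import (
	"context"
	"net"
	"net/http"
	"syscall"

	"golang.org/x/sys/unix"
)

// reusePortListener binds addr with SO_REUSEPORT set, so several processes
// can listen on the same address at once. Illustrative sketch only.
func reusePortListener(addr string) (net.Listener, error) {
	lc := net.ListenConfig{
		Control: func(network, address string, c syscall.RawConn) error {
			var sockErr error
			if err := c.Control(func(fd uintptr) {
				sockErr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_REUSEPORT, 1)
			}); err != nil {
				return err
			}
			return sockErr
		},
	}
	return lc.Listen(context.Background(), "tcp", addr)
}

func main() {
	ln, err := reusePortListener(":3000")
	if err != nil {
		panic(err)
	}
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok\n"))
	})
	// net/http itself is unchanged; it just serves on the pre-configured listener.
	http.Serve(ln, nil)
}

Starting several copies of a process built like this on the same port is essentially what enabling SO_REUSEPORT amounts to in practice.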

waghanza commented 5 years ago

@ioquatix does falcon use SO_REUSEPORT?

ioquatix commented 5 years ago

@waghanza It does now! Use v0.22.0.

falcon serve --reuse-port - it's not the default for obvious reasons.

ioquatix commented 5 years ago

BTW, do you mind explaining what this is for?

I actually used to use SO_REUSEPORT, but I found that on Darwin it doesn't work as expected: the OS always routes connections to the first process that binds the port (though it's not an error to start other processes). The better model is to bind a socket once and share it over multiple threads/processes.
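
Roughly, in Go terms, that bind-once-and-share model looks like the sketch below (illustrative only; the -worker flag, port, and fd numbering are made up). The parent binds the port exactly once and hands the listener's file descriptor to forked workers, so a second server on the same port fails to bind instead of silently splitting traffic.

package main

import (
	"flag"
	"net"
	"net/http"
	"os"
	"os/exec"
	"runtime"
)

var worker = flag.Bool("worker", false, "run as a worker, inheriting the listener on fd 3")

func main() {
	flag.Parse()

	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok\n"))
	})

	if *worker {
		// Child: rebuild the listener from the inherited file descriptor.
		ln, err := net.FileListener(os.NewFile(3, "listener"))
		if err != nil {
			panic(err)
		}
		http.Serve(ln, handler)
		return
	}

	// Parent: bind the port exactly once. If something else already owns it,
	// this fails loudly instead of quietly sharing the port.
	ln, err := net.Listen("tcp", ":3000")
	if err != nil {
		panic(err)
	}
	f, err := ln.(*net.TCPListener).File()
	if err != nil {
		panic(err)
	}

	// Share the single listening socket with one worker per core.
	for i := 0; i < runtime.NumCPU(); i++ {
		cmd := exec.Command(os.Args[0], "-worker")
		cmd.ExtraFiles = []*os.File{f} // shows up as fd 3 in the child
		cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
		if err := cmd.Start(); err != nil {
			panic(err)
		}
	}
	select {} // keep the parent alive while the workers serve
}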

ioquatix commented 5 years ago

The problem with SO_REUSEPORT is you might accidentally run multiple (different) app servers on the same port without realising it. I found this a bit annoying to be honest. I'd prefer the server bombs out with an error (can't bind).

waghanza commented 5 years ago

@ioquatix sure, I know this breaks interoperability between OSes, but this project is about testing a bunch of frameworks.

To have equal / fair testing, it is recommended to have the same behaviour (or as close as we can get) for each implementation (having the same features enabled).

ioquatix commented 5 years ago

To have equal / fair testing, it is recommended to have the same behaviour (or as close as we can get) for each implementation (having the same features enabled).

What are you trying to equalise?

Surely you should just let each server maximise processor and memory usage? How it does that should be up to the server.

For example, falcon can benchmark both --threaded and --forked. For Ruby on the JVM you'd need to use --threaded, but for MRI you should use --forked (which is what it does by default anyway).

waghanza commented 5 years ago

What are you trying to equalise?

I think it's better to have SO_REUSEPORT everywhere, or nowhere, here ;-)

I'm trying to maximise efficiency (use the maximum of the server's capacity)

ioquatix commented 5 years ago

So, I agree with your basic idea: make everything equal.

But, unless I'm missing something important, I don't see how SO_REUSEPORT achieves this.

Using --reuse-port with falcon won't change performance in the slightest.

Whether you use that or not, it only binds one port, and then shares it over N processes/threads (N = processor core count).

The only difference is that you can start multiple falcon processes bound to the same port, which won't change performance but would make it confusing as to which process will serve which request.

Can you explain what you think using SO_REUSEPORT achieves? Are you planning to start N processes for single-process servers?

waghanza commented 5 years ago

@ioquatix Using SO_REUSEPORT can yield a performance increase (and probably reduce resource usage): https://github.com/puma/puma/pull/1712

As far as I understand, SO_REUSEPORT lets several processes use the same port, so the kernel doesn't have to make syscalls to attach a port to a process: fewer syscalls, so fewer resources used. But @OvermindDL1 will have a better explanation than mine.

ioquatix commented 5 years ago

@waghanza what you've said makes no sense at all, sorry; it simply doesn't line up with what SO_REUSEPORT does.

If anything, SO_REUSEPORT is a crappy way to achieve multi-process or multi-threaded servers. Maybe it's useful for rolling restarts. But it has nothing to do with improving the performance of a well-designed server (e.g. bind before fork or bind before threads).

Here is the evidence from my testing:

Without --reuse-port:

Running falcon with 128 concurrent connections...
Running 2s test @ http://127.0.0.1:9292/small
  8 threads and 128 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.66ms    2.90ms  38.84ms   92.00%
    Req/Sec     7.59k     2.56k   35.61k    95.65%
  121555 requests in 2.08s, 143.51MB read
Requests/sec:  58520.43
Transfer/sec:     69.09MB

With --reuse-port:

Running falcon with 128 concurrent connections...
Running 2s test @ http://127.0.0.1:9292/small
  8 threads and 128 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.53ms    2.40ms  41.16ms   93.89%
    Req/Sec     7.37k     1.38k   21.34k    93.90%
  120275 requests in 2.10s, 142.00MB read
Requests/sec:  57291.82
Transfer/sec:     67.64MB

waghanza commented 5 years ago

@ioquatix I can see a slight performance increase in req/sec. For latency, I'm not sure wrk is in fact the right tool ;-) see https://github.com/the-benchmarker/web-frameworks/issues/670

ioquatix commented 5 years ago

The req/s actually dropped slightly, and it's well within the margin of error, which I'd say on any given run is within 10%.

I agree latency computation is tricky. wrk isn't too bad.

waghanza commented 5 years ago

@ioquatix You have a point; you can suggest a tool to replace wrk if you know one :stuck_out_tongue_winking_eye:

ioquatix commented 5 years ago

I reran my tests under some different benchmarks (big response, small response, tiny response). There was no obvious difference. That being said, if the difference is small, it might not be obvious. Honestly, most web servers capable of more than 50k req/s are fast enough in production. Frankly speaking, from experience, as soon as you have any kind of database it won't matter whether you can serve 10k or 100k req/s; the application overhead is so much bigger that the web server overhead becomes irrelevant.

ioquatix commented 5 years ago

Here is a benchmark tool I wrote, but it's limited a bit by the performance of Ruby. It's interesting though because it tries to find the limits of concurrency for a server within a % tolerance of performance degradation: https://github.com/socketry/benchmark-http

The way it works is it makes 1 request, and measures performance/latency.

Then it makes 2 requests at the same time.

It does binary search to find the point at which your latency is impacted by a certain %. This is the point at which your server is slowing down multiple requests due to internal contention. It's an interesting metric because it even applies to slow servers over the internet - it's not about absolute latency but how latency is impacted as you increase concurrency.

Sometimes the fast server is the most badly affected: it can handle one or two concurrent requests very, very fast, but as you increase the number of concurrent requests the per-request latency increases sharply. It's a more interesting metric, because when you deal with real-world servers on the internet, 1-2ms of latency is irrelevant, but concurrency is more interesting, i.e. how many concurrent requests can the server handle before it degrades to 20% worse performance per request.
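
A rough Go sketch of that search (my own illustration, not benchmark-http's actual code; the target URL, sample size, and 20% tolerance are assumptions): measure the baseline latency at concurrency 1, then binary-search for the largest concurrency whose mean latency stays within the tolerance of that baseline.

package main

import (
	"fmt"
	"io"
	"net/http"
	"sync"
	"time"
)

// meanLatency issues roughly `requests` GETs against url, `concurrency` at a
// time, and returns the mean per-request latency.
func meanLatency(url string, concurrency, requests int) time.Duration {
	per := requests / concurrency
	if per < 1 {
		per = 1
	}
	var mu sync.Mutex
	var total time.Duration
	var wg sync.WaitGroup
	for i := 0; i < concurrency; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < per; j++ {
				start := time.Now()
				if resp, err := http.Get(url); err == nil {
					io.Copy(io.Discard, resp.Body) // drain so the connection is reused
					resp.Body.Close()
				}
				mu.Lock()
				total += time.Since(start)
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return total / time.Duration(per*concurrency)
}

func main() {
	const url = "http://127.0.0.1:8080/tester" // hypothetical target
	const tolerance = 1.2                      // allow 20% latency degradation

	base := meanLatency(url, 1, 200)
	lo, hi := 1, 1024 // assumed search range for concurrency
	for lo < hi {
		mid := (lo + hi + 1) / 2
		if float64(meanLatency(url, mid, 200)) <= float64(base)*tolerance {
			lo = mid // latency still acceptable, try higher concurrency
		} else {
			hi = mid - 1 // too much contention, back off
		}
	}
	fmt.Printf("server handled ~%d concurrent requests within %.0f%% of baseline latency\n",
		lo, (tolerance-1)*100)
}

The interesting output is the concurrency level itself rather than any absolute latency number, which is why the metric still makes sense for slow servers over the internet.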

ioquatix commented 5 years ago

Also, don't know if you know this but:

waghanza commented 5 years ago

@ioquatix I was thinking of https://k6.io/

ioquatix commented 5 years ago

That tool looks awesome. That being said, in order for a benchmark tool to be accurate, it must be coded in a low-level language IMHO. That's why I don't trust benchmark-http for raw performance metrics. For other things (like concurrency), it's fine.

ohler55 commented 5 years ago

There is also perfer. I wrote it to handle high-performance C servers. https://github.com/ohler55/perfer

proyb6 commented 5 years ago

I think this article on various benchmark tools may be worth reading. https://blog.loadimpact.com/open-source-load-testing-tool-benchmarks-v2

waghanza commented 5 years ago

@ohler55 making an ad for your own product :stuck_out_tongue:

ohler55 commented 5 years ago

Of course. It works for me so maybe it will for others.

OvermindDL1 commented 5 years ago

@proyb6 Fascinating article!

So wrk adds a touch of latency over apachebench. I didn't experience that when I tried here, but that was years ago, so apachebench has likely improved quite a bit since. ^.^

I find it odd they recommend one of the others for scripting abilities when wrk's LuaJIT handles all that fine as well.

Good to know wrk can still saturate a server best! ^.^

So if anything I'd use both apachebench and wrk: apachebench for testing response times (under light, heavy, and extreme load) and wrk for testing load throughput, based on that benchmark page at least.

There is also perfer. I wrote it to handle high performance C servers. ohler55/perfer

Hmm, let's try it.

First, it doesn't seem to support virtual hosts. Second, it doesn't seem to support HTTPS (a big, big thing to test for high-performance throughput, as no server should be running without TLS nowadays).

So, doing a quick test without HTTPS (which is fine here, since this web-frameworks project doesn't test it, even though I'd argue it should) on a 16-core server with nginx hosting a static page of "ok\n" at /tester:

$ # Curl's time
$ time curl http://127.0.0.1/tester --silent >/dev/null

real    0m0.012s
user    0m0.004s
sys     0m0.008s

$ # ab is single-threaded do note...
$ ab -c 200 -t 10 -n 5000000 -k http://127.0.0.1:80/tester    
This is ApacheBench, Version 2.3 <$Revision: 1826891 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 127.0.0.1 (be patient)
Completed 500000 requests
Completed 1000000 requests
Finished 1309292 requests

Server Software:        nginx/1.15.5
Server Hostname:        127.0.0.1
Server Port:            80

Document Path:          /tester
Document Length:        3 bytes

Concurrency Level:      200
Time taken for tests:   10.001 seconds
Complete requests:      1309292
Failed requests:        0
Keep-Alive requests:    1296294
Total transferred:      226442694 bytes
HTML transferred:       3927879 bytes
Requests per second:    130911.29 [#/sec] (mean)
Time per request:       1.528 [ms] (mean)
Time per request:       0.008 [ms] (mean, across all concurrent requests)
Transfer rate:          22110.52 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.3      0       7
Processing:     0    2   0.8      1      54
Waiting:        0    1   0.7      1      53
Total:          0    2   0.8      1      54
WARNING: The median and mean for the processing time are not within a normal deviation
        These results are probably not that reliable.
WARNING: The median and mean for the total time are not within a normal deviation
        These results are probably not that reliable.

Percentage of the requests served within a certain time (ms)
  50%      1
  66%      1
  75%      1
  80%      1
  90%      2
  95%      3
  98%      3
  99%      4
 100%     54 (longest request)

$ cd ../wrk && ./wrk -t 8 -c 200 -d 10 http://127.0.0.1:80/tester
Running 10s test @ http://127.0.0.1:80/tester
  8 threads and 200 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.59ms    2.44ms  28.41ms   86.88%
    Req/Sec    37.69k     6.43k   64.67k    68.25%
  3004180 requests in 10.03s, 495.50MB read
Requests/sec: 299614.17
Transfer/sec:     49.42MB

$ cd ../perfer && ./bin/perfer 127.0.0.1:80 --path tester -t 8 -c 20 -k -d 10
*-*-* error sending request: Broken pipe
*-*-* error sending request: Broken pipe
*-*-* error sending request: Broken pipe
*-*-* error sending request: Broken pipe
*-*-* error sending request: Broken pipe
... spam of the above! Had to Ctrl+c a minute later when it didn't stop

I tried perfer lots of different ways with lots of different options and I could never get it to work. Even in the most basic case:

$ cd ../perfer && ./bin/perfer 127.0.0.1:80 --path tester -t 1 -c 1 -k -d 1
127.0.0.1:80 did not respond to 15 requests.
Benchmarks for:
  URL:                127.0.0.1:80/tester
  Threads:            1
  Connections/thread: 1
  Duration:           1.0 seconds
  Keep-Alive:         true
Results:
  Throughput:         100 requests/second
  Latency:            0.313 +/-0.056 msecs (and stdev)

The only way I could get meaningful information was by disabling its keep-alive:

$ cd ../perfer && ./bin/perfer 127.0.0.1:80 --path tester -t 8 -c 200 -d 10
Benchmarks for:
  URL:                127.0.0.1:80/tester
  Threads:            8
  Connections/thread: 200
  Duration:           10.6 seconds
  Keep-Alive:         false
Results:
  Throughput:         74493 requests/second
  Latency:            10.862 +/-92.472 msecs (and stdev)

I did a number of other tests. On lower thread counts apachebench outperforms wrk in throughput (which makes sense since ab is single-threaded) but wrk was better on latency; on higher thread counts wrk blows apachebench away in throughput, over double that of ab at 15+ threads, with latency of only 1.71ms (wrk) compared to 1.50ms (ab).

And since benchmarking should be done against servers running in full multi-threaded, multi-process mode, I'd still vote for wrk well over ab. As for perfer, I'm unsure what is wrong with it... I'd like to give it a proper shakedown... Everything from ephemeral ports to timeouts to kernel TCP memory and so on is tuned here to handle significantly more connections than what I tested with (I've tested with significantly more with both ab and wrk just now with no issue).

Hmm, I have an old 6-core server on a gigabit network alongside the 16-core one (the network is a bit loaded down, so it won't reach a full gigabit). It's not exactly idle, but it should at least serve for a quick test against the 16-core machine. Git-cloning, building, etc., and running the tests again:

╰─➤  time curl http://192.168.1.89/tester -s>/dev/null
curl http://192.168.1.89/tester -s > /dev/null  0.00s user 0.00s system 60% cpu 0.013 total

╰─➤  cd ../wrk && ./wrk -t 6 -c 200 -d 10 http://192.168.1.89:80/tester
Running 10s test @ http://192.168.1.89:80/tester
  6 threads and 200 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     8.61ms    2.87ms 220.97ms   87.14%
    Req/Sec     3.80k   444.87    12.44k    95.67%
  227629 requests in 10.09s, 37.55MB read
Requests/sec:  22564.65
Transfer/sec:      3.72MB

╰─➤  ab -c 200 -t 10 -k -n 5000000 http://192.168.1.89:80/tester
This is ApacheBench, Version 2.3 <$Revision: 1706008 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 192.168.1.89 (be patient)
Finished 197286 requests

Server Software:        nginx/1.15.5
Server Hostname:        192.168.1.89
Server Port:            80

Document Path:          /tester
Document Length:        3 bytes

Concurrency Level:      200
Time taken for tests:   10.005 seconds
Complete requests:      197286
Failed requests:        0
Keep-Alive requests:    195455
Total transferred:      34121323 bytes
HTML transferred:       591858 bytes
Requests per second:    19719.47 [#/sec] (mean)
Time per request:       10.142 [ms] (mean)
Time per request:       0.051 [ms] (mean, across all concurrent requests)
Transfer rate:          3330.62 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   6.9      0    1010
Processing:     2   10   3.4     10     229
Waiting:        2   10   3.4     10     228
Total:          2   10   7.7     10    1021

Percentage of the requests served within a certain time (ms)
  50%     10
  66%     11
  75%     11
  80%     12
  90%     13
  95%     14
  98%     15
  99%     17
 100%   1021 (longest request)

╰─➤  cd ../perfer && ./bin/perfer 192.168.1.89:80 --path tester -t 15 -c 200 -d 10 -k
*-*-* error sending request: Broken pipe
*-*-* error sending request: Broken pipe
*-*-* error sending request: Broken pipe
*-*-* error sending request: Broken pipe
*-*-* error sending request: Broken pipe
... etc...

And this is the server I used to do my heavy testing on, so I know it can handle it. Well, let's try it against localhost on this thing (I've managed to hit almost 2 million concurrent connections on this 6-core box, funnily enough, though obviously not this time since the connections are far less concurrent) with nginx and all:

╰─➤  time curl http://127.0.0.1:8080/tester -s>/dev/null
curl http://127.0.0.1:8080/tester -s > /dev/null  0.00s user 0.00s system 65% cpu 0.012 total

╰─➤  cd ../wrk && ./wrk -t 6 -c 200 -d 10 http://127.0.0.1:8080/tester 
Running 10s test @ http://127.0.0.1:8080/tester
  6 threads and 200 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.41ms    3.04ms  87.36ms   96.37%
    Req/Sec    28.03k     3.49k   40.47k    74.67%
  1678229 requests in 10.08s, 276.80MB read
Requests/sec: 166456.82
Transfer/sec:     27.46MB

╰─➤  ab -c 200 -t 10 -k -n 5000000 http://127.0.0.1:8080/tester 
This is ApacheBench, Version 2.3 <$Revision: 1706008 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 127.0.0.1 (be patient)
Completed 500000 requests
Finished 667517 requests

Server Software:        nginx/1.10.3
Server Hostname:        127.0.0.1
Server Port:            8080

Document Path:          /tester
Document Length:        3 bytes

Concurrency Level:      200
Time taken for tests:   10.003 seconds
Complete requests:      667517
Failed requests:        0
Keep-Alive requests:    660940
Total transferred:      115447724 bytes
HTML transferred:       2002554 bytes
Requests per second:    66734.28 [#/sec] (mean)
Time per request:       2.997 [ms] (mean)
Time per request:       0.015 [ms] (mean, across all concurrent requests)
Transfer rate:          11271.25 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.5      0      11
Processing:     0    3   1.8      3     133
Waiting:        0    3   1.7      3     133
Total:          0    3   1.9      3     135

Percentage of the requests served within a certain time (ms)
  50%      3
  66%      3
  75%      3
  80%      3
  90%      3
  95%      4
  98%      6
  99%      8
 100%    135 (longest request)

╰─➤  cd ../perfer && ./bin/perfer 127.0.0.1:8080 --path tester -t 8 -c 200 -d 10 -k
*-*-* error sending request: Broken pipe
*-*-* error sending request: Broken pipe
*-*-* error sending request: Broken pipe
*-*-* error sending request: Broken pipe
*-*-* error sending request: Broken pipe
*-*-* error sending request: Broken pipe
*-*-* error sending request: Broken pipe
*-*-* error sending request: Broken pipe
... etc...

So it doesn't seem to be a configuration on the other server at fault then (wrk did better in every way over ab on this server as well interesting...)...

Since I have ruby installed on this 6-core computer, let's try that http-benchmark thing too:

╰─➤  benchmark-http concurrency http://127.0.0.1:8080/tester                                                                      1 ↵
I am going to benchmark http://127.0.0.1:8080/tester...
I am running 1 asynchronous tasks that will each make sequential requests...
 0.12s: <Async::Task:0x80caf0 failed>
      |  NoMethodError: undefined method `sum' for [0.0003895489498972893, 0.00043974118307232857]:Array
      |  → /var/lib/gems/2.3.0/gems/benchmark-http-0.5.0/lib/benchmark/http/statistics.rb:75 in `average'
      |    /var/lib/gems/2.3.0/gems/benchmark-http-0.5.0/lib/benchmark/http/statistics.rb:86 in `variance'
      |    /var/lib/gems/2.3.0/gems/benchmark-http-0.5.0/lib/benchmark/http/statistics.rb:94 in `standard_deviation'
      |    /var/lib/gems/2.3.0/gems/benchmark-http-0.5.0/lib/benchmark/http/statistics.rb:100 in `standard_error'
      |    /var/lib/gems/2.3.0/gems/benchmark-http-0.5.0/lib/benchmark/http/statistics.rb:159 in `confident?'
      |    /var/lib/gems/2.3.0/gems/benchmark-http-0.5.0/lib/benchmark/http/statistics.rb:144 in `sample'
      |    /var/lib/gems/2.3.0/gems/benchmark-http-0.5.0/lib/benchmark/http/command/concurrency.rb:57 in `block (2 levels) in measure_performance'
      |    /var/lib/gems/2.3.0/gems/async-1.15.1/lib/async/task.rb:199 in `block in make_fiber'
/var/lib/gems/2.3.0/gems/benchmark-http-0.5.0/lib/benchmark/http/statistics.rb:75:in `average': undefined method `sum' for [0.0003895489498972893, 0.00043974118307232857]:Array (NoMethodError)
        from /var/lib/gems/2.3.0/gems/benchmark-http-0.5.0/lib/benchmark/http/statistics.rb:86:in `variance'
        from /var/lib/gems/2.3.0/gems/benchmark-http-0.5.0/lib/benchmark/http/statistics.rb:94:in `standard_deviation'
        from /var/lib/gems/2.3.0/gems/benchmark-http-0.5.0/lib/benchmark/http/statistics.rb:100:in `standard_error'
        from /var/lib/gems/2.3.0/gems/benchmark-http-0.5.0/lib/benchmark/http/statistics.rb:159:in `confident?'
        from /var/lib/gems/2.3.0/gems/benchmark-http-0.5.0/lib/benchmark/http/statistics.rb:144:in `sample'
        from /var/lib/gems/2.3.0/gems/benchmark-http-0.5.0/lib/benchmark/http/command/concurrency.rb:57:in `block (2 levels) in measure_performance'
        from /var/lib/gems/2.3.0/gems/async-1.15.1/lib/async/task.rb:199:in `block in make_fiber'

Well, what fresh hell is this horror? o.O

╰─➤  ruby --version
ruby 2.3.1p112 (2016-04-26) [x86_64-linux-gnu]

Maybe this version of Ruby is too old and the program doesn't give a decent error message saying so. Let's update with asdf; now it's:

╰─➤  ruby --version  
ruby 2.6.0p0 (2018-12-25 revision 66547) [x86_64-linux]

Released a month ago, seems recent enough; trying again:

╰─➤  benchmark-http concurrency http://127.0.0.1:8080/tester
I am going to benchmark http://127.0.0.1:8080/tester...
I am running 1 asynchronous tasks that will each make sequential requests...
I made 3037 requests in 1.9s. The per-request latency was 625.026µs. That's 1599.933452566841 asynchronous requests/second.
                  Variance: 0.119µs
        Standard Deviation: 344.367µs
            Standard Error: 6.248831942257431e-06
I am running 2 asynchronous tasks that will each make sequential requests...
I made 3632 requests in 1.2s. The per-request latency was 672.841µs. That's 1486.834535169551 asynchronous requests/second.
                  Variance: 0.164µs
        Standard Deviation: 405.391µs
            Standard Error: 6.726688103003247e-06
I am running 4 asynchronous tasks that will each make sequential requests...
I made 5913 requests in 2.2s. The per-request latency was 1.49ms. That's 1474.910966905804 asynchronous requests/second.
                  Variance: 1.316µs
        Standard Deviation: 1.15ms
            Standard Error: 1.4917858976199e-05
I am running 3 asynchronous tasks that will each make sequential requests...
I made 5893 requests in 2.6s. The per-request latency was 1.33ms. That's 1371.0889319941182 asynchronous requests/second.
                  Variance: 1.046µs
        Standard Deviation: 1.02ms
            Standard Error: 1.3321248721848672e-05
Your server can handle 2 concurrent requests.
At this level of concurrency, requests have ~1.08x higher latency.

Well... that's significantly off... Only 5893 requests in 2.6s, whereas wrk gets the following with both 2 and 20 connections on just two threads, and, just for good measure, with only 1 connection on 1 thread:

╰─➤  cd ../wrk && ./wrk -t 2 -c 2 -d 10 http://127.0.0.1:8080/tester 
Running 10s test @ http://127.0.0.1:8080/tester
  2 threads and 2 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    59.60us  224.36us  13.42ms   99.13%
    Req/Sec    20.15k     2.49k   26.72k    62.38%
  404824 requests in 10.10s, 66.77MB read
Requests/sec:  40082.74
Transfer/sec:      6.61MB

╰─➤  cd ../wrk && ./wrk -t 2 -c 20 -d 10 http://127.0.0.1:8080/tester 
Running 10s test @ http://127.0.0.1:8080/tester
  2 threads and 20 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   404.29us    1.39ms  35.53ms   94.97%
    Req/Sec    77.51k    12.19k   87.50k    91.04%
  1549508 requests in 10.10s, 255.57MB read
Requests/sec: 153415.13
Transfer/sec:     25.30MB

╰─➤  cd ../wrk && ./wrk -t 1 -c 1 -d 10 http://127.0.0.1:8080/tester
Running 10s test @ http://127.0.0.1:8080/tester
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    59.74us   81.40us   4.13ms   97.16%
    Req/Sec    17.66k     1.91k   22.29k    65.35%
  177349 requests in 10.10s, 29.25MB read
Requests/sec:  17559.47
Transfer/sec:      2.90MB

And of course nginx still isn't the fastest thing out there, but the fact that it can be hit ~17,559 times a second on a single connection on a single thread while benchmark-http can't even hit 2.5k in a single second with multiple connections is questionable... o.O

ohler55 commented 5 years ago

Wow, nice writeup. Looks like I have some digging to do to figure out what happened with perfer.

OvermindDL1 commented 5 years ago

While I'm at it, let's grab and test k6 too. A quick test with it (20 concurrent connections like at the end just above, keep-alive enabled, etc.; like ab, I don't see a way to control threads, though ab is single-threaded anyway, and I think Go scales across cores, so this should already be maximally multi-threaded?):

╰─➤  k6 run k6.js --duration 10s --vus 20

          /\      |‾‾|  /‾‾/  /‾/                                                                                                     
     /\  /  \     |  |_/  /  / /                                                                                                      
    /  \/    \    |      |  /  ‾‾\                                                                                                    
   /          \   |  |‾\  \ | (_) |                                                                                                   
  / __________ \  |__|  \__\ \___/ .io                                                                                                

  execution: local
     output: -
     script: k6.js

    duration: 10s, iterations: -
         vus: 20,  max: 20

    done [==========================================================] 10s / 10s

    data_received..............: 45 MB  4.5 MB/s
    data_sent..................: 22 MB  2.2 MB/s
    http_req_blocked...........: avg=5.38µs   min=1.11µs  med=1.93µs   max=9.43ms  p(90)=2.81µs   p(95)=3.31µs  
    http_req_connecting........: avg=2.02µs   min=0s      med=0s       max=9.23ms  p(90)=0s       p(95)=0s      
    http_req_duration..........: avg=222.05µs min=47.39µs med=146.05µs max=19.85ms p(90)=406.82µs p(95)=592.62µs
    http_req_receiving.........: avg=18.93µs  min=6.47µs  med=12.11µs  max=14.26ms p(90)=21.54µs  p(95)=28.56µs 
    http_req_sending...........: avg=16.87µs  min=6.25µs  med=9.2µs    max=19.77ms p(90)=18.66µs  p(95)=24.37µs 
    http_req_tls_handshaking...: avg=0s       min=0s      med=0s       max=0s      p(90)=0s       p(95)=0s      
    http_req_waiting...........: avg=186.24µs min=28.53µs med=115.43µs max=12.1ms  p(90)=362.62µs p(95)=523.89µs
    http_reqs..................: 258409 25840.643325/s
    iteration_duration.........: avg=319.61µs min=98.05µs med=225.71µs max=19.91ms p(90)=540.27µs p(95)=794.08µs
    iterations.................: 258405 25840.243329/s
    vus........................: 20     min=20 max=20
    vus_max....................: 20     min=20 max=20

So initially I see it is hitting 25840 req/s, whereas wrk was hitting 153415.13 req/s with the same 20 concurrent connections... Let's try higher:

╰─➤  k6 run k6.js --duration 10s --vus 200

          /\      |‾‾|  /‾‾/  /‾/                                                                                                     
     /\  /  \     |  |_/  /  / /                                                                                                      
    /  \/    \    |      |  /  ‾‾\                                                                                                    
   /          \   |  |‾\  \ | (_) |                                                                                                   
  / __________ \  |__|  \__\ \___/ .io                                                                                                

  execution: local
     output: -
     script: k6.js

    duration: 10s, iterations: -
         vus: 200, max: 200

    done [==========================================================] 10s / 10s

    data_received..............: 40 MB  4.0 MB/s
    data_sent..................: 20 MB  2.0 MB/s
    http_req_blocked...........: avg=15.29µs min=1.2µs    med=2.31µs   max=15.6ms  p(90)=3.3µs   p(95)=3.85µs 
    http_req_connecting........: avg=11.31µs min=0s       med=0s       max=15.1ms  p(90)=0s      p(95)=0s     
    http_req_duration..........: avg=1.38ms  min=55.93µs  med=771.24µs max=34.31ms p(90)=3.21ms  p(95)=4.55ms 
    http_req_receiving.........: avg=23.27µs min=6.81µs   med=11.25µs  max=31.83ms p(90)=21.86µs p(95)=31.97µs
    http_req_sending...........: avg=40.19µs min=6.74µs   med=10.36µs  max=32.8ms  p(90)=20.91µs p(95)=29.24µs
    http_req_tls_handshaking...: avg=0s      min=0s       med=0s       max=0s      p(90)=0s      p(95)=0s     
    http_req_waiting...........: avg=1.31ms  min=34.66µs  med=725.4µs  max=33.25ms p(90)=3.1ms   p(95)=4.42ms 
    http_reqs..................: 233824 23378.624065/s
    iteration_duration.........: avg=1.5ms   min=112.08µs med=875.68µs max=40.49ms p(90)=3.4ms   p(95)=4.8ms  
    iterations.................: 233821 23378.324113/s
    vus........................: 200    min=200 max=200
    vus_max....................: 200    min=200 max=200

Wow, that took a long time just to load up the engines; I'm guessing spinning up new JavaScript interpreters is not too fast... >.> Still only about 23378 req/s again.

So far I'm still leaning towards wrk?

OvermindDL1 commented 5 years ago

Let's try wrk2 while I'm at it as well:

╰─➤  cd ../wrk2 && ./wrk -t 6 -c 20 -d 20 -R 999999 http://127.0.0.1:8080/tester                                                  1 ↵
Running 20s test @ http://127.0.0.1:8080/tester
  6 threads and 20 connections
  Thread calibration: mean lat.: 4472.116ms, rate sampling interval: 15925ms
  Thread calibration: mean lat.: 4414.885ms, rate sampling interval: 15802ms
  Thread calibration: mean lat.: 4450.223ms, rate sampling interval: 15810ms
  Thread calibration: mean lat.: 4418.796ms, rate sampling interval: 15843ms
  Thread calibration: mean lat.: 4394.733ms, rate sampling interval: 15654ms
  Thread calibration: mean lat.: 4511.955ms, rate sampling interval: 15908ms
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    13.27s     2.49s   17.38s    58.47%
    Req/Sec       -nan      -nan   0.00      0.00%
  2670362 requests in 20.00s, 440.44MB read
Requests/sec: 133524.10
Transfer/sec:     22.02MB

And normal wrk with the same settings:

╰─➤  cd ../wrk && ./wrk -t 6 -c 20 -d 20 http://127.0.0.1:8080/tester
Running 20s test @ http://127.0.0.1:8080/tester
  6 threads and 20 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   406.73us    1.10ms  30.97ms   92.55%
    Req/Sec    27.15k     4.04k   38.84k    68.08%
  3243485 requests in 20.02s, 534.97MB read
Requests/sec: 162018.91
Transfer/sec:     26.72MB

I'm thinking wrk2 is a bit broken... How can it have multiple-second latency as average with 133524.10 req/s... o.O

OvermindDL1 commented 5 years ago

Tested vegeta as well, managed to get it up to 39250.47 req/sec before it crumbled under the load (process froze for multiple minutes, had to kill the process).

OvermindDL1 commented 5 years ago

Tested welle too:

╰─➤  ./target/release/welle -c 20 -n 500000 http://127.0.0.1:8080/tester
Total Requests: 500000
Concurrency Count: 20
Total Completed Requests: 500000
Total Errored Requests: 0
Total 5XX Requests: 0

Total Time Taken: 21.564045318s
Avg Time Taken: 43.128µs
Total Time In Flight: 248.762968408s
Avg Time In Flight: 497.525µs

Percentage of the requests served within a certain time:
50%: 672.936µs
66%: 730.705µs
75%: 779.781µs
80%: 820.002µs
90%: 977.444µs
95%: 1.168457ms
99%: 2.119692ms
100%: 47.040756ms

So it averaged ~23186.75 req/sec, which is still well below the ~133506.72 req/sec from wrk with the same settings (or ~82639.49 req/sec with only a single thread in wrk, or ~66755.43 req/sec with apachebench)...

waghanza commented 5 years ago

@OvermindDL1 wrk seems to be accurate for req/s, but what about latency?

OvermindDL1 commented 5 years ago

@OvermindDL1 wrk seems to be accurate for req/s, but what about latency?

Seems fairly accurate from the tests I did. Other tools give more useful information at lower concurrency levels, but nothing I tested today was able to get anywhere near wrk at higher concurrency counts without either dying off or exploding to very high times.

Some samples I just ran again here from the 6-core to the 16-core (to add additional latency for testing):

Wrk:

╰─➤  cd ../wrk && ./wrk -t 1 -c 20 -d 20 http://192.168.1.89:80/tester  
Running 20s test @ http://192.168.1.89:80/tester
  1 threads and 20 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     0.85ms  242.26us   5.04ms   74.15%
    Req/Sec    22.70k   460.65    23.94k    71.64%
  453954 requests in 20.10s, 74.87MB read
Requests/sec:  22584.78
Transfer/sec:      3.73MB

K6:

╰─➤  k6 run k6.js --duration 10s --vus 200

          /\      |‾‾|  /‾‾/  /‾/                                                                                                     
     /\  /  \     |  |_/  /  / /                                                                                                      
    /  \/    \    |      |  /  ‾‾\                                                                                                    
   /          \   |  |‾\  \ | (_) |                                                                                                   
  / __________ \  |__|  \__\ \___/ .io                                                                                                

  execution: local
     output: -
     script: k6.js

    duration: 10s, iterations: -
         vus: 200, max: 200

    done [==========================================================] 10s / 10s

    data_received..............: 32 MB  3.2 MB/s
    data_sent..................: 16 MB  1.6 MB/s
    http_req_blocked...........: avg=140.12µs min=1.21µs   med=2.37µs  max=1s       p(90)=3.37µs  p(95)=4.16µs 
    http_req_connecting........: avg=136.24µs min=0s       med=0s      max=1s       p(90)=0s      p(95)=0s     
    http_req_duration..........: avg=7.29ms   min=225.33µs med=7.45ms  max=231.34ms p(90)=10.48ms p(95)=11.38ms
    http_req_receiving.........: avg=23.92µs  min=8.22µs   med=14.23µs max=18.19ms  p(90)=27.55µs p(95)=42.9µs 
    http_req_sending...........: avg=29.6µs   min=6.95µs   med=11.77µs max=43.43ms  p(90)=22.12µs p(95)=29.79µs
    http_req_tls_handshaking...: avg=0s       min=0s       med=0s      max=0s       p(90)=0s      p(95)=0s     
    http_req_waiting...........: avg=7.23ms   min=169.64µs med=7.41ms  max=231.27ms p(90)=10.43ms p(95)=11.32ms
    http_reqs..................: 185679 18567.762706/s
    iteration_duration.........: avg=7.54ms   min=343.66µs med=7.57ms  max=1.01s    p(90)=10.72ms p(95)=11.69ms
    iterations.................: 185679 18567.762706/s
    vus........................: 200    min=200 max=200
    vus_max....................: 200    min=200 max=200

ApacheBench:

╰─➤  ab -c 200 -t 20 -k -n 5000000 http://192.168.1.89:80/tester
This is ApacheBench, Version 2.3 <$Revision: 1706008 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 192.168.1.89 (be patient)
Finished 392775 requests

Server Software:        nginx/1.15.5
Server Hostname:        192.168.1.89
Server Port:            80

Document Path:          /tester
Document Length:        3 bytes

Concurrency Level:      200
Time taken for tests:   20.000 seconds
Complete requests:      392775
Failed requests:        0
Keep-Alive requests:    389001
Total transferred:      67931373 bytes
HTML transferred:       1178328 bytes
Requests per second:    19638.35 [#/sec] (mean)
Time per request:       10.184 [ms] (mean)
Time per request:       0.051 [ms] (mean, across all concurrent requests)
Transfer rate:          3316.89 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0  10.3      0    1013
Processing:     1   10   4.0     10     231
Waiting:        1   10   4.0     10     231
Total:          2   10  11.1     10    1024

Percentage of the requests served within a certain time (ms)
  50%     10
  66%     11
  75%     11
  80%     12
  90%     13
  95%     14
  98%     15
  99%     17
 100%   1024 (longest request)

Welle:

╰─➤  ./target/release/welle -c 20 -n 500000 http://192.168.1.89:80/tester
Total Requests: 500000
Concurrency Count: 20
Total Completed Requests: 500000
Total Errored Requests: 0
Total 5XX Requests: 0

Total Time Taken: 28.261249948s
Avg Time Taken: 56.522µs
Total Time In Flight: 442.984907521s
Avg Time In Flight: 885.969µs

Percentage of the requests served within a certain time:
50%: 1.2383ms
66%: 1.388187ms
75%: 1.495586ms
80%: 1.570883ms
90%: 1.806927ms
95%: 2.068901ms
99%: 2.874676ms
100%: 60.555964ms

Base ping:

╰─➤  ping 192.168.1.89 -c 2
PING 192.168.1.89 (192.168.1.89) 56(84) bytes of data.
64 bytes from 192.168.1.89: icmp_seq=1 ttl=64 time=0.496 ms
64 bytes from 192.168.1.89: icmp_seq=2 ttl=64 time=0.492 ms

--- 192.168.1.89 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.492/0.494/0.496/0.002 ms

Overall wrk was the most consistent and reported the lowest numbers, giving 0.85ms avg / 242.26us stdev / 5.04ms max / 74.15%, which is about 0.36ms above the base ping on average. k6 had an average iteration duration of 7.54ms, with the HTTP request duration alone at 7.29ms on average, significantly higher than what both wrk and ab show. ab reported 10.184ms per request (mean) and 0.051ms per request (mean, across all concurrent requests); the second figure is just the total test time divided by the total number of requests, so multiplying it by the concurrency level (0.051 x 200 ≈ 10.2ms) simply recovers the per-request mean rather than giving an independent measurement, and neither figure separates out the ping. Welle had 1.2383ms at its 50th percentile, which is about double the ping time.

ab might be the most accurate at that level, depending on what is actually being calculated, though wrk reported the lowest latency outright and also pumped through the most requests.

ioquatix commented 5 years ago

Thanks for the detailed write-up; there is a bug / some unfortunate upstream changes affecting benchmark-http. I'll see if I can sort it out.

ioquatix commented 5 years ago

Okay, found the issue. But I have a school meeting with my kids. I will fix it later.

OvermindDL1 commented 5 years ago

Okay, found the issue. But I have a school meeting with my kids. I will fix it later.

Awesome! Tell me when it's updated and what commands I need to run to update it locally (I'm not a Ruby user, so I'm unfamiliar with its ecosystem) and I'll test the 2 servers again. :-)

ioquatix commented 5 years ago

Okay, I released benchmark-http 0.6.0 and I also set the minimum Ruby version (2.4). However, for good performance, run it on Linux.

ioquatix commented 5 years ago

I also need to do some more perf comparisons with wrk to see how it can be better. On my laptop wrk was getting about 8000 req/s but benchmark-http was around 5000.

waghanza commented 5 years ago

@ioquatix a laptop generally doesn't have the same CPU as a server (a server is usually a Xeon)

OvermindDL1 commented 5 years ago

Even on a laptop that is abysmal, and it's probably not wrk's fault but rather a question of what server is being tested.

If you are testing, say, a Ruby server, then it will be abysmal in general, as that's just the nature of Ruby, regardless of the tool you use. You need to test against a fast server; even nginx is not the fastest (but it's sufficient here to see what can saturate it), and it will be significantly faster than anything else 99.99%+ of people will ever use. So, testing... :-)

╰─➤  benchmark-http concurrency http://127.0.0.1:8080/tester
I am going to benchmark http://127.0.0.1:8080/tester...
I am running 1 asynchronous tasks that will each make sequential requests...
I made 3305 requests in 865.48ms. The per-request latency was 261.869µs. That's 3818.7074071129277 asynchronous requests/second.
                  Variance: 0.023µs
        Standard Deviation: 150.522µs
            Standard Error: 2.618263603678451e-06
I am running 2 asynchronous tasks that will each make sequential requests...
I made 7 requests in 828.371µs. The per-request latency was 236.677µs. That's 7266.325628609022 asynchronous requests/second.
                  Variance: 0.000µs
        Standard Deviation: 5.398µs
            Standard Error: 2.04042306253398e-06
I am running 4 asynchronous tasks that will each make sequential requests...
I made 1033 requests in 189.02ms. The per-request latency was 731.914µs. That's 4916.925140612694 asynchronous requests/second.
                  Variance: 0.055µs
        Standard Deviation: 234.489µs
            Standard Error: 7.295796636094857e-06
I am running 3 asynchronous tasks that will each make sequential requests...
I made 1328 requests in 255.88ms. The per-request latency was 578.041µs. That's 4423.081937596428 asynchronous requests/second.
                  Variance: 0.044µs
        Standard Deviation: 210.252µs
            Standard Error: 5.769536262398735e-06
Your server can handle 2 concurrent requests.
At this level of concurrency, requests have ~0.9x higher latency.

That was the best result, chosen after multiple attempts. It's not consistent, i.e. it gives wildly differing results each time; for example, here was the worst result:

╰─➤  benchmark-http concurrency http://127.0.0.1:8080/tester
I am going to benchmark http://127.0.0.1:8080/tester...
I am running 1 asynchronous tasks that will each make sequential requests...
I made 4006 requests in 1.1s. The per-request latency was 268.569µs. That's 3723.4377404288166 asynchronous requests/second.
                  Variance: 0.029µs
        Standard Deviation: 169.958µs
            Standard Error: 2.6852608427339045e-06
I am running 2 asynchronous tasks that will each make sequential requests...
I made 1596 requests in 297.92ms. The per-request latency was 373.329µs. That's 4091.541043575783 asynchronous requests/second.
                  Variance: 0.022µs
        Standard Deviation: 149.025µs
            Standard Error: 3.7302991364257682e-06
Your server can handle 1 concurrent requests.
At this level of concurrency, requests have ~1.0x higher latency.

And its values are about on par with what I'm used to seeing from Ruby apps. Here's wrk again just now with just a single connection, to use as a comparison baseline:

╰─➤  cd ../wrk && ./wrk -t 1 -c 1 -d 4 http://127.0.0.1:8080/tester
Running 4s test @ http://127.0.0.1:8080/tester
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    54.69us   54.96us   2.34ms   96.47%
    Req/Sec    18.59k     2.26k   23.27k    68.29%
  75708 requests in 4.10s, 12.49MB read
Requests/sec:  18468.65
Transfer/sec:      3.05MB

And here's a fully saturating run:

╰─➤  cd ../wrk && ./wrk -t 5 -c 60 -d 4 http://127.0.0.1:8080/tester
Running 4s test @ http://127.0.0.1:8080/tester
  5 threads and 60 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.30ms    5.31ms  83.51ms   96.82%
    Req/Sec    32.34k     5.87k   47.96k    84.50%
  644077 requests in 4.03s, 106.23MB read
Requests/sec: 159790.94
Transfer/sec:     26.36MB

And here's apachebench (a single-threaded app):

╰─➤  ab -c 60 -t 4 -k -n 5000000 http://127.0.0.1:8080/tester     
This is ApacheBench, Version 2.3 <$Revision: 1706008 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 127.0.0.1 (be patient)
Finished 269704 requests

Server Software:        nginx/1.10.3
Server Hostname:        127.0.0.1
Server Port:            8080

Document Path:          /tester
Document Length:        3 bytes

Concurrency Level:      60
Time taken for tests:   4.000 seconds
Complete requests:      269704
Failed requests:        0
Keep-Alive requests:    267038
Total transferred:      46645798 bytes
HTML transferred:       809118 bytes
Requests per second:    67423.07 [#/sec] (mean)
Time per request:       0.890 [ms] (mean)
Time per request:       0.015 [ms] (mean, across all concurrent requests)
Transfer rate:          11387.64 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       4
Processing:     0    1   1.0      1      67
Waiting:        0    1   1.0      1      67
Total:          0    1   1.0      1      67

Percentage of the requests served within a certain time (ms)
  50%      1
  66%      1
  75%      1
  80%      1
  90%      1
  95%      2
  98%      2
  99%      3
 100%     67 (longest request)

I just don't think that Ruby is capable of such testing, nor is any other interpreted language. You'd need to write benchmark-http in C, C++, Go, or Rust, maybe Java or .NET Core (but then you have to take JIT warmup time into account), for it to be accurate at all. And don't forget to use keep-alive, as most HTTP/1.1 browsers (and all HTTP/2 clients) will be using it (though many, but not all, API clients oddly don't); otherwise you are mostly testing the OS TCP networking stack's connection setup time. But as it stands, I think benchmark-http is mostly testing Ruby's own throughput right now.
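
To make the keep-alive point concrete, a small Go sketch (the target URL and request count are made-up assumptions) that times the same sequential request loop with and without connection reuse; without keep-alive the measurement is dominated by TCP connection setup rather than the server itself.

package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// timeRequests issues n sequential GETs with the given client and returns the
// total wall-clock time taken.
func timeRequests(client *http.Client, url string, n int) time.Duration {
	start := time.Now()
	for i := 0; i < n; i++ {
		resp, err := client.Get(url)
		if err != nil {
			panic(err)
		}
		io.Copy(io.Discard, resp.Body) // drain so the connection can be reused
		resp.Body.Close()
	}
	return time.Since(start)
}

func main() {
	const url = "http://127.0.0.1:8080/tester" // hypothetical local nginx target
	const n = 1000

	keepAlive := &http.Client{} // default transport reuses connections
	oneShot := &http.Client{Transport: &http.Transport{DisableKeepAlives: true}}

	fmt.Println("with keep-alive:   ", timeRequests(keepAlive, url, n))
	fmt.Println("without keep-alive:", timeRequests(oneShot, url, n))
}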

ioquatix commented 5 years ago

Thanks for all the details.

I agree, it should be rewritten in a compiled language.

If you run it against a real web server over the network, I think you'll generally find you get better consistency.

That being said, I agree with your conclusions.

waghanza commented 5 years ago

@ioquatix you could use Crystal :stuck_out_tongue:

OvermindDL1 commented 5 years ago

@waghanza Crystal is sadly also not parallel, and it has a GC that will skew the results with occasional variance. ^.^; Though concurrency is a lot easier to do in it, that doesn't help with parallel work or give the predictability of running without a GC.

That being said, if you run it against a real web server over the network, I think you'll generally find you get better consistency.

The results are much worse when I do it that way... ^.^

╰─➤  benchmark-http concurrency http://192.168.1.89/tester                                                                        1 ↵
I am going to benchmark http://192.168.1.89/tester...
I am running 1 asynchronous tasks that will each make sequential requests...
I made 2006 requests in 1.3s. The per-request latency was 623.476µs. That's 1603.9119558469351 asynchronous requests/second.
                  Variance: 0.078µs
        Standard Deviation: 279.171µs
            Standard Error: 6.233101459918445e-06
I am running 2 asynchronous tasks that will each make sequential requests...
I made 1785 requests in 428.33ms. The per-request latency was 479.919µs. That's 3218.6768911803174 asynchronous requests/second.
                  Variance: 0.041µs
        Standard Deviation: 202.625µs
            Standard Error: 4.7959525300267085e-06
I am running 4 asynchronous tasks that will each make sequential requests...
I made 1251 requests in 249.68ms. The per-request latency was 798.341µs. That's 4470.735167561054 asynchronous requests/second.
                  Variance: 0.079µs
        Standard Deviation: 281.468µs
            Standard Error: 7.957940519694209e-06
I am running 3 asynchronous tasks that will each make sequential requests...
I made 1751 requests in 433.90ms. The per-request latency was 743.398µs. That's 3392.178350844571 asynchronous requests/second.
                  Variance: 0.097µs
        Standard Deviation: 310.650µs
            Standard Error: 7.423843534806914e-06
Your server can handle 3 concurrent requests.
At this level of concurrency, requests have ~1.19x higher latency.

So it reports 3392 req/s, and yet:

╰─➤  cd ../wrk && ./wrk -t 5 -c 20 -d 4 http://192.168.1.89/tester  
Running 4s test @ http://192.168.1.89/tester
  5 threads and 20 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     0.88ms  284.67us   7.75ms   78.91%
    Req/Sec     4.52k   139.72     4.87k    70.73%
  92185 requests in 4.10s, 15.20MB read
Requests/sec:  22483.60
Transfer/sec:      3.71MB

22483.60 req/s here. It also reports a significantly higher latency than wrk does (there is no way a remote server with <0.6ms ping is responding as slowly as the 433.90ms benchmark-http is reporting; wrk reports a 0.88ms average response).

In fact, just for comparison, let me whip up a quick HTTP-call loop in Elixir (another interpreted language, well known for its poor math performance though decent IO, and a purely immutable functional language, i.e. lots of memory allocation and other inefficiencies). Done, and here's a result from the REPL:

iex(5)> BenchmarkHttp.CLI.main(["concurrency", "-d", "5", "-c", "20", "http://192.168.1.89/tester"])

Request Results
        Server:         ["nginx/1.15.5 (Ubuntu)"]
        Date:           ["Tue, 29 Jan 2019 20:47:14 GMT"]
        Content-Length: ["3"]
        Content-Type:   ["application/octet-stream"]

Result:
        Success: 61628
        Failed: 0
        Time: 5s
        Req/s: 12325.6

%{concurrency: 20, duration: 5, failed: 0, succeeded: 61628}

It's reporting about half of what wrk does (12325 req/s) and many times what benchmark-http does, and it's nothing but a trivial loop of HTTP requests across the number of work units specified by -c. I know I could significantly optimize it if I really tried (the client library is even doing things like parsing and verifying headers and parsing the body based on the content type, and I could use more efficient OS time calls since I don't need microsecond resolution just for counting, etc., all in-language and not offloaded to native plugins; i.e. it is not even remotely efficient or optimized).

Hmm, let's also test one of my cheap VM servers that's in another country with high latency, i.e. specifically over a laggy internet link instead of just across an intranet:

╰─➤  benchmark-http concurrency http://my-remote-server/
I am going to benchmark http://my-remote-server...
I am running 1 asynchronous tasks that will each make sequential requests...
I made 8 requests in 522.86ms. The per-request latency was 65.36ms. That's 15.300401576435712 asynchronous requests/second.
                  Variance: 3.263µs
        Standard Deviation: 1.81ms
            Standard Error: 0.000638640071178552
I am running 2 asynchronous tasks that will each make sequential requests...
I made 16 requests in 524.51ms. The per-request latency was 65.56ms. That's 29.34524830772133 asynchronous requests/second.
                  Variance: 6.009µs
        Standard Deviation: 2.45ms
            Standard Error: 0.0006128254314687405
I am running 4 asynchronous tasks that will each make sequential requests...
I made 24 requests in 389.90ms. The per-request latency was 64.98ms. That's 53.49286001831672 asynchronous requests/second.
                  Variance: 7.789µs
        Standard Deviation: 2.79ms
            Standard Error: 0.0005696812234706188
I am running 8 asynchronous tasks that will each make sequential requests...
I made 18 requests in 151.20ms. The per-request latency was 67.20ms. That's 88.97398184850658 asynchronous requests/second.
                  Variance: 5.040µs
        Standard Deviation: 2.24ms
            Standard Error: 0.0005291371145748797
I am running 16 asynchronous tasks that will each make sequential requests...
I made 39 requests in 167.54ms. The per-request latency was 68.74ms. That's 181.81855304384172 asynchronous requests/second.
                  Variance: 9.581µs
        Standard Deviation: 3.10ms
            Standard Error: 0.00049565301224643
I am running 32 asynchronous tasks that will each make sequential requests...
I made 64 requests in 136.97ms. The per-request latency was 68.48ms. That's 331.36589476486455 asynchronous requests/second.
                  Variance: 3.857µs
        Standard Deviation: 1.96ms
            Standard Error: 0.00024549164702755406
I am running 64 asynchronous tasks that will each make sequential requests...
I made 128 requests in 152.12ms. The per-request latency was 76.06ms. That's 582.0211726381192 asynchronous requests/second.
                  Variance: 31.918µs
        Standard Deviation: 5.65ms
            Standard Error: 0.0004993577118221741
I am running 128 asynchronous tasks that will each make sequential requests...
I made 256 requests in 146.04ms. The per-request latency was 73.02ms. That's 690.6692888613667 asynchronous requests/second.
                  Variance: 24.645µs
        Standard Deviation: 4.96ms
            Standard Error: 0.0003102709973866521
I am running 256 asynchronous tasks that will each make sequential requests...
I made 512 requests in 149.57ms. The per-request latency was 74.78ms. That's 1065.4000351261438 asynchronous requests/second.
                  Variance: 44.591µs
        Standard Deviation: 6.68ms
            Standard Error: 0.0002951131485336915
I am running 512 asynchronous tasks that will each make sequential requests...
I made 22683 requests in 42.1s. The per-request latency was 950.95ms. That's 336.34093466912407 asynchronous requests/second.
                  Variance: 1.9s
        Standard Deviation: 1.4s
            Standard Error: 0.009160851729160535
I am running 384 asynchronous tasks that will each make sequential requests...
I made 5208 requests in 1.8s. The per-request latency was 132.90ms. That's 1673.9411117341976 asynchronous requests/second.
                  Variance: 8.30ms
        Standard Deviation: 91.11ms
            Standard Error: 0.0012625107156406822
I am running 320 asynchronous tasks that will each make sequential requests...
I made 1203 requests in 283.42ms. The per-request latency was 75.39ms. That's 1096.3431941223046 asynchronous requests/second.
                  Variance: 579.268µs
        Standard Deviation: 24.07ms
            Standard Error: 0.0006939162140022423
I am running 352 asynchronous tasks that will each make sequential requests...
I made 3517 requests in 841.37ms. The per-request latency was 84.21ms. That's 1520.76700494723 asynchronous requests/second.
                  Variance: 2.36ms
        Standard Deviation: 48.54ms
            Standard Error: 0.0008184125161427597
I am running 336 asynchronous tasks that will each make sequential requests...
I made 2033 requests in 537.54ms. The per-request latency was 88.84ms. That's 2050.186508796951 asynchronous requests/second.
                  Variance: 1.43ms
        Standard Deviation: 37.88ms
            Standard Error: 0.0008400212237622703
I am running 328 asynchronous tasks that will each make sequential requests...
I made 44480 requests in 173.1s. The per-request latency was 1.3s. That's 236.91531854386741 asynchronous requests/second.
                  Variance: 7.1s
        Standard Deviation: 2.7s
            Standard Error: 0.012616618260588743
I am running 324 asynchronous tasks that will each make sequential requests...
I made 1596 requests in 402.33ms. The per-request latency was 81.68ms. That's 1676.7053903668725 asynchronous requests/second.
                  Variance: 974.471µs
        Standard Deviation: 31.22ms
            Standard Error: 0.0007813901938619705
I am running 322 asynchronous tasks that will each make sequential requests...
I made 1001 requests in 227.08ms. The per-request latency was 73.05ms. That's 448.50127484693235 asynchronous requests/second.
                  Variance: 520.052µs
        Standard Deviation: 22.80ms
            Standard Error: 0.0007207860050025015
I am running 323 asynchronous tasks that will each make sequential requests...
I made 646 requests in 153.12ms. The per-request latency was 76.56ms. That's 696.2517050748319 asynchronous requests/second.
                  Variance: 261.429µs
        Standard Deviation: 16.17ms
            Standard Error: 0.0006361518703546547
Your server can handle 323 concurrent requests.
At this level of concurrency, requests have ~1.17x higher latency.

That took almost half an hour to run!!! o.O! So it says at 323 it can do 696, yet at 324 it hit its highest, 1676. You can definitely see the GC in it adding a lot of variance!

With wrk at 324 concurrency too:

╰─➤  cd ../wrk && ./wrk -t 5 -c 324 -d 4 http://my-remote-server/
Running 4s test @ http://my-remote-server/
  5 threads and 324 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    76.23ms   28.27ms 521.40ms   97.82%
    Req/Sec   816.07    168.01     1.09k    75.76%
  16143 requests in 4.07s, 6.53MB read
Requests/sec:   3964.72
Transfer/sec:      1.60MB

wrk does 3964.72 req/s (I'm sure my poor upload speed to the server is contributing to these issues...). And via my HTTP GET loop with 324 concurrency as well:

iex(8)> BenchmarkHttp.CLI.main(["concurrency", "-d", "5", "-c", "324", "http://my-remote-server/"]) 

Request Results
        Server:         ["nginx/1.14.0 (Ubuntu)"]
        Date:           ["Tue, 29 Jan 2019 21:16:33 GMT"]
        Content-Length: ["196"]
        Content-Type:   ["text/html"]

Result:
        Success: 18034
        Failed: 0
        Time: 5s
        Req/s: 3606.8

%{concurrency: 324, duration: 5, failed: 0, succeeded: 18034}

Almost as fast as wrk. Of course, the usual internet issues blow reliable testing completely away, so this is expected variance.

Just for fun let's bump my trivial loop to, oh, 2000 concurrent loops for 5 seconds:

iex(9)> BenchmarkHttp.CLI.main(["concurrency", "-d", "5", "-c", "2000", "-s", "308", "http://my-remote-server/tester"])

Request Results
        Server:         ["nginx/1.14.0 (Ubuntu)"]
        Date:           ["Tue, 29 Jan 2019 21:17:22 GMT"]
        Content-Length: ["196"]
        Content-Type:   ["text/html"]

Result:
        Success: 24667
        Failed: 0
        Time: 5s
        Req/s: 4933.4

%{concurrency: 2000, duration: 5, failed: 0, succeeded: 24667}

Still even higher, much more so than wrk at 324, so benchmark-http did not find the optimal concurrency level. wrk at 2000 as well:

╰─➤  cd ../wrk && ./wrk -t 5 -c 2000 -d 5 http://my-remote-server/
Running 5s test @ http://my-remote-server/
  5 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   140.73ms  190.73ms   1.94s    86.09%
    Req/Sec     0.99k   313.39     1.63k    65.85%
  24461 requests in 5.08s, 9.89MB read
  Socket errors: connect 0, read 0, write 0, timeout 44
Requests/sec:   4818.76
Transfer/sec:      1.95MB

Also much higher, about on par with my simple request loop, but still well within the bounds of random internet variance (my loop should be slower than wrk by a good margin).

Let's also try ab at 324 and 2000 for more data points:

╰─➤  ab -c 324 -t 5 -n 5000000 http://my-remote-server/
This is ApacheBench, Version 2.3 <$Revision: 1706008 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking my-remote-server (be patient)
Finished 3214 requests

Server Software:        nginx/1.14.0
Server Hostname:        my-remote-server
Server Port:            80

Document Path:          /
Document Length:        196 bytes

Concurrency Level:      324
Time taken for tests:   5.293 seconds
Complete requests:      3214
Failed requests:        0
Non-2xx responses:      3215
Total transferred:      1347085 bytes
HTML transferred:       630140 bytes
Requests per second:    607.22 [#/sec] (mean)
Time per request:       533.581 [ms] (mean)
Time per request:       1.647 [ms] (mean, across all concurrent requests)
Transfer rate:          248.54 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:       63  201 424.0     75    3085
Processing:    65  154 248.2     78    2430
Waiting:       65  136 218.9     78    2430
Total:        131  355 523.6    155    3657

Percentage of the requests served within a certain time (ms)
  50%    155
  66%    161
  75%    169
  80%    383
  90%   1141
  95%   1285
  98%   2161
  99%   3156
 100%   3657 (longest request)

╰─➤  ab -c 2000 -t 5 -n 5000000 http://my-remote-server
This is ApacheBench, Version 2.3 <$Revision: 1706008 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking my-remote-server (be patient)
Finished 679 requests

Server Software:        nginx/1.14.0
Server Hostname:        my-remote-server
Server Port:            80

Document Path:          /
Document Length:        196 bytes

Concurrency Level:      2000
Time taken for tests:   5.172 seconds
Complete requests:      679
Failed requests:        0
Non-2xx responses:      679
Total transferred:      284501 bytes
HTML transferred:       133084 bytes
Requests per second:    131.30 [#/sec] (mean)
Time per request:       15232.742 [ms] (mean)
Time per request:       7.616 [ms] (mean, across all concurrent requests)
Transfer rate:          53.72 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:       64  950 1253.0     91    3113
Processing:    65  145 376.2     73    3231
Waiting:       65  137 373.1     73    3231
Total:        131 1095 1310.0    175    3807

Percentage of the requests served within a certain time (ms)
  50%    175
  66%   1153
  75%   3135
  80%   3146
  90%   3176
  95%   3286
  98%   3795
  99%   3803
 100%   3807 (longest request)

You can really see how badly ab's single-core model hurts it here, both in throughput and latency! O.o!

ioquatix commented 5 years ago

First, let me explain what this means:

I am running 323 asynchronous tasks that will each make sequential requests...
I made 646 requests in 153.12ms. The per-request latency was 76.56ms. That's 696.2517050748319 asynchronous requests/second.
                  Variance: 261.429µs
        Standard Deviation: 16.17ms
            Standard Error: 0.0006361518703546547
Your server can handle 323 concurrent requests.
At this level of concurrency, requests have ~1.17x higher latency.

It's actually about the same (similar average latency, similar standard deviation):

╰─➤  cd ../wrk && ./wrk -t 5 -c 324 -d 4 http://my-remote-server/
Running 4s test @ http://my-remote-server/
  5 threads and 324 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    76.23ms   28.27ms 521.40ms   97.82%
    Req/Sec   816.07    168.01     1.09k    75.76%
  16143 requests in 4.07s, 6.53MB read
Requests/sec:   3964.72
Transfer/sec:      1.60MB

I made 646 requests in 153.12ms

646 / 0.15312 = ~4200 req/s across 323 connections.

That's 696.2517050748319 asynchronous requests/second.

This is the total number of requests divided by the wall-clock time (which unfortunately isn't printed explicitly).

This takes into account all overheads, connection setup, etc. It's trying to state more realistically what you'd get.
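
To make the relationship between the two figures concrete, here is a rough back-of-the-envelope sketch (not benchmark-http's actual code); the wall-clock time is inferred from the reported rate, since it isn't printed:

# Rough arithmetic relating the two throughput figures from the 323-task run.
requests      = 646
request_time  = 0.15312   # seconds spent making requests, per the log
reported_rate = 696.25    # the "asynchronous requests/second" figure

# Naive throughput, ignoring connection setup and other overheads:
naive_rate = requests / request_time          # => ~4219 req/s

# The reported figure divides by total wall-clock time instead; that time
# isn't printed, but it can be recovered from the reported rate:
implied_wall_clock = requests / reported_rate # => ~0.93 s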

... [regarding higher concurrency]

When you increased the level of concurrency, your latency went through the roof:

324 concurrent connections:

Latency 76.23ms 28.27ms 521.40ms 97.82%

2000 concurrent connections:

Latency 140.73ms 190.73ms 1.94s 86.09%

The point of benchmark-http is to find the level of concurrency at which adding connections starts to increase latency. You can control how tightly it tries to pin down this metric, and yes, it can take a long time, because it accounts for variance in the network connection during the test by constraining the standard error while benchmarking. If your network has issues, it will take much longer for the results to settle.
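
For reference, the reported statistics appear to follow the usual standard-error formula (this is an assumption about how the tool computes them, but it matches the 323-task run quoted above):

# Assumed relationship: standard error = standard deviation / sqrt(sample size).
# Checking against the 646-request run at 323 tasks:
n        = 646
variance = 261.429e-6              # seconds^2, as reported
stddev   = Math.sqrt(variance)     # => ~0.01617 s, i.e. the reported 16.17ms
se       = stddev / Math.sqrt(n)   # => ~0.000636, matching the reported value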

benchmark-http [--verbose | --quiet] [-h/--help] [-v/--version] <command>
    An asynchronous HTTP server benchmark.

    [--verbose | --quiet]  Verbosity of output for debugging.
    [-h/--help]            Print out help information.       
    [-v/--version]         Print out the application version.
    <command>              One of: concurrency, spider.      

    concurrency [-t/--threshold <factor>] [-c/--confidence <factor>] <hosts...>
        Determine the optimal level of concurrency.

        [-t/--threshold <factor>]   The acceptable latency penalty when making concurrent requests                      Default: 1.2 
        [-c/--confidence <factor>]  The confidence required when computing latency (lower is less reliable but faster)  Default: 0.99
        <hosts...>                  One or more hosts to benchmark                                                    

You can make it run more quickly by specifying --confidence 0.9 or even lower. But the results won't be as stable.

You can loosen the bounds of the search by specifying --threshold 1.5, which would allow latency to get worse by up to 50% when increasing concurrency.

In your example with -c 2000, your latency was at least 2x worse. That's the point: you can increase the number of connections, but you sacrifice latency. In other words, the server is still responding, but much more slowly, because it's bottlenecked.
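
The shape of the run above (ramping 32, 64, 128, 256, 512, then narrowing down to 322/323/324) suggests a doubling phase followed by a bisection against the latency-penalty threshold. Here is a minimal sketch of that idea, assuming a hypothetical measure_latency helper; it is not benchmark-http's actual implementation:

# Sketch only: double concurrency until mean latency exceeds
# threshold * baseline, then bisect to the highest acceptable level.
# measure_latency is a hypothetical block returning mean latency at a
# given concurrency; NOT the real benchmark-http code.
def find_concurrency(threshold: 1.2, &measure_latency)
  baseline = measure_latency.call(1)

  # Ramp up until the latency penalty exceeds the threshold.
  low, high = 1, 2
  while measure_latency.call(high) <= baseline * threshold
    low, high = high, high * 2
  end

  # Bisect between the last acceptable and the first unacceptable level.
  while high - low > 1
    mid = (low + high) / 2
    if measure_latency.call(mid) <= baseline * threshold
      low = mid
    else
      high = mid
    end
  end

  low # highest concurrency whose latency penalty stays under the threshold
end

With the default threshold of 1.2, that shape is consistent with the result above: 323 tasks at ~1.17x the baseline latency was accepted, while 324 apparently pushed past the limit.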

ioquatix commented 5 years ago

The wall clock time is pretty confusing, so I'm going to rework the output a bit to make this clearer.