Tide without TCP_NODELAY performs unfavorably in benchmarks

http-rs / tide

Fast and friendly HTTP server framework for async Rust

https://docs.rs/tide

Apache License 2.0

5.05k stars 322 forks source link

Tide without TCP_NODELAY performs unfavorably in benchmarks #814

Open wisonye opened 3 years ago

wisonye commented 3 years ago

I wrote 2 performance testing minimal HTTP servers, one for tide (1.6) and one for Node.JS, what I expect is that the tide one should faster than the Node.JS one. But the result is the opposite..., anyone has a try on this?:)

Testing environment:

macOS 10.14.6
Intel(R) Core(TM) i5-8500 CPU @ 3.00GHz [ 6 cores ]
rustc 1.50.0 (cb75ad5db 2021-02-10)

Here are the source code for both versions:

Both HTTP server have the same routes and responses below:

HTML response (Benchmark testing) for the default route /:

JSON response for the /json-benchmark route:

{
    "name":"Wison Ye",
    "role":"Administrator",
    "settings":{
        "prefer_language":"English",
        "reload_when_changed":true
    }
}

Here is my ulimit -a output:

Maximum size of core files created                           (kB, -c) 0
Maximum size of a processճ data segment                     (kB, -d) unlimited
Maximum size of files created by the shell                   (kB, -f) unlimited
Maximum size that may be locked into memory                  (kB, -l) unlimited
Maximum resident set size                                    (kB, -m) unlimited
Maximum number of open file descriptors                          (-n) 1000000
Maximum stack size                                           (kB, -s) 8192
Maximum amount of cpu time in seconds                   (seconds, -t) unlimited
Maximum number of processes available to a single user           (-u) 3546
Maximum amount of virtual memory available to the shell      (kB, -v) unlimited

Here is the test result:

NodeJS version:

recreate node project:

npm init -y
// copy the `benchmark_server.js` to current folder
npm install restify restify-errors

node --version
# v14.16.0

Node spwan 6 cluster workers to serve:

node benchmark_server.js

# setupMaster Cluster worker amount: 6
# setupMaster Cluster worker "1" (PID: 10719) is online.
# setupMaster Cluster worker "3" (PID: 10721) is online.
# setupMaster Cluster worker "2" (PID: 10720) is online.
# setupMaster Cluster worker "4" (PID: 10722) is online.
# setupMaster Cluster worker "5" (PID: 10723) is online.
# setupMaster Cluster worker "6" (PID: 10724) is online.
# run Worker Process 3 (PID: 10721) |  "Benchmark Http Server" is running at http://127.0.0.1:8080
# setupMaster Cluster worker "3" (PID: 10721) is listening on 127.0.0.1:8080.
# run Worker Process 2 (PID: 10720) |  "Benchmark Http Server" is running at http://127.0.0.1:8080
# setupMaster Cluster worker "2" (PID: 10720) is listening on 127.0.0.1:8080.
# run Worker Process 6 (PID: 10724) |  "Benchmark Http Server" is running at http://127.0.0.1:8080
# setupMaster Cluster worker "6" (PID: 10724) is listening on 127.0.0.1:8080.
# run Worker Process 1 (PID: 10719) |  "Benchmark Http Server" is running at http://127.0.0.1:8080
# setupMaster Cluster worker "1" (PID: 10719) is listening on 127.0.0.1:8080.
# run Worker Process 4 (PID: 10722) |  "Benchmark Http Server" is running at http://127.0.0.1:8080
# setupMaster Cluster worker "4" (PID: 10722) is listening on 127.0.0.1:8080.
# run Worker Process 5 (PID: 10723) |  "Benchmark Http Server" is running at http://127.0.0.1:8080
# setupMaster Cluster worker "5" (PID: 10723) is listening on 127.0.0.1:8080.

# `/` default route
wrk --thread 8 --connections 5000 --duration 10s --latency http://127.0.0.1:8080/

Running 10s test @ http://127.0.0.1:8080/
  8 threads and 5000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     7.21ms    7.06ms 165.76ms   90.33%
    Req/Sec    10.55k     5.94k   40.11k    80.46%
  Latency Distribution
     50%    5.57ms
     75%    9.21ms
     90%   14.07ms
     99%   31.48ms
  769416 requests in 10.06s, 151.16MB read
  Socket errors: connect 0, read 1251, write 0, timeout 0
Requests/sec:  76466.35
Transfer/sec:     15.02MB

# `/json-benchmark` route
wrk --thread 8 --connections 5000 --duration 10s --latency http://127.0.0.1:8080/json-benchmark

Running 10s test @ http://127.0.0.1:8080/json-benchmark
  8 threads and 5000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     8.45ms    8.56ms 327.08ms   92.21%
    Req/Sec     9.94k     4.52k   28.71k    73.51%
  Latency Distribution
     50%    6.66ms
     75%    9.82ms
     90%   15.27ms
     99%   34.71ms
  729305 requests in 10.06s, 206.57MB read
  Socket errors: connect 0, read 1488, write 3, timeout 0
Requests/sec:  72481.25
Transfer/sec:     20.53MB

Rust version:

recreate node project:

cargo new benchmark

# Add the dependencies to `Cargo.toml`:
tide = "~0.15"
async-std = { version = "1.8.0", features = ["attributes"] }
async-trait = "^0.1.41"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0

# Build release version
cargo build --bin benchmark_server --release

./target/release/benchmark_server
[ Benchmark Server Demo ]

Benchmark Server is listening on: 0.0.0.0:8080

# `/` default route
wrk --thread 8 --connections 5000 --duration 10s --latency http://127.0.0.1:8080/

Running 10s test @ http://127.0.0.1:8080/
  8 threads and 5000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    10.16ms    6.56ms 228.77ms   69.81%
    Req/Sec     8.15k     5.77k   42.38k    88.54%
  Latency Distribution
     50%    9.14ms
     75%   14.20ms
     90%   17.50ms
     99%   23.26ms
  577136 requests in 10.08s, 73.82MB read
  Socket errors: connect 0, read 1493, write 3, timeout 0
Requests/sec:  57266.86
Transfer/sec:      7.32MB

# `/json-benchmark` route
wrk --thread 8 --connections 5000 --duration 10s --latency http://127.0.0.1:8080/json-benchmark

Running 10s test @ http://127.0.0.1:8080/json-benchmark
  8 threads and 5000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    11.39ms    5.93ms 216.63ms   66.60%
    Req/Sec     7.73k     3.47k   24.06k    67.97%
  Latency Distribution
     50%   10.51ms
     75%   15.89ms
     90%   18.91ms
     99%   22.02ms
  573039 requests in 10.08s, 119.74MB read
  Socket errors: connect 0, read 1120, write 0, timeout 0
Requests/sec:  56862.43
Transfer/sec:     11.88MB

fiag commented 3 years ago

Maybe same to this issue #781

wisonye commented 3 years ago

Maybe same to this issue #781

So it means I need to wait for the next fix? as your PR didn't merge yet.

Or any workaround I can use at this moment? plz :)

fiag commented 3 years ago

try this. Use this patch to your Cargo.toml.

[patch.crates-io]
tide = { git = 'https://github.com/fiag/tide.git', branch='tcp-nodelay' }

wisonye commented 3 years ago

try this. Use this patch to your Cargo.toml.
[patch.crates-io]
tide = { git = 'https://github.com/fiag/tide.git', branch='tcp-nodelay' }

Hi flag, thanks for the patch, and actually .... the result is quite funny...:)

After adding your patch to Cargo.tom, I run:

# Actually I also delete the `Cargo.Lock` file
cargo update 
cargo clean && cargo build --bin benchmark_server --release

So I run the release version and test it again:

After that, test the node version again:

And here is the result, I took a screenshot and then align them together which easier to compare:

As you can see above, the Rust SHOULD fast than the node version, as the latency is low and balabala... which highlighted in the green colour. But somehow the node version can handle a lot of connections than the Rust one. That's why the Final result shows the node version got more throughput... (And btw, I use my iMac to run the above test, that's why the result is different from the very beginning when I created this issue which runs on my MacBookPro).

I've already considered that the Node version spawns a few child processes (even the ps command shows it got the same Thread amount with the rust binary), but tide use async-std which means it still spawns the same number of threads (with my CPU core amount), as the ASYNC_STD_THREAD_COUNT does that by default. Also, async-std uses the Rust Async model which should be efficient than the normal IPC which the node's cluster module use. ..... I just don't get that why the final result looks like that, could anyone try to have a try on this, plz :)

wisonye commented 3 years ago

Also, I got another same situation and comparison result in my production service. I built a Binary Protocol Parser for encoding/decoding the hardware network data which transfers via TCP.

I made a performance test for both Typescripts (run in Node) and Rust version (release binary). The test very simple: Just run the decode function in a for loop to parse the same lines of binary protocol data (basically, just a brunch of byte[] / [u8]).

But the result is pretty funny which is the TypeScript one got more throughput than the Rust one. I consider that:

Maybe in every for loop scope, the Rust version always re-allocate and de-allocate all the local var memories (should be almost a few million operations during the test), as I saw the rust binary memory footprint can keep around 428KB and it's very stable.

But the node version, it uses around 32MB to run through the test (for getting that high throughput result). So I guess, the V8 never run the GC (to free the memory)? :)

Is that the same potential reason for the tide test result above?

kennetpostigo commented 3 years ago

@wisonye might be out of scope, but I'd be curious to see how other web frameworks written in rust perform here, and if they produce similar results. Might be out of scope, but it made me curious

wisonye commented 3 years ago

Yup, it's not out of scope:) And I also want to see how other frameworks perform as well. If you have time then add a minimal demo here and see what happen:)

slhmy commented 3 years ago

I found tide slower than spring😂

Fishrock123 commented 3 years ago

Sorry, I've been too busy using Tide in prod to dig into this.

I can tell you from production experience: it is orders of magnitude faster than Node.js for common workloads.

Fishrock123 commented 3 years ago

autocannon against your node.js example:

autocannon 192.168.0.10:8080 -c 16 -W -w 8 -d 20
Running 20s warmup @ http://192.168.0.10:8080
16 connections
8 workers

Running 20s test @ http://192.168.0.10:8080
16 connections
8 workers

┌─────────┬──────┬──────┬───────┬──────┬─────────┬─────────┬────────┐
│ Stat    │ 2.5% │ 50%  │ 97.5% │ 99%  │ Avg     │ Stdev   │ Max    │
├─────────┼──────┼──────┼───────┼──────┼─────────┼─────────┼────────┤
│ Latency │ 0 ms │ 1 ms │ 1 ms  │ 1 ms │ 0.61 ms │ 1.46 ms │ 155 ms │
└─────────┴──────┴──────┴───────┴──────┴─────────┴─────────┴────────┘
┌───────────┬─────────┬─────────┬────────┬─────────┬─────────┬────────┬─────────┐
│ Stat      │ 1%      │ 2.5%    │ 50%    │ 97.5%   │ Avg     │ Stdev  │ Min     │
├───────────┼─────────┼─────────┼────────┼─────────┼─────────┼────────┼─────────┤
│ Req/Sec   │ 15199   │ 15199   │ 15535  │ 17679   │ 15782.4 │ 644.56 │ 15198   │
├───────────┼─────────┼─────────┼────────┼─────────┼─────────┼────────┼─────────┤
│ Bytes/Sec │ 3.13 MB │ 3.13 MB │ 3.2 MB │ 3.64 MB │ 3.25 MB │ 133 kB │ 3.13 MB │
└───────────┴─────────┴─────────┴────────┴─────────┴─────────┴────────┴─────────┘

Req/Bytes counts sampled once per second.

316k requests in 20.05s, 65 MB read

Autocannon against Tide (--release):

autocannon 192.168.0.10:8080 -c 16 -W -w 8 -d 20
Running 20s warmup @ http://192.168.0.10:8080
16 connections
8 workers

Running 20s test @ http://192.168.0.10:8080
16 connections
8 workers

┌─────────┬──────┬──────┬───────┬──────┬─────────┬────────┬────────┐
│ Stat    │ 2.5% │ 50%  │ 97.5% │ 99%  │ Avg     │ Stdev  │ Max    │
├─────────┼──────┼──────┼───────┼──────┼─────────┼────────┼────────┤
│ Latency │ 0 ms │ 1 ms │ 2 ms  │ 2 ms │ 1.16 ms │ 0.9 ms │ 110 ms │
└─────────┴──────┴──────┴───────┴──────┴─────────┴────────┴────────┘
┌───────────┬────────┬────────┬─────────┬─────────┬─────────┬────────┬────────┐
│ Stat      │ 1%     │ 2.5%   │ 50%     │ 97.5%   │ Avg     │ Stdev  │ Min    │
├───────────┼────────┼────────┼─────────┼─────────┼─────────┼────────┼────────┤
│ Req/Sec   │ 8967   │ 8967   │ 9167    │ 11063   │ 9449.6  │ 530.04 │ 8966   │
├───────────┼────────┼────────┼─────────┼─────────┼─────────┼────────┼────────┤
│ Bytes/Sec │ 1.2 MB │ 1.2 MB │ 1.23 MB │ 1.48 MB │ 1.27 MB │ 71 kB  │ 1.2 MB │
└───────────┴────────┴────────┴─────────┴─────────┴─────────┴────────┴────────┘

Req/Bytes counts sampled once per second.

189k requests in 20.04s, 25.3 MB read

That's kinda odd. It's definitely not what we observe but we also don't stress our Rust processes much (because they are plenty fast to carry our load).

Notes: This was done by running the benchmarker on my laptop (a slower machine) against the server examples on my desktop (a faster machine). Everything is wired together on gigabit ethernet.

Fishrock123 commented 3 years ago

Linux perf counter stats seem to indicate this is artificial (possibly TCP no_delay related):

 Performance counter stats for 'node benchmark_server.js':

        117,093.95 msec task-clock                #    2.461 CPUs utilized          
           469,264      context-switches          #    0.004 M/sec                  
            86,414      cpu-migrations            #    0.738 K/sec                  
           102,170      page-faults               #    0.873 K/sec                  
   272,687,722,707      cycles                    #    2.329 GHz                    
   122,404,642,364      instructions              #    0.45  insn per cycle         
    25,706,131,492      branches                  #  219.534 M/sec                  
     1,663,729,936      branch-misses             #    6.47% of all branches

  Performance counter stats for 'cargo run --release':

         49,355.48 msec task-clock                #    0.430 CPUs utilized          
           863,669      context-switches          #    0.017 M/sec                  
            23,368      cpu-migrations            #    0.473 K/sec                  
             7,929      page-faults               #    0.161 K/sec                  
    85,474,611,529      cycles                    #    1.732 GHz                    
    43,426,351,109      instructions              #    0.51  insn per cycle         
     8,534,130,491      branches                  #  172.912 M/sec                  
       529,585,015      branch-misses             #    6.21% of all branches

Of note there, Tide does a bunch more context switching, but it's not too bad, I think.

Tide however uses less than a third of the cpu cycles.

Fishrock123 commented 3 years ago

With tcp no_delay enabled via @jbr's draft PR (https://github.com/http-rs/tide/pull/823) I get:

Running 20s warmup @ http://192.168.0.10:8080
16 connections
8 workers

Running 20s test @ http://192.168.0.10:8080
16 connections
8 workers

┌─────────┬──────┬──────┬───────┬──────┬─────────┬─────────┬────────┐
│ Stat    │ 2.5% │ 50%  │ 97.5% │ 99%  │ Avg     │ Stdev   │ Max    │
├─────────┼──────┼──────┼───────┼──────┼─────────┼─────────┼────────┤
│ Latency │ 0 ms │ 0 ms │ 1 ms  │ 2 ms │ 0.53 ms │ 0.98 ms │ 130 ms │
└─────────┴──────┴──────┴───────┴──────┴─────────┴─────────┴────────┘
┌───────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ Stat      │ 1%      │ 2.5%    │ 50%     │ 97.5%   │ Avg     │ Stdev   │ Min     │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Req/Sec   │ 12887   │ 12887   │ 14711   │ 19167   │ 15810.6 │ 2260.19 │ 12885   │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Bytes/Sec │ 1.73 MB │ 1.73 MB │ 1.97 MB │ 2.57 MB │ 2.12 MB │ 303 kB  │ 1.73 MB │
└───────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘

Req/Bytes counts sampled once per second.

316k requests in 20.03s, 42.4 MB read

Which is about on-par. I think my laptop is now the limiting factor. I'll try to run the benchmark in reverse.

Fishrock123 commented 3 years ago

Also, for that last example, we're still using only about half the cpu cycles for the same number of requests as Node.

 Performance counter stats for 'cargo run --release':

         55,935.65 msec task-clock                #    0.565 CPUs utilized          
         1,361,880      context-switches          #    0.024 M/sec                  
            22,762      cpu-migrations            #    0.407 K/sec                  
             7,881      page-faults               #    0.141 K/sec                  
   126,799,597,556      cycles                    #    2.267 GHz                    
    71,088,504,957      instructions              #    0.56  insn per cycle         
    13,899,982,207      branches                  #  248.500 M/sec                  
       583,560,089      branch-misses             #    4.20% of all branches

Fishrock123 commented 3 years ago

I am going to caution that no_delay may be ideal for this benchmarking workload but may not be ideal in the real world.

slhmy commented 3 years ago

I am going to caution that no_delay may be ideal for this benchmarking workload but may not be ideal in the real world.

I run tfb with TCP_NODELAY, it has a big improvement in Req/Sec, but the Latency increased a lot.

**TCP_NODELAY**
8 threads and 8 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   644.08ms    1.07s    3.67s    81.67%
    Req/Sec    12.16k     4.52k   14.46k    87.83%

Comparing with..

**tide = "0.16.0"**
  8 threads and 8 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    15.47ms   16.24ms  45.97ms   80.29%
    Req/Sec    93.20     32.76   272.00     68.75%

The result looks really strange. I'm not familiar with this topic, but I think the following link will help. https://stackoverflow.com/questions/3761276/when-should-i-use-tcp-nodelay-and-when-tcp-cork

@wisonye might be out of scope, but I'd be curious to see how other web frameworks written in rust perform here, and if they produce similar results. Might be out of scope, but it made me curious

Also currently warp has a more satisfying result on my computer.

**warp**
  8 threads and 8 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   497.54us  550.14us  18.32ms   97.70%
    Req/Sec     2.02k   309.63     2.96k    78.77%

It's understandably microbenchmark doesn't reflect the whole real word. But the result can't persuade me that tide is good enough like other web app framework.

kennetpostigo commented 3 years ago

@Fishrock123 @wisonye @slhmy Are there any tools to inspect where time is spent when the server is running, might offer some clues as to what you were seeing @wisonye

slhmy commented 3 years ago

@Fishrock123 @wisonye @slhmy Are there any tools to inspect where time is spent when the server is running, might offer some clues as to what you were seeing @wisonye

Don't know if flamegraph(https://github.com/flamegraph-rs/flamegraph) will help... I'm kind of busy nowadays.

Fishrock123 commented 3 years ago

I want to take bottom-up perf stacks but don't know how to offhand with rust (and I am super busy).

fiag commented 3 years ago

autocannon to benchmark_server.js

❯ autocannon 192.168.100.108:8080 -c 16 -W -w 8 -d 20                                                            (base)
Running 20s warmup @ http://192.168.100.108:8080
16 connections
8 workers

Running 20s test @ http://192.168.100.108:8080
16 connections
8 workers

┌─────────┬──────┬──────┬───────┬──────┬─────────┬─────────┬────────┐
│ Stat    │ 2.5% │ 50%  │ 97.5% │ 99%  │ Avg     │ Stdev   │ Max    │
├─────────┼──────┼──────┼───────┼──────┼─────────┼─────────┼────────┤
│ Latency │ 0 ms │ 0 ms │ 0 ms  │ 1 ms │ 0.34 ms │ 7.63 ms │ 376 ms │
└─────────┴──────┴──────┴───────┴──────┴─────────┴─────────┴────────┘
┌───────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ Stat      │ 1%      │ 2.5%    │ 50%     │ 97.5%   │ Avg     │ Stdev   │ Min     │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Req/Sec   │ 16879   │ 16879   │ 32895   │ 39871   │ 31267.6 │ 5363.64 │ 16871   │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Bytes/Sec │ 3.09 MB │ 3.09 MB │ 6.02 MB │ 7.29 MB │ 5.72 MB │ 981 kB  │ 3.09 MB │
└───────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘

Req/Bytes counts sampled once per second.

625k requests in 20.21s, 114 MB read

autocanon tide --release, with TCP_NODELAY

❯ autocannon 192.168.100.108:8080 -c 16 -W -w 8 -d 20                                                            (base)
Running 20s warmup @ http://192.168.100.108:8080
16 connections
8 workers

Running 20s test @ http://192.168.100.108:8080
16 connections
8 workers

┌─────────┬──────┬──────┬───────┬──────┬─────────┬─────────┬────────┐
│ Stat    │ 2.5% │ 50%  │ 97.5% │ 99%  │ Avg     │ Stdev   │ Max    │
├─────────┼──────┼──────┼───────┼──────┼─────────┼─────────┼────────┤
│ Latency │ 0 ms │ 0 ms │ 0 ms  │ 1 ms │ 0.07 ms │ 1.68 ms │ 255 ms │
└─────────┴──────┴──────┴───────┴──────┴─────────┴─────────┴────────┘
┌───────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ Stat      │ 1%      │ 2.5%    │ 50%     │ 97.5%   │ Avg     │ Stdev   │ Min     │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Req/Sec   │ 20015   │ 20015   │ 45023   │ 47871   │ 43102.4 │ 6650.74 │ 20001   │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Bytes/Sec │ 2.68 MB │ 2.68 MB │ 6.03 MB │ 6.41 MB │ 5.78 MB │ 891 kB  │ 2.68 MB │
└───────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘

Req/Bytes counts sampled once per second.

862k requests in 20.01s, 116 MB read

And make a flamegraph. flamegraph.svg.zip

wisonye commented 3 years ago

I am going to caution that no_delay may be ideal for this benchmarking workload but may not be ideal in the real world.

I run tfb with TCP_NODELAY, it has a big improvement in Req/Sec, but the Latency increased a lot.
**TCP_NODELAY**
8 threads and 8 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   644.08ms    1.07s    3.67s    81.67%
    Req/Sec    12.16k     4.52k   14.46k    87.83%
Comparing with..
**tide = "0.16.0"**
  8 threads and 8 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    15.47ms   16.24ms  45.97ms   80.29%
    Req/Sec    93.20     32.76   272.00     68.75%
The result looks really strange. I'm not familiar with this topic, but I think the following link will help. https://stackoverflow.com/questions/3761276/when-should-i-use-tcp-nodelay-and-when-tcp-cork

@wisonye might be out of scope, but I'd be curious to see how other web frameworks written in rust perform here, and if they produce similar results. Might be out of scope, but it made me curious

Also currently warp has a more satisfying result on my computer.
**warp**
  8 threads and 8 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   497.54us  550.14us  18.32ms   97.70%
    Req/Sec     2.02k   309.63     2.96k    78.77%
It's understandably microbenchmark doesn't reflect the whole real word. But the result can't persuade me that tide is good enough like other web app framework.

@slhmy Hey, sorry for late, so busy nowadays. And YES, you're right I think. Actually, I think the choice belongs to between async-std and tokio, as that's a bigger difference under the hood:) It very depends :)

wisonye commented 3 years ago

autocannon to benchmark_server.js

❯ autocannon 192.168.100.108:8080 -c 16 -W -w 8 -d 20                                                            (base)
Running 20s warmup @ http://192.168.100.108:8080
16 connections
8 workers

Running 20s test @ http://192.168.100.108:8080
16 connections
8 workers

┌─────────┬──────┬──────┬───────┬──────┬─────────┬─────────┬────────┐
│ Stat    │ 2.5% │ 50%  │ 97.5% │ 99%  │ Avg     │ Stdev   │ Max    │
├─────────┼──────┼──────┼───────┼──────┼─────────┼─────────┼────────┤
│ Latency │ 0 ms │ 0 ms │ 0 ms  │ 1 ms │ 0.34 ms │ 7.63 ms │ 376 ms │
└─────────┴──────┴──────┴───────┴──────┴─────────┴─────────┴────────┘
┌───────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ Stat      │ 1%      │ 2.5%    │ 50%     │ 97.5%   │ Avg     │ Stdev   │ Min     │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Req/Sec   │ 16879   │ 16879   │ 32895   │ 39871   │ 31267.6 │ 5363.64 │ 16871   │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Bytes/Sec │ 3.09 MB │ 3.09 MB │ 6.02 MB │ 7.29 MB │ 5.72 MB │ 981 kB  │ 3.09 MB │
└───────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘

Req/Bytes counts sampled once per second.

625k requests in 20.21s, 114 MB read

autocanon tide --release, with TCP_NODELAY

❯ autocannon 192.168.100.108:8080 -c 16 -W -w 8 -d 20                                                            (base)
Running 20s warmup @ http://192.168.100.108:8080
16 connections
8 workers

Running 20s test @ http://192.168.100.108:8080
16 connections
8 workers

┌─────────┬──────┬──────┬───────┬──────┬─────────┬─────────┬────────┐
│ Stat    │ 2.5% │ 50%  │ 97.5% │ 99%  │ Avg     │ Stdev   │ Max    │
├─────────┼──────┼──────┼───────┼──────┼─────────┼─────────┼────────┤
│ Latency │ 0 ms │ 0 ms │ 0 ms  │ 1 ms │ 0.07 ms │ 1.68 ms │ 255 ms │
└─────────┴──────┴──────┴───────┴──────┴─────────┴─────────┴────────┘
┌───────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ Stat      │ 1%      │ 2.5%    │ 50%     │ 97.5%   │ Avg     │ Stdev   │ Min     │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Req/Sec   │ 20015   │ 20015   │ 45023   │ 47871   │ 43102.4 │ 6650.74 │ 20001   │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Bytes/Sec │ 2.68 MB │ 2.68 MB │ 6.03 MB │ 6.41 MB │ 5.78 MB │ 891 kB  │ 2.68 MB │
└───────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘

Req/Bytes counts sampled once per second.

862k requests in 20.01s, 116 MB read

And make a flamegraph. flamegraph.svg.zip

@fiag .....That's funny:) But I remember that I did give it a try based on your patch branch for the TCP_NO_DELAY settings, and the result I got looks no much different. How it comes it looks like a very big difference when you use it? :)

slhmy commented 3 years ago

@slhmy Hey, sorry for late, so busy nowadays. And YES, you're right I think. Actually, I think the choice belongs to between and , as that's a bigger difference under the hood:) It very depends :)async-std``tokio

🦀 Maybe more comparisons need to be made.

I currently made actix-web work with sqlx (sqlx runs a tokio runtime which is compatible with actix-web 4.0-beta) and there is also some performance issue...(Check this issue. It is temporary solved by making querying in one connection)

I also found there is a huge performance loss if I put a async-sever into a docker machine.

Combining to the above relate, I guess maybe async-std consumes a lot of time to switch between threads, but I can't make flamegraph for computer reason...So, it's only my guess. 😂

wisonye commented 3 years ago

@slhmy Thanks for that:) Also, here is my personal opinion:

(Check this issue. It is temporary solved by making querying in one connection)

I did check that issue and I think jplatte answer is good. As usual, the Pool just responsible for **getting an exists and free connection back to you when you ask for; If not one exists, then create a new connection instance and cache it (in the hashmap usually). What you were trying to do is ask the pool to give you the new one and use it immediately (before it releases which more like none of free connection instance in the pool). I think that's why that "bug" shows up, just guess :)
I also found there is a huge performance loss if I put a async-sever into a docker machine.

I did use async-std in production and that high-performance TCP server is running inside docker swarm as well which I didn't see any slow issue. So what's your case actually?:)
I guess maybe async-std consumes a lot of time to switch between threads

I think you can go to async-std discord channel to ask them, as by default just spawn a few threads which based on how much CPU core you have, should not be a very big problem for switching between the threads (I think), and I should works like asking for a free OS/Native thread from the internal thread pool:)

slhmy commented 3 years ago

@wisonye

I also found there is a huge performance loss if I put a async-sever into a docker machine. I did use in production and that high-performance TCP server is running inside docker swarm as well which I didn't see any slow issue. So what's your case actually?:)async-std

Thanks a lot for your help, actually it''s related the issue I have post.

I run the service in a docker container, and actix+sqlx will cost more than 20s to request 500 rows in the database. However others like tide+sqlx will not(only cost around 300ms).

The 20s problem only appears in docker machine build by tfb debug mode (tfb debug mode will automatically run two containers one for database and one for the server), while the server is not in docker it won't cost so much time. Any way I also think your opinion is correct, so I will do more try and exclude all my personal issue if I could, then the guess may come into a result.