http-rs / tide

Fast and friendly HTTP server framework for async Rust
Apache License 2.0
5.05k stars 322 forks

Tide without TCP_NODELAY performs unfavorably in benchmarks #814

Open wisonye opened 3 years ago

wisonye commented 3 years ago

I wrote two minimal HTTP servers for performance testing, one with tide (1.6) and one with Node.js. I expected the tide one to be faster than the Node.js one, but the result is the opposite... Has anyone else tried this? :)

Testing environment:

macOS 10.14.6
Intel(R) Core(TM) i5-8500 CPU @ 3.00GHz [ 6 cores ]
rustc 1.50.0 (cb75ad5db 2021-02-10)

Here is the source code for both versions:

Both HTTP servers have the same routes and responses, listed below:

Here is my ulimit -a output:

Maximum size of core files created                           (kB, -c) 0
Maximum size of a process's data segment                     (kB, -d) unlimited
Maximum size of files created by the shell                   (kB, -f) unlimited
Maximum size that may be locked into memory                  (kB, -l) unlimited
Maximum resident set size                                    (kB, -m) unlimited
Maximum number of open file descriptors                          (-n) 1000000
Maximum stack size                                           (kB, -s) 8192
Maximum amount of cpu time in seconds                   (seconds, -t) unlimited
Maximum number of processes available to a single user           (-u) 3546
Maximum amount of virtual memory available to the shell      (kB, -v) unlimited

Here is the test result:

fiag commented 3 years ago

Maybe this is the same as issue #781.

wisonye commented 3 years ago

So does that mean I need to wait for the next fix, since your PR hasn't been merged yet?

Or is there any workaround I can use in the meantime? Please :)

fiag commented 3 years ago

Try this. Add this patch to your Cargo.toml:

tide = { git = '', branch = 'tcp-nodelay' }

wisonye commented 3 years ago

Hi fiag, thanks for the patch. Actually... the result is quite funny :)

After adding your patch to Cargo.toml, I ran:

# I also deleted the `Cargo.lock` file first
cargo update 
cargo clean && cargo build --bin benchmark_server --release

Then I ran the release version and tested it again:


After that, I tested the node version again:


Here is the result. I took screenshots and aligned them side by side to make them easier to compare:


As you can see above, the Rust version SHOULD be faster than the node version: its latency is lower, and so on (the parts highlighted in green). But somehow the node version can handle many more connections than the Rust one, which is why the final result shows the node version with more throughput... (By the way, I ran the above test on my iMac. That's why the result differs from the one at the very beginning of this issue, which I ran on my MacBook Pro.)

I've already considered that the Node version spawns a few child processes (though the ps command shows it has the same thread count as the Rust binary). But tide uses async-std, which spawns as many threads as I have CPU cores, since that is the default for ASYNC_STD_THREAD_COUNT. Also, async-std uses the Rust async model, which should be more efficient than the IPC that node's cluster module uses... I just don't get why the final result looks like that. Could anyone give this a try, please? :)
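
For reference, the thread-count default described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not async-std's actual source: the helper name `worker_thread_count` is hypothetical, and it assumes the runtime honors `ASYNC_STD_THREAD_COUNT` when it is set and otherwise falls back to the number of logical cores.

```rust
use std::env;
use std::thread;

/// Hypothetical helper: pick a worker-thread count the way the comment above
/// describes, i.e. use the environment value if it is a positive integer,
/// otherwise fall back to the number of logical cores (at least 1).
fn worker_thread_count(env_value: Option<&str>) -> usize {
    env_value
        .and_then(|v| v.trim().parse::<usize>().ok())
        .filter(|&n| n > 0)
        .unwrap_or_else(|| {
            thread::available_parallelism()
                .map(|n| n.get())
                .unwrap_or(1)
        })
}

fn main() {
    // Assumption: the variable name matches the one mentioned in this thread.
    let from_env = env::var("ASYNC_STD_THREAD_COUNT").ok();
    let count = worker_thread_count(from_env.as_deref());
    println!("worker threads: {count}");
}
```

Running the benchmark server with `ASYNC_STD_THREAD_COUNT` set to different values would be one way to check whether the thread count explains the gap.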

wisonye commented 3 years ago

Also, I ran into the same situation and comparison result in my production service. I built a binary protocol parser for encoding/decoding hardware network data transferred via TCP.

I ran a performance test on both the TypeScript version (run in Node) and the Rust version (release binary). The test is very simple: just run the decode function in a for loop to parse the same lines of binary protocol data (basically, just a bunch of byte[] / [u8]).

But the result is pretty funny: the TypeScript one got more throughput than the Rust one. My thinking is:

Maybe in every for loop scope, the Rust version always re-allocates and de-allocates the memory of all the local variables (that would be a few million operations during the test), as I saw the Rust binary's memory footprint stay around 428KB, and it was very stable.

But the node version uses around 32MB to run through the test (to get that high throughput). So I guess V8 never runs the GC (to free the memory)? :)

Is that potentially the same reason for the tide test result above?
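
One way to test the allocation hypothesis above is to compare a decode loop that allocates a fresh buffer on every iteration with one that reuses a single buffer. The sketch below is only an illustration: `decode_alloc` and `decode_reuse` are hypothetical stand-ins for the real parser, which is not shown in this thread, and the "decoding" is a trivial byte transform.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Hypothetical decoder that allocates a new output buffer on every call.
fn decode_alloc(frame: &[u8]) -> Vec<u8> {
    frame.iter().map(|b| b ^ 0xFF).collect()
}

/// Same transform, but writes into a caller-provided buffer that is reused.
fn decode_reuse(frame: &[u8], out: &mut Vec<u8>) {
    out.clear(); // keeps the capacity, so no reallocation after the first call
    out.extend(frame.iter().map(|b| b ^ 0xFF));
}

fn main() {
    let frame: &[u8] = &[0x01, 0x02, 0x03, 0x04, 0xAA, 0xBB, 0xCC, 0xDD];
    let iterations = 1_000_000;

    // Variant 1: a fresh Vec is allocated and dropped in every iteration.
    let start = Instant::now();
    for _ in 0..iterations {
        black_box(decode_alloc(black_box(frame)));
    }
    println!("allocate per iteration: {:?}", start.elapsed());

    // Variant 2: one buffer is allocated up front and reused.
    let mut out = Vec::with_capacity(frame.len());
    let start = Instant::now();
    for _ in 0..iterations {
        decode_reuse(black_box(frame), &mut out);
        black_box(&out);
    }
    println!("reuse one buffer:       {:?}", start.elapsed());
}
```

Build with `--release` when comparing the two timings. If the two variants come out close, per-iteration allocation is probably not the main cause of the difference.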

kennetpostigo commented 3 years ago

@wisonye this might be out of scope, but I'd be curious to see how other web frameworks written in Rust perform here, and whether they produce similar results.

wisonye commented 3 years ago

Yup, it's not out of scope :) I'd also like to see how other frameworks perform. If you have time, add a minimal demo here and see what happens :)

slhmy commented 3 years ago

I found tide slower than Spring 😂

Fishrock123 commented 3 years ago

Sorry, I've been too busy using Tide in prod to dig into this.

I can tell you from production experience: it is orders of magnitude faster than Node.js for common workloads.

Fishrock123 commented 3 years ago

autocannon against your node.js example:

autocannon -c 16 -W -w 8 -d 20
Running 20s warmup @
16 connections
8 workers

Running 20s test @
16 connections
8 workers

│ Stat    │ 2.5% │ 50%  │ 97.5% │ 99%  │ Avg     │ Stdev   │ Max    │
│ Latency │ 0 ms │ 1 ms │ 1 ms  │ 1 ms │ 0.61 ms │ 1.46 ms │ 155 ms │

│ Stat      │ 1%      │ 2.5%    │ 50%    │ 97.5%   │ Avg     │ Stdev  │ Min     │
│ Req/Sec   │ 15199   │ 15199   │ 15535  │ 17679   │ 15782.4 │ 644.56 │ 15198   │
│ Bytes/Sec │ 3.13 MB │ 3.13 MB │ 3.2 MB │ 3.64 MB │ 3.25 MB │ 133 kB │ 3.13 MB │

Req/Bytes counts sampled once per second.

316k requests in 20.05s, 65 MB read

Autocannon against Tide (--release):

autocannon -c 16 -W -w 8 -d 20
Running 20s warmup @
16 connections
8 workers

Running 20s test @
16 connections
8 workers

│ Stat    │ 2.5% │ 50%  │ 97.5% │ 99%  │ Avg     │ Stdev  │ Max    │
│ Latency │ 0 ms │ 1 ms │ 2 ms  │ 2 ms │ 1.16 ms │ 0.9 ms │ 110 ms │

│ Stat      │ 1%     │ 2.5%   │ 50%     │ 97.5%   │ Avg     │ Stdev  │ Min    │
│ Req/Sec   │ 8967   │ 8967   │ 9167    │ 11063   │ 9449.6  │ 530.04 │ 8966   │
│ Bytes/Sec │ 1.2 MB │ 1.2 MB │ 1.23 MB │ 1.48 MB │ 1.27 MB │ 71 kB  │ 1.2 MB │

Req/Bytes counts sampled once per second.

189k requests in 20.04s, 25.3 MB read

That's kinda odd. It's definitely not what we observe but we also don't stress our Rust processes much (because they are plenty fast to carry our load).

Notes: This was done by running the benchmarker on my laptop (a slower machine) against the server examples on my desktop (a faster machine). Everything is wired together on gigabit ethernet.

Fishrock123 commented 3 years ago

Linux perf counter stats seem to indicate this is artificial (possibly TCP no_delay related):

 Performance counter stats for 'node benchmark_server.js':

        117,093.95 msec task-clock                #    2.461 CPUs utilized          
           469,264      context-switches          #    0.004 M/sec                  
            86,414      cpu-migrations            #    0.738 K/sec                  
           102,170      page-faults               #    0.873 K/sec                  
   272,687,722,707      cycles                    #    2.329 GHz                    
   122,404,642,364      instructions              #    0.45  insn per cycle         
    25,706,131,492      branches                  #  219.534 M/sec                  
     1,663,729,936      branch-misses             #    6.47% of all branches

 Performance counter stats for 'cargo run --release':

         49,355.48 msec task-clock                #    0.430 CPUs utilized          
           863,669      context-switches          #    0.017 M/sec                  
            23,368      cpu-migrations            #    0.473 K/sec                  
             7,929      page-faults               #    0.161 K/sec                  
    85,474,611,529      cycles                    #    1.732 GHz                    
    43,426,351,109      instructions              #    0.51  insn per cycle         
     8,534,130,491      branches                  #  172.912 M/sec                  
       529,585,015      branch-misses             #    6.21% of all branches 

Of note there, Tide does a bunch more context switching, but it's not too bad, I think.

Tide however uses less than a third of the cpu cycles.
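
To make that comparison concrete, the ratio can be worked out from the numbers posted above. This is a rough sketch with illustrative names: it assumes each perf run corresponds to the autocannon run whose request total is quoted earlier, and since perf also counted the warmup traffic, the per-request figures are only indicative.

```rust
// Figures copied from the perf and autocannon output above.
const NODE_CYCLES: f64 = 272_687_722_707.0;
const TIDE_CYCLES: f64 = 85_474_611_529.0;
const NODE_REQUESTS: f64 = 316_000.0; // "316k requests"
const TIDE_REQUESTS: f64 = 189_000.0; // "189k requests"

/// Ratio of Tide's total cycles to Node's total cycles.
fn total_cycle_ratio() -> f64 {
    TIDE_CYCLES / NODE_CYCLES
}

/// Cycles spent per measured request (warmup requests are not counted).
fn cycles_per_request(cycles: f64, requests: f64) -> f64 {
    cycles / requests
}

fn main() {
    println!("tide/node total cycles: {:.2}", total_cycle_ratio());
    println!(
        "node cycles/request: {:.0}",
        cycles_per_request(NODE_CYCLES, NODE_REQUESTS)
    );
    println!(
        "tide cycles/request: {:.0}",
        cycles_per_request(TIDE_CYCLES, TIDE_REQUESTS)
    );
}
```

Under those assumptions the total comes out a bit under a third, while per request Tide lands at roughly half of Node's cycles, because it also served fewer requests in this run.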

Fishrock123 commented 3 years ago

With tcp no_delay enabled via @jbr's draft PR, I get:

Running 20s warmup @
16 connections
8 workers

Running 20s test @
16 connections
8 workers

│ Stat    │ 2.5% │ 50%  │ 97.5% │ 99%  │ Avg     │ Stdev   │ Max    │
│ Latency │ 0 ms │ 0 ms │ 1 ms  │ 2 ms │ 0.53 ms │ 0.98 ms │ 130 ms │

│ Stat      │ 1%      │ 2.5%    │ 50%     │ 97.5%   │ Avg     │ Stdev   │ Min     │
│ Req/Sec   │ 12887   │ 12887   │ 14711   │ 19167   │ 15810.6 │ 2260.19 │ 12885   │
│ Bytes/Sec │ 1.73 MB │ 1.73 MB │ 1.97 MB │ 2.57 MB │ 2.12 MB │ 303 kB  │ 1.73 MB │

Req/Bytes counts sampled once per second.

316k requests in 20.03s, 42.4 MB read

Which is about on-par. I think my laptop is now the limiting factor. I'll try to run the benchmark in reverse.

Fishrock123 commented 3 years ago

Also, for that last example, we're still using only about half the cpu cycles for the same number of requests as Node.

 Performance counter stats for 'cargo run --release':

         55,935.65 msec task-clock                #    0.565 CPUs utilized          
         1,361,880      context-switches          #    0.024 M/sec                  
            22,762      cpu-migrations            #    0.407 K/sec                  
             7,881      page-faults               #    0.141 K/sec                  
   126,799,597,556      cycles                    #    2.267 GHz                    
    71,088,504,957      instructions              #    0.56  insn per cycle         
    13,899,982,207      branches                  #  248.500 M/sec                  
       583,560,089      branch-misses             #    4.20% of all branches

Fishrock123 commented 3 years ago

I am going to caution that no_delay may be ideal for this benchmarking workload but may not be ideal in the real world.
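
For readers unfamiliar with the option: TCP_NODELAY disables Nagle's algorithm, which otherwise may hold back small writes in order to coalesce them into fewer packets. The sketch below shows the socket-level switch using only the standard library's `TcpStream::set_nodelay`. It is not Tide's code or the draft PR's code; the helper `connect_nodelay` is hypothetical, and the loopback listener exists only to make the example self-contained.

```rust
use std::io;
use std::net::{SocketAddr, TcpListener, TcpStream};

/// Hypothetical helper: connect to `addr` and enable TCP_NODELAY on the stream.
fn connect_nodelay(addr: SocketAddr) -> io::Result<TcpStream> {
    let stream = TcpStream::connect(addr)?;
    stream.set_nodelay(true)?; // small writes are sent without waiting to be batched
    Ok(stream)
}

fn main() -> io::Result<()> {
    // Bind to an ephemeral loopback port so there is something to connect to.
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let stream = connect_nodelay(listener.local_addr()?)?;
    println!("nodelay enabled: {}", stream.nodelay()?);
    Ok(())
}
```

With many tiny request/response pairs, as in this benchmark, sending each small write immediately tends to help. With larger or batched writes the trade-off can go the other way.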

slhmy commented 3 years ago

I ran tfb with TCP_NODELAY. Req/Sec improved a lot, but latency also increased a lot.

8 threads and 8 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   644.08ms    1.07s    3.67s    81.67%
    Req/Sec    12.16k     4.52k   14.46k    87.83%

Comparing with..

tide = "0.16.0"
  8 threads and 8 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    15.47ms   16.24ms  45.97ms   80.29%
    Req/Sec    93.20     32.76   272.00     68.75%

The result looks really strange. I'm not familiar with this topic, but I think the following link will help.

Also, warp currently gives a more satisfying result on my computer.

  8 threads and 8 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   497.54us  550.14us  18.32ms   97.70%
    Req/Sec     2.02k   309.63     2.96k    78.77%

Understandably, a microbenchmark doesn't reflect the whole real world. But this result can't persuade me that tide is as good as other web app frameworks.

kennetpostigo commented 3 years ago

@Fishrock123 @wisonye @slhmy Are there any tools to inspect where time is spent while the server is running? That might offer some clues as to what you were seeing, @wisonye.

slhmy commented 3 years ago

I don't know if flamegraph will help... I'm kind of busy these days.

Fishrock123 commented 3 years ago

I want to take bottom-up perf stacks but don't know how to offhand with rust (and I am super busy).

fiag commented 3 years ago

autocannon against benchmark_server.js:

❯ autocannon -c 16 -W -w 8 -d 20
Running 20s warmup @
16 connections
8 workers

Running 20s test @
16 connections
8 workers

│ Stat    │ 2.5% │ 50%  │ 97.5% │ 99%  │ Avg     │ Stdev   │ Max    │
│ Latency │ 0 ms │ 0 ms │ 0 ms  │ 1 ms │ 0.34 ms │ 7.63 ms │ 376 ms │

│ Stat      │ 1%      │ 2.5%    │ 50%     │ 97.5%   │ Avg     │ Stdev   │ Min     │
│ Req/Sec   │ 16879   │ 16879   │ 32895   │ 39871   │ 31267.6 │ 5363.64 │ 16871   │
│ Bytes/Sec │ 3.09 MB │ 3.09 MB │ 6.02 MB │ 7.29 MB │ 5.72 MB │ 981 kB  │ 3.09 MB │

Req/Bytes counts sampled once per second.

625k requests in 20.21s, 114 MB read

autocannon against tide --release, with TCP_NODELAY:

❯ autocannon -c 16 -W -w 8 -d 20
Running 20s warmup @
16 connections
8 workers

Running 20s test @
16 connections
8 workers

│ Stat    │ 2.5% │ 50%  │ 97.5% │ 99%  │ Avg     │ Stdev   │ Max    │
│ Latency │ 0 ms │ 0 ms │ 0 ms  │ 1 ms │ 0.07 ms │ 1.68 ms │ 255 ms │

│ Stat      │ 1%      │ 2.5%    │ 50%     │ 97.5%   │ Avg     │ Stdev   │ Min     │
│ Req/Sec   │ 20015   │ 20015   │ 45023   │ 47871   │ 43102.4 │ 6650.74 │ 20001   │
│ Bytes/Sec │ 2.68 MB │ 2.68 MB │ 6.03 MB │ 6.41 MB │ 5.78 MB │ 891 kB  │ 2.68 MB │

Req/Bytes counts sampled once per second.

862k requests in 20.01s, 116 MB read

And make a flamegraph.

wisonye commented 3 years ago

@slhmy Hey, sorry for the late reply, I've been so busy lately. And YES, I think you're right. Actually, I think the real choice is between async-std and tokio, as that's the bigger difference under the hood :) It really depends :)

wisonye commented 3 years ago

@fiag ... That's funny :) But I remember that I did try your patch branch with the TCP_NODELAY setting, and the result I got was not much different. How come it makes such a big difference when you use it? :)

slhmy commented 3 years ago

🦀 Maybe more comparisons need to be made.

I recently got actix-web working with sqlx (sqlx runs on a tokio runtime, which is compatible with actix-web 4.0-beta), and there is a performance issue there too... (Check this issue. It is temporarily solved by running the queries on one connection.)

I also found there is a huge performance loss if I put an async server into a docker machine.

Putting the above together, my guess is that async-std spends a lot of time switching between threads, but I can't produce a flamegraph because of problems with my computer... So it's only a guess. 😂

wisonye commented 3 years ago

@slhmy Thanks for that:) Also, here is my personal opinion:

slhmy commented 3 years ago


  • I also found there is a huge performance loss if I put an async server into a docker machine. I did use async-std in production, and that high-performance TCP server is running inside docker swarm as well, where I didn't see any slowness. So what's your case actually? :)

Thanks a lot for your help. Actually, it's related to the issue I posted.

I run the service in a docker container, and actix+sqlx takes more than 20s to request 500 rows from the database. However, others like tide+sqlx do not (they take only around 300ms).

The 20s problem only appears in the docker machine built by tfb debug mode (tfb debug mode automatically runs two containers, one for the database and one for the server). When the server is not in docker, it doesn't take that long. Anyway, I also think your opinion is correct, so I will try more things and rule out any issues on my side if I can. Then the guess may lead to a conclusion.