I have a suspicion that this might be a macOS bug, as nextest has seen a very similar bug reported: https://github.com/nextest-rs/nextest/issues/262.
While it is most definitely a mac-specific bug (or M1-specific at least), I honestly doubt the relation to the nextest bug. The nextest bug seems more centered around process creation and directory listing, which hyper should not be doing on any platform :-) The example above is purely network-related, so no process creation or file I/O should be present here...
EDIT: file I/O is not directly related. The nextest bug is because nextest spawns a lot of processes, which requires approval from macOS's security validation. My example is just a single process, so this is not relevant in my case
This is very surprising. I've never seen the performance get worse in release mode. I don't have a Mac, so I can't really dig in, but it seems unlikely it's because of hyper.
Could it be the rustc version combined with your macOS version?
It reproduces across multiple versions of rustc, and I can't get it to reproduce on plain computational code (e.g. building and looking up in hashmaps, loops and numeric operations) - in all those cases, release performs significantly better than debug.
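For illustration, the sort of pure-computation check described here looks roughly like the following sketch (made-up sizes, not the exact code that was run); this kind of CPU-bound work shows a clear debug vs. release gap:

```rust
use std::collections::HashMap;
use std::time::Instant;

fn main() {
    let start = Instant::now();
    // Build a hashmap and look everything back up: purely CPU-bound work.
    let mut map = HashMap::new();
    for i in 0u64..1_000_000 {
        map.insert(i, i.wrapping_mul(2_654_435_761));
    }
    let mut sum = 0u64;
    for i in 0u64..1_000_000 {
        sum = sum.wrapping_add(map[&i]);
    }
    println!("sum = {sum}, elapsed = {:?}", start.elapsed());
}
```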
I might be able to take a stab at trying it with plain tokio TCP handling, to see if it reproduces then...
I found that changing `tokio::main` to use the `current_thread` flavor causes the performance to be better in release than debug, so I'm guessing the problem is scheduling or contention here...
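Concretely, the only change is the runtime flavor on the main attribute, along these lines (the function body is elided):

```rust
// Before (default multi-threaded scheduler):
// #[tokio::main]

// After (single-threaded scheduler):
#[tokio::main(flavor = "current_thread")]
async fn main() {
    // ... same hyper server setup as before ...
}
```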
In any case this doesn't reproduce with a plain tokio socket server imitating HTTP...
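For reference, a plain tokio socket server "imitating HTTP" can be as simple as the following sketch (the canned response and the read handling here are assumptions, not necessarily the exact code used):

```rust
use tokio::io::{AsyncReadExt, AsyncWriteExt};
use tokio::net::TcpListener;

#[tokio::main]
async fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:8888").await?;
    loop {
        let (mut socket, _) = listener.accept().await?;
        tokio::spawn(async move {
            let mut buf = [0u8; 1024];
            // Read whatever request bytes arrive and answer each read with a fixed HTTP response.
            while let Ok(n) = socket.read(&mut buf).await {
                if n == 0 {
                    break; // connection closed by the client
                }
                let resp = b"HTTP/1.1 200 OK\r\ncontent-length: 13\r\n\r\nHello World!\n";
                if socket.write_all(resp).await.is_err() {
                    break;
                }
            }
        });
    }
}
```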
I'd love to unpack this a bit!
Please sound off with objections to how I interpreted the issues, my hypotheses, testing methodologies, or result analysis. I am very open to feedback :heart:
> Running it and benchmarking it with `hey` (`hey http://127.0.0.1:8888`). All running on Rust 1.61.0, on a MacBook Pro with M1 Max (16").
>
> ...
>
> As you can see, average, slowest, AND fastest request times are slower in release, as well as the total amount of time the benchmark took to run... I did not expect release performance to be the same as debug, not to mention worse...
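For context, a minimal hyper 0.14 "hello world" server along the lines of the one being benchmarked looks roughly like this (a sketch; the handler body, port, and service wiring are assumptions rather than the reporter's exact code):

```rust
use std::convert::Infallible;
use std::net::SocketAddr;

use hyper::service::{make_service_fn, service_fn};
use hyper::{Body, Request, Response, Server};

async fn hello_world(_req: Request<Body>) -> Result<Response<Body>, Infallible> {
    Ok(Response::new(Body::from("Hello World!\n")))
}

#[tokio::main]
async fn main() {
    let addr = SocketAddr::from(([127, 0, 0, 1], 8888));
    // A new service_fn(hello_world) per connection, as in the hyper 0.14 examples.
    let make_svc = make_service_fn(|_conn| async { Ok::<_, Infallible>(service_fn(hello_world)) });

    if let Err(e) = Server::bind(&addr).serve(make_svc).await {
        eprintln!("server error: {e}");
    }
}
```

The handler does essentially no work, so almost all of the measured time is connection handling, scheduling, and waiting rather than user code.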
Rustc optimizations improve hyper performance. However, application-external aspects of the full request lifecycle introduce enough variance to result in comparable performance across compiler profiles when measured as a whole. This can be exaggerated with small sample sets and powerful hardware.
What is the appropriate way to describe the part of the work in a TCP socket transaction that is unaffected by the compiler profile being used (not constant exactly, but varying independently of it)?
> Applications that wait for work, like web servers with pools of threads waiting on a socket, present a challenge for off-CPU analysis: often most of the blocking time will be in stacks waiting for work, rather than doing work.^1
"off-CPU" seems like a strong candidate for what my theory suggests.
My hypothesis is evolving to something more like, 'Rustc cannot make aspects of your machine, outside the application code, more efficient or consistent.'
`hey`'s relevant default configuration:

| arg | description |
|---|---|
| `-n` | Number of requests to run. Default is 200. |
| `-c` | Number of workers to run concurrently. The total number of requests cannot be smaller than the concurrency level. Default is 50. |
| `-disable-keepalive` | Disable keep-alive, preventing re-use of TCP connections between different HTTP requests. Keep-alive is enabled by default. |
| `-cpus` | Number of CPU cores to use. Default for the current machine is 8 cores. |
My concerns about sample size are increasing, doubly so looking at the degree of concurrency.
To isolate the tool in testing, I will also be using ApacheBench with an equivalent command:
```
ab -k -n 200 -c 50 -q http://127.0.0.1:8888/ | grep 'Time taken'
```
> While it is most definitely a mac-specific bug (or M1-specific at least)
>
> I found that changing `tokio::main` to use the `current_thread` flavor causes the performance to be better in release than debug, so I'm guessing the problem is scheduling or contention here...
>
> In any case this doesn't reproduce with a plain tokio socket server imitating HTTP...
A socket server implemented with any combination of OS, language, or framework will experience a degree of I/O variance. With a powerful enough computer, that variance can overpower the executing code's performance gains from compiler optimizations. As a purely hypothetical illustration: if the handler costs tens of microseconds in debug and a few microseconds in release, but each request also spends on the order of milliseconds in socket and scheduler waits, the optimization win disappears into the noise of a 200-request sample.
To isolate each component of this issue I have tested with different combinations of benchmarking tools, profiling tools, operating systems, and server implementations.
To flip this issue on its head, I also put together a wildcard test case with every component (Mac, Rust, hyper) changed, to look for a similar result [^2].
| server | client command | MacOS (ARM) | Linux (container) | Windows 10 |
|---|---|---|---|---|
| hyper debug | `hey http://localhost:8888` | 13.0 ms | | |
| hyper release | `hey http://localhost:8888` | 14.2 ms | | |
| hyper debug | `ab -k -n 200 -c 50 -q http://127.0.0.1:8888/` | 9.8 ms | 8.6 ms | 11.2 ms |
| hyper release | `ab -k -n 200 -c 50 -q http://127.0.0.1:8888/` | 10.6 ms | 7 ms | 11.6 ms |
| hyper debug | `ab -k -n 50000 -c 50 -q http://127.0.0.1:8888/` | 401.8 ms | 286.6 ms | 838 ms |
| hyper release | `ab -k -n 50000 -c 50 -q http://127.0.0.1:8888/` | 358.2 ms | 230.4 ms | 582.8 ms |
| hyper debug current_thread | `ab -k -n 200 -c 50 -q http://127.0.0.1:8888/` | 12.4 ms | | |
| hyper release current_thread | `ab -k -n 200 -c 50 -q http://127.0.0.1:8888/` | 10.4 ms | | |
| tokio debug | `hey http://localhost:8888` | 13.2 ms | | |
| tokio release | `hey http://localhost:8888` | 12.2 ms | | |
| tokio debug | `ab -k -n 200 -c 50 -q http://127.0.0.1:8888/` | 11.2 ms | 6.4 ms | |
| tokio release | `ab -k -n 200 -c 50 -q http://127.0.0.1:8888/` | 11.5 ms | 6 ms | |
| dotnet debug | `ab -k -n 200 -c 50 -q http://127.0.0.1:5098/helloworld/` | 40.8 ms | | |
| dotnet release | `ab -k -n 200 -c 50 -q http://127.0.0.1:5098/helloworld/` | 41 ms | | |
Let's go over each issue in turn and compare with the data.
> Hyper performance degrades in release

On all tested platforms there is a possibility for a release build to underperform debug with a small sample set. By shifting any of the present variables (handler processing, degree of concurrency, or number of requests), you can see release consistently outperform debug by a large margin.
> Hyper issues are specific to Mac

See the Linux and Windows tests. The reported issue is present on all platforms tested with the supplied benchmarks.
> `#[tokio::main(flavor = "current_thread")]` doesn't have the same issue

The included results show that this is true. However, in larger tests I found that limiting tokio's runtime this way hurt overall performance with an adequately sized stress test. I believe the reasoning in question 1 holds true for this phenomenon.
> A tokio TCP server doesn't have the same problems as hyper

My tokio testing found that the same performance swings are present.
Next, I used cargo-flamegraph to take a closer look at the executing code and identify abnormal behaviors on MacOS between debug and release. The following graphs describe the code executed while handling `ab -k -n 200 -c 50 -q http://127.0.0.1:8888/`.
These graphs represent data produced with DTrace. (Click the pictures to visit the interactive SVGs with searching!)
Debug:
Release:
Everything seems in order, but flamegraphs are a beginning, not an end.
Time spent on the CPU is not the full picture: time is also spent waiting for I/O to complete, which does not get accounted for by tools like perf that only measure what is consuming time on the CPU. Check out Brendan Gregg's article on Off-CPU Accounting for more information about this.
Let's take that next step by attaching tracing instruments to our hyper server! (Integration Reference | supplemental reading)
```rust
use tracing_subscriber::fmt::format::FmtSpan;

// Near the top of main(): log span lifecycle events (new/enter/exit/close).
tracing_subscriber::fmt()
    .with_span_events(FmtSpan::FULL)
    .with_target(false)
    .with_level(false)
    .init();
```

```rust
use std::convert::Infallible;
use hyper::{Body, Request, Response};
use tracing::instrument;

// The handler is wrapped in a tracing span; `skip_all` avoids recording the request itself.
#[instrument(skip_all)]
async fn hello_world(_req: Request<Body>) -> Result<Response<Body>, Infallible> {
    Ok(Response::builder()
        .body(Body::from("Hello World!\n"))
        .unwrap())
}
```
Our server received a huge performance boost by compiling to release.
I believe hyper is in the clear. My testing did not demonstrate any profile-related performance shortcomings from hyper's codebase on Mac.
[^2]: C# is compiled (to intermediate code that is then JIT-compiled at run time) and has compiler profiles that claim to produce more efficient code in release mode.
Excellent write-up. I think we can close up here.
Version

h2 v0.3.13, hyper v0.14.19 (full set of features enabled)

Platform

Darwin ip-192-168-128-190.eu-west-1.compute.internal 21.5.0 Darwin Kernel Version 21.5.0: Tue Apr 26 21:08:37 PDT 2022; root:xnu-8020.121.3~4/RELEASE_ARM64_T6000 arm64

Description

Trying a naive piece of code for benchmarking HTTP performance:
Running it and benchmarking it with `hey` (`hey http://127.0.0.1:8888`). All running on Rust 1.61.0, on a MacBook Pro with M1 Max (16").

What I get is quite surprising; in debug mode, the results are as follows:

While in release mode (`cargo run --release`), I get this:

As you can see, average, slowest, AND fastest request times are slower in release, as well as the total amount of time the benchmark took to run... I did not expect release performance to be the same as debug, not to mention worse...
Any help figuring this out would be greatly appreciated!