I have a suspicion that this might be a macOS bug, as nextest has seen a very similar bug reported: https://github.com/nextest-rs/nextest/issues/262.
While it is most definitely a mac-specific bug (or M1-specific at least), I honestly doubt the relation to the nextest bug. The nextest bug seems more centered around process creation and directory listing, which hyper should not be doing on any platform :-) The example above is purely network-related, so no process creation or file I/O should be present here...
EDIT: file I/O is not directly related. The nextest bug is because nextest spawns a lot of processes, which requires approval from macOS's security validation. My example is just a single process, so this is not relevant in my case
This is very surprising. I've never seen the performance get worse in release mode. I don't have a Mac, so I can't really dig in, but it seems unlikely it's because of hyper.
Could it be the rustc version combined with your macOS version?
It reproduces across multiple versions of rustc, and I can't get it to reproduce on plain computational code (e.g. building and looking up in hashmaps, loops and numeric operations) - in all those cases, release performs significantly better than debug.
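For illustration, the sort of pure-computation check described here looks roughly like the following sketch (made-up sizes, not the exact code that was run); this kind of CPU-bound work shows a clear debug vs. release gap:

```rust
use std::collections::HashMap;
use std::time::Instant;

fn main() {
    let start = Instant::now();
    // Build a hashmap and look everything back up: purely CPU-bound work.
    let mut map = HashMap::new();
    for i in 0u64..1_000_000 {
        map.insert(i, i.wrapping_mul(2_654_435_761));
    }
    let mut sum = 0u64;
    for i in 0u64..1_000_000 {
        sum = sum.wrapping_add(map[&i]);
    }
    println!("sum = {sum}, elapsed = {:?}", start.elapsed());
}
```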
I might be able to take a stab at trying it with plain tokio TCP handling, to see if it reproduces then...
I found that changing `tokio::main` to use the `current_thread` flavor causes the performance to be better in release than debug, so I'm guessing the problem is scheduling or contention here...
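Concretely, the only change is the runtime flavor on the main attribute, along these lines (the function body is elided):

```rust
// Before (default multi-threaded scheduler):
// #[tokio::main]

// After (single-threaded scheduler):
#[tokio::main(flavor = "current_thread")]
async fn main() {
    // ... same hyper server setup as before ...
}
```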
In any case this doesn't reproduce with a plain tokio socket server imitating HTTP...
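For reference, a plain tokio socket server "imitating HTTP" can be as simple as the following sketch (the canned response and the read handling here are assumptions, not necessarily the exact code used):

```rust
use tokio::io::{AsyncReadExt, AsyncWriteExt};
use tokio::net::TcpListener;

#[tokio::main]
async fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:8888").await?;
    loop {
        let (mut socket, _) = listener.accept().await?;
        tokio::spawn(async move {
            let mut buf = [0u8; 1024];
            // Read whatever request bytes arrive and answer each read with a fixed HTTP response.
            while let Ok(n) = socket.read(&mut buf).await {
                if n == 0 {
                    break; // connection closed by the client
                }
                let resp = b"HTTP/1.1 200 OK\r\ncontent-length: 13\r\n\r\nHello World!\n";
                if socket.write_all(resp).await.is_err() {
                    break;
                }
            }
        });
    }
}
```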
I'd love to unpack this a bit!
Please sound off with objections to how I interpreted the issues, my hypotheses, testing methodologies, or result analysis. I am very open to feedback :heart:
> Running it and benchmarking it with `hey` (`hey http://127.0.0.1:8888`). All running on Rust 1.61.0, on a MacBook Pro with M1 Max (16").
>
> ...
>
> As you can see, average, slowest, AND fastest request times are slower in release, as well as the total amount of time the benchmark took to run... I did not expect release performance to be the same as debug, not to mention worse...
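For context, a minimal hyper 0.14 "hello world" server along the lines of the one being benchmarked looks roughly like this (a sketch; the handler body, port, and service wiring are assumptions rather than the reporter's exact code):

```rust
use std::convert::Infallible;
use std::net::SocketAddr;

use hyper::service::{make_service_fn, service_fn};
use hyper::{Body, Request, Response, Server};

async fn hello_world(_req: Request<Body>) -> Result<Response<Body>, Infallible> {
    Ok(Response::new(Body::from("Hello World!\n")))
}

#[tokio::main]
async fn main() {
    let addr = SocketAddr::from(([127, 0, 0, 1], 8888));
    // A new service_fn(hello_world) per connection, as in the hyper 0.14 examples.
    let make_svc = make_service_fn(|_conn| async { Ok::<_, Infallible>(service_fn(hello_world)) });

    if let Err(e) = Server::bind(&addr).serve(make_svc).await {
        eprintln!("server error: {e}");
    }
}
```

The handler does essentially no work, so almost all of the measured time is connection handling, scheduling, and waiting rather than user code.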
Rustc optimizations improve hyper performance. However, application-external aspects of the full request lifecycle introduce enough variance to result in comparable performance across compiler profiles when measured as a whole. This can be exaggerated with small sample sets and powerful hardware.
What is the appropriate way to describe the part of the work in a TCP socket transaction that is unaffected by the compiler profile being used (not constant exactly, but varying independently of it)?
> Applications that wait for work, like web servers with pools of threads waiting on a socket, present a challenge for off-CPU analysis: often most of the blocking time will be in stacks waiting for work, rather than doing work.^1
"off-CPU" seems like a strong candidate for what my theory suggests.
My hypothesis is evolving to something more like, 'Rustc cannot make aspects of your machine, outside the application code, more efficient or consistent.'
`hey`'s relevant default configuration:

| arg | description |
|---|---|
| `-n` | Number of requests to run. Default is 200. |
| `-c` | Number of workers to run concurrently. The total number of requests cannot be smaller than the concurrency level. Default is 50. |
| `-disable-keepalive` | Disable keep-alive, preventing re-use of TCP connections between different HTTP requests. Keep-alive is enabled by default. |
| `-cpus` | Number of CPU cores to use. Default for the current machine is 8 cores. |
My concerns about sample size are increasing, doubly so looking at the degree of concurrency.
To isolate the tool in testing, I will also be using ApacheBench with an equivalent command:
```
ab -k -n 200 -c 50 -q http://127.0.0.1:8888/ | grep 'Time taken'
```
> While it is most definitely a mac-specific bug (or M1-specific at least)
>
> I found that changing `tokio::main` to use the `current_thread` flavor causes the performance to be better in release than debug, so I'm guessing the problem is scheduling or contention here...
>
> In any case this doesn't reproduce with a plain tokio socket server imitating HTTP...
A socket server implemented with any combination of OS, language, or framework will experience a degree of I/O variance. With a powerful enough computer, that variance can overpower the executing code's performance gains from compiler optimizations. As a purely hypothetical illustration: if the handler costs tens of microseconds in debug and a few microseconds in release, but each request also spends on the order of milliseconds in socket and scheduler waits, the optimization win disappears into the noise of a 200-request sample.
To isolate each component of this issue I have tested with different combinations of benchmarking tools, profiling tools, operating systems, and server implementations.
To flip this issue on its head, I also put together a wildcard test case with every component (Mac, Rust, hyper) changed, to look for a similar result [^2].
| server | client command | MacOS (ARM) | Linux (container) | Windows 10 |
|---|---|---|---|---|
| hyper debug | `hey http://localhost:8888` | 13.0 ms | | |
| hyper release | `hey http://localhost:8888` | 14.2 ms | | |
| hyper debug | `ab -k -n 200 -c 50 -q http://127.0.0.1:8888/` | 9.8 ms | 8.6 ms | 11.2 ms |
| hyper release | `ab -k -n 200 -c 50 -q http://127.0.0.1:8888/` | 10.6 ms | 7 ms | 11.6 ms |
| hyper debug | `ab -k -n 50000 -c 50 -q http://127.0.0.1:8888/` | 401.8 ms | 286.6 ms | 838 ms |
| hyper release | `ab -k -n 50000 -c 50 -q http://127.0.0.1:8888/` | 358.2 ms | 230.4 ms | 582.8 ms |
| hyper debug current_thread | `ab -k -n 200 -c 50 -q http://127.0.0.1:8888/` | 12.4 ms | | |
| hyper release current_thread | `ab -k -n 200 -c 50 -q http://127.0.0.1:8888/` | 10.4 ms | | |
| tokio debug | `hey http://localhost:8888` | 13.2 ms | | |
| tokio release | `hey http://localhost:8888` | 12.2 ms | | |
| tokio debug | `ab -k -n 200 -c 50 -q http://127.0.0.1:8888/` | 11.2 ms | 6.4 ms | |
| tokio release | `ab -k -n 200 -c 50 -q http://127.0.0.1:8888/` | 11.5 ms | 6 ms | |
| dotnet debug | `ab -k -n 200 -c 50 -q http://127.0.0.1:5098/helloworld/` | 40.8 ms | | |
| dotnet release | `ab -k -n 200 -c 50 -q http://127.0.0.1:5098/helloworld/` | 41 ms | | |
Let's go over each issue in turn and compare with the data.
> Hyper performance degrades in release

On all tested platforms there is a possibility for a release build to underperform debug with a small sample set. By shifting any of the present variables (handler processing, degree of concurrency, or number of requests), you can see release consistently outperform debug by a large margin.
> Hyper issues are specific to Mac

See the Linux and Windows tests. The reported issue is present on all platforms tested with the supplied benchmarks.
> `#[tokio::main(flavor = "current_thread")]` doesn't have the same issue

The included results show that this is true. However, in larger tests I found that limiting tokio's runtime this way hurt overall performance with an adequately sized stress test. I believe the reasoning in question 1 holds true for this phenomenon.
> A tokio TCP server doesn't have the same problems as hyper

My tokio testing found that the same performance swings are present.
Next, I used cargo-flamegraph to take a closer look at the executing code and identify abnormal behaviors on MacOS between debug and release. The following graphs describe the code executed while handling `ab -k -n 200 -c 50 -q http://127.0.0.1:8888/`.
These graphs represent data produced with DTrace. (Click the pictures to visit the interactive SVGs with searching!)
Debug:
Release:
Everything seems in order, but flamegraphs are a beginning, not an end.
Time spent on the CPU is not the full picture: time is also spent waiting for I/O to complete, which does not get accounted for by tools like perf that only measure what is consuming time on the CPU. Check out Brendan Gregg's article on Off-CPU Accounting for more information about this.
Let's take that next step by attaching tracing instruments to our hyper server! (Integration Reference | supplemental reading)
```rust
use tracing_subscriber::fmt::format::FmtSpan;

// Near the top of main(): log span lifecycle events (new/enter/exit/close).
tracing_subscriber::fmt()
    .with_span_events(FmtSpan::FULL)
    .with_target(false)
    .with_level(false)
    .init();
```

```rust
use std::convert::Infallible;
use hyper::{Body, Request, Response};
use tracing::instrument;

// The handler is wrapped in a tracing span; `skip_all` avoids recording the request itself.
#[instrument(skip_all)]
async fn hello_world(_req: Request<Body>) -> Result<Response<Body>, Infallible> {
    Ok(Response::builder()
        .body(Body::from("Hello World!\n"))
        .unwrap())
}
```
Our server received a huge performance boost by compiling to release.
I believe hyper is in the clear. My testing did not demonstrate any profile-related performance shortcomings from hyper's codebase on Mac.
[^2]: C# is compiled (to intermediate code that is then JIT-compiled at run time) and has compiler profiles that claim to produce more efficient code in release mode.
Excellent write-up. I think we can close up here.
Version

h2 v0.3.13, hyper v0.14.19 (full set of features enabled)

Platform

Darwin ip-192-168-128-190.eu-west-1.compute.internal 21.5.0 Darwin Kernel Version 21.5.0: Tue Apr 26 21:08:37 PDT 2022; root:xnu-8020.121.3~4/RELEASE_ARM64_T6000 arm64

Description

Trying a naive piece of code for benchmarking HTTP performance:
Running it and benchmarking it with `hey` (`hey http://127.0.0.1:8888`). All running on Rust 1.61.0, on a MacBook Pro with M1 Max (16").

What I get is quite surprising; in debug mode, the results are as follows:

While in release mode (`cargo run --release`), I get this:

As you can see, average, slowest, AND fastest request times are slower in release, as well as the total amount of time the benchmark took to run... I did not expect release performance to be the same as debug, not to mention worse...
Any help figuring this out would be greatly appreciated!