hyperium / hyper

An HTTP library for Rust
https://hyper.rs
MIT License

Server hangs on MacOS with keepalives disabled #3182

Closed ns476 closed 1 year ago

ns476 commented 1 year ago

Version Hyper 0.14.25 / Tokio 1.26.0

Platform MacOS 13.2.1

Description The following minimal server hangs eventually on MacOS when I make requests with keepalives disabled:

use std::{net::SocketAddr, convert::Infallible};

use hyper::{service::{make_service_fn, service_fn}, Body, Response, Request, Server};

async fn hello_world(_req: Request<Body>) -> Result<Response<Body>, Infallible> {
    Ok(Response::new("Hello, World".into()))
}

#[tokio::main]
async fn main() {
    let addr = SocketAddr::from(([127, 0, 0, 1], 8080));

    let make_svc = make_service_fn(|_conn| async {
        // service_fn converts our function into a `Service`
        Ok::<_, Infallible>(service_fn(hello_world))
    });

    let server = Server::bind(&addr)
        .serve(make_svc);

    if let Err(e) = server.await {
        eprintln!("server error: {}", e);
    }
}

I can trigger the issue with ApacheBench:

$ ab -t 30 -c 20 http://127.0.0.1:8080/
This is ApacheBench, Version 2.3 <$Revision: 1901567 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 127.0.0.1 (be patient)
Completed 5000 requests
Completed 10000 requests
Completed 15000 requests
<hangs>

# Enable keepalives with -k and it works
$ ab -k -t 30 -c 20 http://127.0.0.1:8080/
This is ApacheBench, Version 2.3 <$Revision: 1901567 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 127.0.0.1 (be patient)
Completed 5000 requests
Completed 10000 requests
Completed 15000 requests
Completed 20000 requests
Completed 25000 requests
Completed 30000 requests
Completed 35000 requests
Completed 40000 requests
Completed 45000 requests
Completed 50000 requests
Finished 50000 requests

It also occurs with k6 if I disable keepalives in the server with .http1_keepalive(false).

Exactly the same program works fine on Linux so I am fairly confident this is MacOS specific.

seanmonstar commented 1 year ago

Interesting. Enabling logs might help debug. Though, I don't have access to a Mac.

IsaacCloos commented 1 year ago

👋🏻

I've got a Mac and had some free time this morning.

Testing

I ran @ns476's program on the following platforms with the following tests:

👨🏻‍🔬                     apache bench   k6
MacOS (arm64)          ✅             ✅
Linux (container)      ✅             ⛔
Windows (Windows 10)   ✅             ⛔

I found similar results. MacOS stalled out at around 16k requests, while Windows and Linux finished the test without issue. I also found that reusing connections (enabling keep-alive) resulted in a successful test on all platforms.

Next, I threw together an equivalently basic web server with .NET 7 (the MVC controllers API template) and another with Node.js and Express. All of my test results were the same: MacOS stalled out when it could not reuse its connections.

In all scenarios without reusable connections, you can even:

  1. kill the running server after it stalls
  2. start it again
  3. try again
  4. still wait for some period of time before the first request starts

My impression from this testing is that this is an issue with the network stack on Mac, outside of hyper.

Research

That magic ~16k number is no coincidence, and neither is the fact that after exactly 30 seconds the test completed another ~16k requests before stalling again.

On macOS the default ephemeral port range is 49152 to 65535, for a total of 16384 ports.^1

When a TCP connection is closed from the server side, the port doesn't immediately available to be used because the connection will first transits into TIME_WAIT state ... By default, MacOS have a msl time of 15 seconds. Hence, according to the specs, the connection will have to wait around 30 seconds before it can transits into CLOSED state.^2
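The two figures above combine into a simple budget. Here is a rough sketch of that arithmetic, assuming the defaults quoted above (ports 49152 to 65535, MSL of 15 seconds) and assuming every closed connection ties up its port for the full 2 × MSL. The function names are mine, for illustration only:

```rust
/// Number of ports in an inclusive ephemeral port range.
fn ephemeral_port_count(first: u16, last: u16) -> u32 {
    u32::from(last) - u32::from(first) + 1
}

/// TIME_WAIT lasts twice the maximum segment lifetime (MSL).
fn time_wait_secs(msl_secs: u32) -> u32 {
    2 * msl_secs
}

/// Upper bound on new connections per second if every port sits in
/// TIME_WAIT after use (integer division, rounds down).
fn max_sustained_conn_per_sec(ports: u32, time_wait_secs: u32) -> u32 {
    ports / time_wait_secs
}

fn main() {
    let ports = ephemeral_port_count(49152, 65535);
    let wait = time_wait_secs(15);
    println!("ports: {ports}");
    println!("TIME_WAIT: {wait} s");
    println!("sustained: ~{} conn/s", max_sustained_conn_per_sec(ports, wait));
}
```

Under these assumptions a burst of 16384 connections exhausts the range, and anything faster than roughly 546 new connections per second cannot be sustained, which matches the stall-then-resume pattern seen with ApacheBench.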

While waiting for ApacheBench to finish, you can see the connections sitting in TIME_WAIT (about 16k of them) with this command: netstat -p tcp -n | grep TIME_WAIT
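If you would rather get a number than scroll through 16k lines, the same check can be sketched in Rust. This assumes the connection state is the last whitespace-separated column, as it is in `netstat -p tcp -n` output on my machine. The sample text is hand-written with made-up addresses, not captured output:

```rust
/// Counts lines whose last column is TIME_WAIT in `netstat`-style text.
/// Assumption: the state is the final whitespace-separated field.
fn count_time_wait(netstat_output: &str) -> usize {
    netstat_output
        .lines()
        .filter(|line| line.split_whitespace().last() == Some("TIME_WAIT"))
        .count()
}

fn main() {
    // Illustrative sample in the shape of `netstat -p tcp -n` output.
    let sample = "\
Proto Recv-Q Send-Q  Local Address          Foreign Address        (state)
tcp4       0      0  127.0.0.1.8080         127.0.0.1.49152        TIME_WAIT
tcp4       0      0  127.0.0.1.8080         127.0.0.1.49153        TIME_WAIT
tcp4       0      0  127.0.0.1.8080         *.*                    LISTEN
tcp4       0      0  127.0.0.1.8080         127.0.0.1.49154        ESTABLISHED
";
    println!("TIME_WAIT sockets: {}", count_time_wait(sample));
}
```

To use it on real data, feed it the captured text of the netstat command instead of the sample string.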

Interestingly, while running the test on Linux, you can use an equivalent check: netstat -antu | grep TIME_WAIT

You can see that there are just as many connections in a TIME_WAIT state, but Linux doesn't have an issue determining that the request is actually finished and forgoes the default 60-second timer (double the length that MacOS makes you wait regardless of the request's state).

You can manually edit your system's TCP behavior on a Mac like this^2: sudo sysctl net.inet.tcp.msl=1000

This chews through 16k requests every 2 seconds or so (although it is not recommended to edit your system in this way).

You can also increase your available port range for TCP connections like this^3:

$ sudo sysctl -w net.inet.ip.portrange.first=32768
net.inet.ip.portrange.first: 49152 -> 32768

This allows you to get through ~32k requests at a time before you stall out.
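To see how both knobs change the stall behavior, here is a toy simulation of the port pool. It is a deliberate simplification: it treats each connection as holding one port for 2 × MSL and ignores everything else the kernel does. It also assumes the `net.inet.tcp.msl` value is in milliseconds, which is consistent with the 2-second figure above. `PortPool` and `connections_before_stall` are hypothetical names for this sketch:

```rust
use std::collections::VecDeque;

/// Toy model of an ephemeral port pool with TIME_WAIT.
struct PortPool {
    free: u32,
    time_wait_ms: u64,
    /// Release times (ms) of ports in TIME_WAIT, oldest first.
    waiting: VecDeque<u64>,
}

impl PortPool {
    fn new(first: u16, last: u16, msl_ms: u64) -> Self {
        assert!(msl_ms > 0, "the model needs a positive MSL");
        PortPool {
            free: u32::from(last) - u32::from(first) + 1,
            time_wait_ms: 2 * msl_ms,
            waiting: VecDeque::new(),
        }
    }

    /// Opens and immediately closes one connection at `now_ms`.
    /// Returns false when no port is available (the stall).
    fn connect_and_close(&mut self, now_ms: u64) -> bool {
        // Reclaim ports whose TIME_WAIT has expired.
        while let Some(&release) = self.waiting.front() {
            if release > now_ms {
                break;
            }
            self.waiting.pop_front();
            self.free += 1;
        }
        if self.free == 0 {
            return false;
        }
        self.free -= 1;
        self.waiting.push_back(now_ms + self.time_wait_ms);
        true
    }
}

/// Counts back-to-back connections at time 0 until the pool stalls.
fn connections_before_stall(first: u16, last: u16, msl_ms: u64) -> u32 {
    let mut pool = PortPool::new(first, last, msl_ms);
    let mut n = 0;
    while pool.connect_and_close(0) {
        n += 1;
    }
    n
}

fn main() {
    // Default range and MSL quoted earlier in this comment.
    println!("default: {}", connections_before_stall(49152, 65535, 15_000));
    // After net.inet.ip.portrange.first=32768.
    println!("wider range: {}", connections_before_stall(32768, 65535, 15_000));

    // After net.inet.tcp.msl=1000: ports come back 2000 ms later.
    let mut pool = PortPool::new(49152, 65535, 1_000);
    while pool.connect_and_close(0) {}
    println!("retry at 1999 ms: {}", pool.connect_and_close(1_999));
    println!("retry at 2000 ms: {}", pool.connect_and_close(2_000));
}
```

In this model, widening the range only raises the burst size before the stall, while lowering the MSL shortens how long the stall lasts.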

Extra reading^4

Conclusion

This isn't hyper's issue, but I'm not entirely convinced that it has no agency in the matter. What it could do exactly I'm not sure, but it doesn't seem to be a problem that any of the other frameworks I tested address out-of-the-box either^5.

seanmonstar commented 1 year ago

Superb research and write-up, thank you! In this case, I'm going to close as not a problem with hyper. If there's a simple thing that is found to work in the future, perhaps we can do that.