hyperium / hyper

An HTTP library for Rust
https://hyper.rs

Error: IncompleteMessage: connection closed before message completed #2136

Closed: fhsgoncalves closed this issue 4 years ago

fhsgoncalves commented 4 years ago

Hey, I'm experiencing a weird behavior with the hyper client when using https. Sometimes my app in production fails to perform the request, but the same request works most of the time. I ran a load test locally to try to reproduce the problem, and I could reproduce it: it occurs on ~0.02% of requests.

I guessed it could be something related to hyper-tls, so I switched to hyper-rustls, but the same problem continued to occur. Then I tried to hit the URL using http instead of https, and the error went away!

The error I receive from hyper::Client::get is: hyper::Error(IncompleteMessage): connection closed before message completed.

Here is a minimal working example to reproduce the error:

Cargo.toml:

[dependencies]
hyper = "0.13"
tokio = { version = "0.2", features = ["full"] }
hyper-tls = "0.4.1"

src/main.rs:

use std::convert::Infallible;
use std::net::SocketAddr;

use hyper::service::{make_service_fn, service_fn};
use hyper::{Body, Client, Response, Server, Uri};
use hyper_tls::HttpsConnector;

pub type HttpClient = Client<HttpsConnector<hyper::client::connect::HttpConnector>>;

#[tokio::main]
async fn main() {
    let addr = SocketAddr::from(([0, 0, 0, 0], 8100));
    let client = Client::builder().build::<_, hyper::Body>(HttpsConnector::new());

    let make_service = make_service_fn(move |_| {
        let client = client.clone();
        async move { Ok::<_, Infallible>(service_fn(move |_req| handle(client.clone()) )) }
    });

    let server = Server::bind(&addr).serve(make_service);

    println!("Listening on http://{}", addr);

    if let Err(e) = server.await {
        eprintln!("server error: {}", e);
    }
}

async fn handle(client: HttpClient) -> Result<Response<Body>, hyper::Error> {

    let url = "https://url-here"; // CHANGE THE URL HERE!

    match client.get(url.parse::<Uri>().unwrap()).await {
        Ok(resp) => Ok(resp),
        Err(err) => { eprintln!("{:?} {}", err, err); Err(err) }
    }
}

PS: replace the url value with a valid https URL. In my tests I used a small file on AWS S3.

I performed a local load test using hey:

$ hey -z 120s -c 150 http://localhost:8100 

Running the test for 2 minutes (-z 120s) was enough to see some errors appearing.

Could anyone help me out? If I need to provide more information, or anything, just let me know. Thank you!

seanmonstar commented 4 years ago

This is just due to the racy nature of networking.

hyper has a connection pool of idle connections, and it selected one to send your request. Most of the time, hyper will receive the server's FIN and drop the dead connection from its pool. But occasionally, a connection will be selected from the pool and written to at the same time the server is deciding to close the connection. Since hyper already wrote some of the request, it can't really retry it automatically on a new connection, since the server may have acted already.
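To make the consequence concrete: hyper cannot safely retry a partially written request, but an application that knows a given request is idempotent can retry it itself. A rough sketch, reusing the HttpClient alias and imports from the example above and assuming hyper 0.13's Error::is_incomplete_message helper is available (the get_with_retry name is just for illustration):

// Sketch: retry an idempotent GET once when the pooled connection was
// closed by the server mid-request (surfaced as an IncompleteMessage error).
async fn get_with_retry(client: &HttpClient, url: Uri) -> Result<Response<Body>, hyper::Error> {
    match client.get(url.clone()).await {
        // Retrying is only safe here because the GET is known to be idempotent;
        // `is_incomplete_message()` matches exactly the error discussed in this issue.
        Err(err) if err.is_incomplete_message() => client.get(url).await,
        other => other,
    }
}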

fhsgoncalves commented 4 years ago

Hey, thank you for the swift response!

I got it! So the connection is being reused, right? Is that due to the keep-alive option? If so, should disabling that flag, or performing a retry on the app side, solve the issue?


Also, I could not reproduce the error when requesting a URL over http. I tried many times without success; I could only reproduce the issue when requesting a URL over https.

If that is the reason, I should have experienced the issue when using http too, right?

fhsgoncalves commented 4 years ago

I just found that AWS S3 has a default max idle timeout of 20s, while hyper's default keep_alive_timeout is 90s.

Setting the keep_alive_timeout to less than 20s on the hyper client seems to have solved the problem!
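For anyone landing here later, the fix is a one-line change on the client builder. A sketch against the example from the first comment, assuming hyper 0.13's keep_alive_timeout builder method (later releases call this pool_idle_timeout); 15s is just an arbitrary value below S3's 20s idle timeout:

use std::time::Duration;

use hyper::{Body, Client};
use hyper_tls::HttpsConnector;

// Drop idle pooled connections after 15s, comfortably below S3's ~20s
// server-side idle timeout, so hyper never reuses a connection the server
// is about to close.
fn build_client() -> Client<HttpsConnector<hyper::client::connect::HttpConnector>> {
    Client::builder()
        .keep_alive_timeout(Duration::from_secs(15))
        .build::<_, Body>(HttpsConnector::new())
}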

Thank you, your explanation really helped me understand why this was happening!

fhsgoncalves commented 4 years ago

I was looking at the Java AWS client, and I saw that they set the max-idle-timeout to 60s, but there is a second property called validate-after-inactivity (5s default) that allows the idle timeout to be that high. Looking at the code, I saw that the HTTP client they use supports this behavior.

Would it be possible to implement the same behavior in hyper? Does it make sense? :smile:

seanmonstar commented 4 years ago

I believe the "revalidation" it does is to poll that it is readable. In hyper, we already register for when the OS discovers the connection has hung up. The race would still exist, if the "revalidation" happened at the same time the server was closing.

ronanyeah commented 3 years ago

Anyone getting this with reqwest, try this:

let client = reqwest::Client::builder()
    .pool_max_idle_per_host(0)
    .build()?;

https://github.com/wyyerd/stripe-rs/pull/172

Rudo2204 commented 3 years ago

Well, I tried to use .pool_max_idle_per_host(0) and I still got this error today.

loyd commented 3 years ago

Doesn't hyper take into account the Keep-Alive header?

I've faced this problem with the ClickHouse HTTP crate. However, ClickHouse sends Keep-Alive: timeout=3, so I don't understand why hyper doesn't handle it.

@seanmonstar, any ideas?

joleeee commented 12 months ago

I think it's unexpected for most people that this isn't automatically retried. If I ask the library to get a website for me, I expect it not to fail because the keep-alive timed out. If it uses keep-alive by default, it should also be able to handle it properly?

Am I misunderstanding anything here? I think closing this as completed is misleading :-)

cschramm commented 12 months ago

it should also be able to handle it properly?

It's just not possible conceptually, is it? See https://github.com/hyperium/hyper/issues/2136#issuecomment-589345238

Rather, the application developer has to decide whether the request should get retried, e.g. if it's a well-behaved GET request, or if it's visible at the application level whether the request has had the desired effect yet.

Side note: there seem to be some weird servers that silently time out connections, meaning that they do not close the connection when their timeout is reached but unconditionally close it as soon as it gets reused later. While you can counteract that with a suitable pool_idle_timeout (sketched below), I think it would be possible for the client to trigger that behavior before sending an actual request. It would still be as racy as any connection if it does not trigger, though.
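For example, with reqwest that pool_idle_timeout knob looks roughly like this (the 10s value is an arbitrary choice that just needs to stay below the server's silent timeout):

use std::time::Duration;

// Sketch: drop idle pooled connections well before the server's silent
// timeout, so a connection the server already considers dead is never
// picked up for a new request.
let client = reqwest::Client::builder()
    .pool_idle_timeout(Duration::from_secs(10))
    .build()?;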