sfackler / rust-postgres

Native PostgreSQL driver for the Rust programming language
Apache License 2.0
3.44k stars 436 forks source link

125k errors/second and $800 of costs #975

Closed jakajancar closed 1 year ago

jakajancar commented 1 year ago

Hey,

I'm not sure whether to be impressed or scared. My application (using tokio-postgres) started producing ~125k errors per second (and incurred $800 of AWS CloudWatch Logs costs). The specific error was:

connection error: error communicating with the server: Socket not connected (os error 107)
Screenshot 2023-01-01 at 3 24 58 PM

This is coming from here:

async fn create_client(config: &tokio_postgres::Config) -> Result<tokio_postgres::Client, tokio_postgres::Error> {
    let mut config = config.clone();
    config.connect_timeout(Duration::from_secs(5)); // default is no timeout

    let (client, mut connection) = config.connect(make_tls()).await?;
    tokio::spawn(async move {
        let mut messages = futures::stream::poll_fn(move |cx| connection.poll_message(cx));
        loop {
            match messages.next().await {
                Some(Ok(AsyncMessage::Notice(notice))) => {
                    tracing::debug!("connection notice: {}", notice);
                }
                Some(Ok(_)) => {}
                Some(Err(error)) => {
                    tracing::error!("connection error: {}", error);
                }
                None => break
            }
        }
    });
    Ok(client)
}

I suspect the reason is that poll_message calls poll_shutdown, which calls Framed/Sink::poll_close, which returns Err. According to the documentation of Sink, "If this function encounters an error, the sink should be considered to have failed permanently, and no more Sink methods should be called.", however, tokio-postgres does not remember this and keeps calling it.

Or perhaps this is expected and I should stop reading after Err is returned, in which case it would be nice to explicitly document this. Because of the Option and the semantics of a stream, I assumed the interface was Stream-y and terminates after None is returned.

sfackler commented 1 year ago

The general expectation is that errors returned from poll_message are fatal and it shouldn't be called again. I can update the documentation to make that more clear.

jakajancar commented 1 year ago

👍🏻 So None is expected only for graceful disconnect?

Also, out of curiosity, is there a reason for not just implementing Stream?

sfackler commented 1 year ago

👍🏻 So None is expected only for graceful disconnect?

Yep

Also, out of curiosity, is there a reason for not just implementing Stream?

There's a lot of method name overlap between FutureExt and StreamExt, and the connection type already implements Future. Having both impls would prevent those combinators from being callable cleanly if both traits were in scope.

jakajancar commented 1 year ago

Good point, thanks!