Context cancel hangs if there is no network connectivity

banks commented 7 years ago

Hi,

Not sure if this is a bug or if it is effectively the same as #584. But it appears to be a significant Gotcha if I'm understanding correctly.

Setup

package main

import (
    "context"
    "database/sql"
    "fmt"
    "time"

    _ "github.com/lib/pq"
)

func main() {
    db, err := sql.Open("postgres", "postgresql://127.0.0.1:5432/postgres?sslmode=disable&connect_timeout=3")
    if err != nil {
        panic(err)
    }

    for {
        ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
        rows, err := db.QueryContext(ctx, "SELECT 1")
        if err != nil {
            fmt.Println("ERROR:", err)
        }
        rows.Next()
        var i int
        err = rows.Scan(&i)
        rows.Close()
        cancel()
        if err != nil {
            panic(err)
        }
        fmt.Println("Got", i)
        time.Sleep(1 * time.Second)
    }
}

Behaviour

Code above works fine with a local postgres instance up
If you stop the PG instance the connect will be reset and it will fail loudly as expected
Do any of the following to simulate failure:
- simulate process busy/hung by finding the PID of the connected postgres process and then kill -stop <pid>
- simulate packet loss by adding echo "block drop quick on lo0 proto tcp from any to any port = 5432" | sudo pfctl -ef - on OS X or equivalent iptables rule to drop packets to the local database
Expected behavior: that the client's timeout after 5 seconds with an error and print to screen before retrying
Actual Behaviour: the client remains hung indefinitiely in either case until the "failure" is resolved

Cause

It seems to be due to the fact that the code handling Context cancelation (https://github.com/lib/pq/blob/master/conn_go18.go#L67) attempts to send a cancel op to the database and waits forever for a response
The cancel method even calls dial internally but this is not even respecting the connect_timeout option, possibly because it's returning a pooled connection? Even my trivial test example doesn't timeout though and there should only be a single connection made.

Ramifications

In my case the motivation for wanting to use Contexts in the first place is that I need tight bounds on execution time and can't afford to have clients hang indefinitely on DB whatever the error condition is in the real-world (both packet loss and hung server processes happen).

I could implement this externally presumably with my own watcher on the context in another Goroutine that tears down the whole connection, but it seems like this mechanism should work for my case without re-inventing it outside.

Is there an alternative option?

tamird commented 7 years ago

Yep, this is a bug! We need to set connection deadlines using net.Conn.Set{,Read}Deadline in our cancel function and close the underlying connection (the primary, not the secondary one used for cancellation) when they time out. cc @mjibson

maddyblue commented 7 years ago

I'll work on this soon.

maddyblue commented 7 years ago

I'm no longer certain this is a bug. If the connect_timeout option is passed to the driver, then both connections should set their deadlines correctly. It is unclear if we should always set a deadline on the cancellation connection, but I think it should be the same as the normal connection, and so should use the same options.

The point about the cancel method waiting for a response is accurate, but I think it's also correct. If we didn't wait for a response then we wouldn't know if the server received the cancellation request (this may be ok, since the server won't send anything back, however I'm not certain it's ok to just open a connection, send some bytes, then immediately close it. It is really important that things happen in a correct order (see all the finished chan stuff in watchCancel) to make sure that we do not return a connection to the pool if that connection could get a cancel request later on. So I'm not sure if we are able to safely return very soon after a context has been cancelled because I think the connection could be incorrectly cancelled in the future.

I'm also not sure how to test this, because we need something that doesn't require messing with firewall rules. Maybe a network proxy setup for this function that stops relaying data on a trigger?

banks commented 7 years ago

You could test if you control the Postgres process by sending sigstop and sigcont. But yes there are networking options to simulate failures in testing that might be simpler.

I'd say it's at least surprising behaviour as it is and an important missing feature if there is no way to have a deadline on queries that is actually respected in case of a slow database or network failure.

FWIW I did have connect timeout set to 5 seconds and this was not respected in the failure cases described. Perhaps that is the bug?

On 6 Jun 2017, at 08:51, Matt Jibson notifications@github.com wrote:

I'm no longer certain this is a bug. If the connect_timeout option is passed to the driver, then both connections should set their deadlines correctly. It is unclear if we should always set a deadline on the cancellation connection, but I think it should be the same as the normal connection, and so should use the same options.

The point about the cancel method waiting for a response is accurate, but I think it's also correct. If we didn't wait for a response then we wouldn't know if the server received the cancellation request (this may be ok, since the server won't send anything back, however I'm not certain it's ok to just open a connection, send some bytes, then immediately close it. It is really important that things happen in a correct order (see all the finished chan stuff in watchCancel) to make sure that we do not return a connection to the pool if that connection could get a cancel request later on. So I'm not sure if we are able to safely return very soon after a context has been cancelled because I think the connection could be incorrectly cancelled in the future.

I'm also not sure how to test this, because we need something that doesn't require messing with firewall rules. Maybe a network proxy setup for this function that stops relaying data on a trigger?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.

banks commented 7 years ago

If the concern is that it might be unsafe to return a connection to pool after cancelling, would it be simplest to treat this as a special case where a connection that hit timeout before getting cancellation response is just discarded as an error/unknown connection state (I.e. Not returned to pool) or left to run for some longish time in background to see if cancellation gets processed eventually?

I'm not intimately familiar with the interface database/sql requires from the driver so that might preclude just discarding connections as I suggest.

On 6 Jun 2017, at 08:51, Matt Jibson notifications@github.com wrote:

I'm no longer certain this is a bug. If the connect_timeout option is passed to the driver, then both connections should set their deadlines correctly. It is unclear if we should always set a deadline on the cancellation connection, but I think it should be the same as the normal connection, and so should use the same options.

The point about the cancel method waiting for a response is accurate, but I think it's also correct. If we didn't wait for a response then we wouldn't know if the server received the cancellation request (this may be ok, since the server won't send anything back, however I'm not certain it's ok to just open a connection, send some bytes, then immediately close it. It is really important that things happen in a correct order (see all the finished chan stuff in watchCancel) to make sure that we do not return a connection to the pool if that connection could get a cancel request later on. So I'm not sure if we are able to safely return very soon after a context has been cancelled because I think the connection could be incorrectly cancelled in the future.

I'm also not sure how to test this, because we need something that doesn't require messing with firewall rules. Maybe a network proxy setup for this function that stops relaying data on a trigger?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.

lattamore commented 7 years ago

I have the same issue, but the thread waits forever for the cn.query in QueryContext. There are no client-side network timeouts. If for whatever reason the TCP FIN never makes it to the client the driver is at the mercy of the OS to signal the socket as disconnected. This takes 15+ minutes on my ubuntu box. I am currently using the workaround with github.com/Kount/pq-timeouts. Are there any reasons the driver does not have client side timeouts implemented (IE: databse/sql)?

banks commented 7 years ago

FYI I worked around this issue immediately by using https://github.com/Kount/pq-timeouts which wraps this driver to implement timeouts better.

chrispassas commented 6 years ago

I'm currently experencing this issue as described by @banks. Using pq-timeouts won't help in my case because I'm using a long lived connection pool and have different timeout values per-query. pq-timeouts is only a solution if all queries can use the same timeout value or you create a new connection per-query.

It seems like a bug that if the database your querying is for whatever reason unable to respond that it will hang forever. The whole point of context is to terminate things that run to long.

In my case of PostgreSQL databases are so loaded that at times they can't respond to the cancel query request so db.QueryContext() hangs forever.

Even if I wrap this in a switch statement with timeout I'll be leaking connections that never close.

mwuertinger commented 6 years ago

Any news on this? Sounds like a serious issue and it's almost a year since it was first reported.

importnil commented 5 years ago

Another year has passed, and I've stumbled upon this issue as well.

ShawnXxy commented 5 years ago

There might be a workaround for this:

Adding another layer of PG Bouncer side car above Golang PG client. And set the following TCP keepalive parameters actively detect broken TCP connection: net.ipv4.tcp_keepalive_time=30 net.ipv4.tcp_keepalive_intvl=5 net.ipv4.tcp_keepalive_probes=3 After this change, the connection can fail fast and reconnect when the current TCP connection is broken.

Updates: Below settings work more effective tcp_keepcnt = 3 tcp_keepidle = 1 tcp_keepintvl = 1

t33m commented 4 years ago

Almost 3 years passed and still no support context.WithTimeout() for query...

vtolstov commented 4 years ago

may be https://github.com/jackc/pgx does not have such bug?

narcis96 commented 1 year ago

Hello !

Are there any updated on this issue? We are still facing this critical issue in our system. We would like to know if there is a plan for fixing this issue or we should start using other driver. Any response is more than welcomed.

lib / pq