Detect stuck TCP connections

redbaron commented 1 week ago

Is your feature request related to a problem? Please describe.

When we failover PostgreSQL server to another primary, old connections can remain stuck: for whatever reason they are not closed and attempt to use them has end up with timeout:

write failed: write tcp 10.32.14.13:36102->10.36.54.61:5432: i/o timeout

we see these entries minutes after failover with all connections in connection pool

Describe the solution you'd like

It seems that detecting hung TCP connection is not trivial. TCP keepalive won't help, because with outstanding writes connection is marked as active and no keepalives are set.

One way I can think of is if TCP IO timeout was somehow decoupled from Query context timeout. net.Conn has SetWriteDeadline(t time.Time) error method which on the surface should do the trick: configurable short timeout of 1-3 seconds will be enough to detect IO error and take connection out of the pool still allowing Query context have arbitrary timeout.

Describe alternatives you've considered

Have custom Dialer which sets TCP_USER_TIMEOUT (not unlike tcp_user_timeout connection string param in libpq) with raw sys calls, but it is not portable.

Another options is pgxpool connection health check. It seems to check connection health on Acquire, which underneath executes -- ping statement, but context passed for that check is from Acquire call which in turn itself is from pgxpool.Query so it suffers from the same problem of stuck connection: context timeout is tuned for query execution time overall. If health check was executed with a smaller context timeout derived from parent context then it would detect problematic connections more reliably. This is not a preferred option because it doesn't handle stuck connection detection for already Acquired connections.

jackc commented 5 days ago

As you say, detecting broken TCP connections is non-trivial.

I wonder if a write timeout would even work reliably? As I understand it, from Go's point of view, the write is successful when sent (maybe even only in the OS write buffer), not when acked in some manner. I think a write can succeed even when the underlying connection is broken. That's why the pool health check uses a -- ping statement. AFAIK, nothing lower level is sufficient to guarantee the connection is live.

redbaron commented 4 days ago

I wonder if a write timeout would even work reliably?

I believe it would cover at least our case. Error line seems to failing on write and error itself seems to be from net stack, lowering write timeout should do the trick. Detecting stuck connection in some of the cases is an improvement still.

if going with this approach, we need to be careful with large batch writes, which I suspect do expect to be blocked on write for long.

That's why the pool health check

I think pool healtcheck should use derived context with much shorter timeout regardless of decision made how to proceed with stuck connection detection. Probably that timeout should be configurable, because it differs whether connection is to PG or pgxpool . Our systems use direct connection to PG, so I'd configure it to 1s.

As for the purpose of this issue, pool health check won't help us much: all connections are in constant reuse and pool health check triggers on 1s of inactivity. If TCP connection becomes "stuck" only small subset of pool connections will be lucky enough to hit it at right moment and right state for pool check to trigger AND to catch issue with connection.

jackc / pgx

Detect stuck TCP connections #2163