Open redbaron opened 1 week ago
As you say, detecting broken TCP connections is non-trivial.
I wonder if a write timeout would even work reliably? As I understand it, from Go's point of view, the write is successful when sent (maybe even only in the OS write buffer), not when acked in some manner. I think a write can succeed even when the underlying connection is broken. That's why the pool health check uses a -- ping
statement. AFAIK, nothing lower level is sufficient to guarantee the connection is live.
I wonder if a write timeout would even work reliably?
I believe it would cover at least our case. Error line seems to failing on write
and error itself seems to be from net stack, lowering write timeout should do the trick. Detecting stuck connection in some of the cases is an improvement still.
if going with this approach, we need to be careful with large batch writes, which I suspect do expect to be blocked on write for long.
That's why the pool health check
I think pool healtcheck should use derived context with much shorter timeout regardless of decision made how to proceed with stuck connection detection. Probably that timeout should be configurable, because it differs whether connection is to PG or pgxpool . Our systems use direct connection to PG, so I'd configure it to 1s
.
As for the purpose of this issue, pool health check won't help us much: all connections are in constant reuse and pool health check triggers on 1s of inactivity. If TCP connection becomes "stuck" only small subset of pool connections will be lucky enough to hit it at right moment and right state for pool check to trigger AND to catch issue with connection.
Is your feature request related to a problem? Please describe.
When we failover PostgreSQL server to another primary, old connections can remain stuck: for whatever reason they are not closed and attempt to use them has end up with timeout:
we see these entries minutes after failover with all connections in connection pool
Describe the solution you'd like
It seems that detecting hung TCP connection is not trivial. TCP keepalive won't help, because with outstanding writes connection is marked as active and no keepalives are set.
One way I can think of is if TCP IO timeout was somehow decoupled from Query context timeout. net.Conn has
SetWriteDeadline(t time.Time) error
method which on the surface should do the trick: configurable short timeout of 1-3 seconds will be enough to detect IO error and take connection out of the pool still allowing Query context have arbitrary timeout.Describe alternatives you've considered
Have custom Dialer which sets
TCP_USER_TIMEOUT
(not unliketcp_user_timeout
connection string param in libpq) with raw sys calls, but it is not portable.Another options is pgxpool connection health check. It seems to check connection health on
Acquire
, which underneath executes-- ping
statement, but context passed for that check is fromAcquire
call which in turn itself is from pgxpool.Query so it suffers from the same problem of stuck connection: context timeout is tuned for query execution time overall. If health check was executed with a smaller context timeout derived from parent context then it would detect problematic connections more reliably. This is not a preferred option because it doesn't handle stuck connection detection for already Acquired connections.