We have been testing pgxpool failover with a primary/standby database setup. One of the scenarios we've tested (a minimal sketch follows the list) is:
- a pool configured with `postgres://primary,standby/...&prefer-standby`
- no global connect timeout, because we have a wide range of acceptable times
- instead, we always use context timeouts, e.g. calling `pool.Exec` or `pool.Acquire` with a context with a 5 second timeout
- all network traffic between the client and the primary is blocked
- the standby is healthy
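For concreteness, a minimal sketch of that setup with pgx v5; the host names, database, credentials, and the `target_session_attrs=prefer-standby` spelling of the preference are placeholders for our real configuration:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/jackc/pgx/v5/pgxpool"
)

func main() {
	// Multi-host URL with a standby preference; hosts and database are placeholders.
	connString := "postgres://app@primary:5432,standby:5432/appdb?target_session_attrs=prefer-standby"

	// No pool-level connect timeout is configured; all deadlines come from contexts.
	pool, err := pgxpool.New(context.Background(), connString)
	if err != nil {
		log.Fatal(err)
	}
	defer pool.Close()

	// Each call carries its own 5 second deadline, which also bounds any new
	// connection dial the call triggers.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if _, err := pool.Exec(ctx, "SELECT 1"); err != nil {
		log.Printf("exec failed: %v", err)
	}
}
```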
In this situation, it seems the dial spends its entire five seconds trying to connect to the primary, then gives up before trying what would be a successful and preferred connection to the standby. From reading the code, I believe this happens because the primary is first in the host list, not specifically because it is the primary or because of the standby preference.
To me this is surprising behavior; I expected something like `net.DialContext`, where the context's timeout is spread across the candidate hosts, or perhaps a parallel race that dials every host and picks the best successful result.
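To illustrate what I mean by spreading the timeout, here is a rough standalone sketch of that splitting logic in plain Go (the hosts and error handling are illustrative only, this is not pgx code):

```go
package main

import (
	"context"
	"log"
	"net"
	"time"
)

// dialSpread divides whatever time is left on the context evenly over the
// hosts that have not been tried yet, so one unreachable host cannot consume
// the entire budget. This mirrors what net.DialContext does when a single
// host resolves to multiple addresses.
func dialSpread(ctx context.Context, hosts []string) (net.Conn, error) {
	var lastErr error
	for i, host := range hosts {
		attemptCtx := ctx
		cancel := context.CancelFunc(func() {})
		if deadline, ok := ctx.Deadline(); ok {
			share := time.Until(deadline) / time.Duration(len(hosts)-i)
			attemptCtx, cancel = context.WithTimeout(ctx, share)
		}
		var d net.Dialer
		conn, err := d.DialContext(attemptCtx, "tcp", host)
		cancel() // canceling after a successful dial does not close the connection
		if err == nil {
			return conn, nil
		}
		lastErr = err
	}
	return nil, lastErr
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	// Placeholder hosts; only the timeout-splitting logic matters here.
	conn, err := dialSpread(ctx, []string{"primary:5432", "standby:5432"})
	if err != nil {
		log.Fatal(err)
	}
	conn.Close()
}
```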
I'm not filing this as a bug, because I am not 100% sure my reading of the problem is correct, and because I understand there's value in aligning with libpq behavior; if libpq also does this, it might be the right default. But I cannot figure out a satisfying way around this:
- Any pool-level connect timeout is too low for some cases and too high for others. In my experience it is generally confusing and fragile to mix timeouts at multiple levels (pool and query), and we prefer to handle all timeouts via contexts where possible.
- If I set a `DialFunc` to impose a shorter timeout myself, it still only sees a single address at a time (see the sketch after this list). Even if I leak my original host string into it, some of those hosts might have resolved to multiple addresses, so I don't know how to divide up the dial context's timeout.
- Retrying at the application level is ineffective; every retry gets stuck on the same first server. We would need e.g. random load balancing across hosts for this to work.
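As an example of the `DialFunc` workaround from the second bullet, the sketch below caps every individual dial at a fixed 2 seconds. The 2 second figure is an arbitrary guess, which is exactly the problem: the `DialFunc` only sees one address at a time, so there is no principled way to derive its share of the caller's 5 second budget.

```go
package main

import (
	"context"
	"log"
	"net"
	"time"

	"github.com/jackc/pgx/v5/pgxpool"
)

func main() {
	// Placeholder connection string mirroring the scenario above.
	cfg, err := pgxpool.ParseConfig("postgres://app@primary:5432,standby:5432/appdb?target_session_attrs=prefer-standby")
	if err != nil {
		log.Fatal(err)
	}

	// Cap each individual dial at 2 seconds so one unreachable host cannot eat
	// the whole per-call deadline. The cap is a guess; the DialFunc cannot know
	// how many targets remain after this one.
	cfg.ConnConfig.DialFunc = func(ctx context.Context, network, addr string) (net.Conn, error) {
		dialCtx, cancel := context.WithTimeout(ctx, 2*time.Second)
		defer cancel()
		var d net.Dialer
		return d.DialContext(dialCtx, network, addr)
	}

	pool, err := pgxpool.NewWithConfig(context.Background(), cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer pool.Close()
}
```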
Some potential fixes I see would be:
- spread the timeout transparently, as `net.DialContext` does; this would probably be a breaking change for some people, so it might need an explicit flag to enable
- pass details about the possible connection targets through to the `DialFunc` via context values (ugh), so we could implement our own timeout based on the number of candidate hosts and their preferences (sketched after this list)
- allow passing a second value specifically for connect timeouts with each request, overriding the pool connect timeout (again, this would probably have to travel by context value for API compatibility)
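To make the second option concrete, here is a purely hypothetical sketch: `remainingTargetsKey` does not exist in pgx today and only stands in for whatever context value pgconn might set before each dial. Given that, a custom dial function could split the remaining deadline the same way `net.DialContext` does:

```go
package main

import (
	"context"
	"net"
	"time"
)

// remainingTargetsKey is hypothetical: the idea is that pgconn would attach
// the number of not-yet-tried connection targets to each dial's context.
type remainingTargetsKey struct{}

// dialWithShare gives the current attempt an even share of whatever time is
// left, based on the hypothetical remaining-target count.
func dialWithShare(ctx context.Context, network, addr string) (net.Conn, error) {
	attemptCtx := ctx
	remaining, _ := ctx.Value(remainingTargetsKey{}).(int)
	if deadline, ok := ctx.Deadline(); ok && remaining > 0 {
		share := time.Until(deadline) / time.Duration(remaining)
		var cancel context.CancelFunc
		attemptCtx, cancel = context.WithTimeout(ctx, share)
		defer cancel()
	}
	var d net.Dialer
	return d.DialContext(attemptCtx, network, addr)
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	// Pretend pgconn told us two targets (primary and standby) are left to try.
	ctx = context.WithValue(ctx, remainingTargetsKey{}, 2)
	if conn, err := dialWithShare(ctx, "tcp", "standby:5432"); err == nil {
		conn.Close()
	}
}
```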