jackc / pgx

PostgreSQL driver and toolkit for Go
MIT License

Connection timeouts interact poorly with multiple targets #2171

Open jwreschnig-utiq opened 1 day ago

jwreschnig-utiq commented 1 day ago

We have been testing pgxpool failover with a primary/standby database setup; one of the scenarios we've tested is:

In this situation, it seems the dial is spending its entire five seconds trying to connect to the primary, and then gives up before trying what would be a successful and preferred connection to the standby. From reading the code I believe this is because the primary is first in the list, not specifically because it's a primary or the preference.

To me this is surprising behavior; I expected something like net.DialContext, where the context's timeout is spread over the possible hosts, or perhaps a parallel race dial that picks the best successful result.

I'm not filing this as a bug, because I am not 100% sure my read of the problem is correct, and because I understand there's value in aligning with libpq behavior, so if libpq also does this, it might be the right default. But I cannot figure out a satisfying way around this:

Some potential fixes I see would be: