joe-at-startupmedia / happac

Ensures that both Patroni and PgBouncer instances are alive for HAProxy

unresponsive agent checks fallback to default tcp check #1

Closed · joe-at-startupmedia closed 9 months ago

joe-at-startupmedia commented 10 months ago

As the title says, if the agent check daemon becomes unresponsive, haproxy will fall back to performing a tcp check against the server address itself (not the agent-addr specified). In the case of happac this becomes critically problematic: a failover during agent-check downtime would result in haproxy not reporting the proper master replica, so write requests would be routed to standby replicas. While patroni wouldn't allow split-brain writes, those write requests would be lost.

Per the haproxy documentation: https://cbonte.github.io/haproxy-dconv/1.6/configuration.html#5.2-agent-check

> Failure to connect to the agent is not considered an error as connectivity is tested by the regular health check which is enabled by the "check" parameter.
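
For reference, this is roughly the shape of a server line where that fallback occurs. A minimal sketch, not taken from this repository's actual configuration: the addresses and the agent port 5555 are placeholders, while 6432 is the pgbouncer port referenced in the scenarios below.

```
backend master_pgbouncer
    # "check" enables the regular tcp health check against port 6432 itself;
    # "agent-check" adds the secondary check against the happac daemon.
    # If the agent at agent-addr:agent-port stops answering, HAProxy keeps
    # deciding the server's state from the tcp check alone, per the quote above.
    server pg1 10.0.0.1:6432 check agent-check agent-addr 10.0.0.1 agent-port 5555 agent-inter 2s
```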

Replication:

This is based on a configuration matching Scenario C, which uses the acl use_backend conditional

Scenario 1

  1. turn off happac on the master node
  2. turn off patroni on the same master node
  3. a failover will occur, but haproxy will now report two available primaries because:
    1. happac is running on the new primary and reports it as available
    2. happac isn't running on the old primary, so the check reverts to a tcp check of port 6432, which returns an up response
  4. The issue of duplicate primaries is resolved once the happac service resumes on the old primary.

Scenario 2 (unlikely)

  1. turn off happac on all 3 nodes (master and 2 standbys)
  2. turn off patroni on the master node
  3. a failover will occur, but the primary_pgbouncer backend will still report the old primary, while the secondary primary_patroni backend will report the correct server. This is because:
    1. happac isn't running on the old primary, so the check reverts to a tcp check of port 6432, which returns an up response
    2. happac isn't running on the new primary, so the check reverts to a tcp check of port 6432, which returns an up response
  4. The issue of the old primary being reported as the current primary is resolved once the happac processes resume on all servers.

Practicality

While Scenario 2 is so unlikely as to be almost impossible, Scenario 1 is more plausible because it only requires the happac daemon to go down, followed by a failover occurring while the pgbouncer process is still alive. Issues like resource failures do have the potential to make both happac and patroni unavailable, but pgbouncer would also become unavailable in such a predicament (pgbouncer being available is a requirement for reproducing Scenario 1). Additionally, in a heavily monitored environment where system administrators are notified of happac daemon outages, the likelihood of an outage lingering until a patroni failover occurs is also slim.

Solution

In spite of the unlikeliness of the scenarios above, the solution to all of this is to use pgsql-check as the health check, which allows a username parameter. The significance of this is that we can utilize usernames via pg_hba configurations to determine whether or not the server is truly in master or standby mode. The additional advantage is that it forgoes agent checks altogether, omitting the need for happac (this repository). More details about this here: https://www.percona.com/blog/configure-haproxy-with-postgresql-using-built-in-pgsql-check/
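
As a rough sketch of that approach (the backend name, addresses, and the haproxy_check user are placeholders, and the pg_hba side is as described in the linked Percona post, not reproduced here), the haproxy side would look something like:

```
backend master_pgsql
    # pgsql-check sends a PostgreSQL startup packet for the given user;
    # with pg_hba configured so this user is only accepted on the primary
    # (see the linked Percona post), only the true master passes the check.
    option pgsql-check user haproxy_check
    server pg1 10.0.0.1:5432 check
    server pg2 10.0.0.2:5432 check
    server pg3 10.0.0.3:5432 check
```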

joe-at-startupmedia commented 10 months ago

We can solve this by using 2 separate acl requirements:

  1. Fall back from the master frontend to the master_patroni backend if there is not exactly 1 master_pgbouncer server up. This accounts for scenarios where HAProxy reports multiple primaries (master_pgbouncers) due to the scenarios explained above, and for other unforeseen scenarios that would produce the same result.
  2. Fall back from the master frontend to the master_patroni backend if there are not exactly 3 happac backend servers up. For this we create another backend for our happac service which monitors the process running on all 3 servers. When one of them goes down, the acl requirement is not met, forcing master_patroni backend promotion (see the sketch after this list).
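
Roughly, both requirements can be expressed with nbsrv, which counts the usable servers in a backend. This is a sketch only: the bind port, addresses, and the happac_agents port 9999 are placeholders, not the actual configuration.

```
frontend master
    bind *:5000
    # exactly one server in master_pgbouncer may be up...
    acl one_primary nbsrv(master_pgbouncer) eq 1
    # ...and the happac daemon must be reachable on all 3 nodes
    acl all_happac_up nbsrv(happac_agents) eq 3
    # both acls must hold; otherwise fall back to master_patroni
    use_backend master_pgbouncer if one_primary all_happac_up
    default_backend master_patroni

backend happac_agents
    # plain tcp checks against the port the happac daemon listens on
    server pg1 10.0.0.1:9999 check
    server pg2 10.0.0.2:9999 check
    server pg3 10.0.0.3:9999 check
```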

Both of these have been implemented and tested with success.