joe-at-startupmedia / happac

Ensures that both Patroni and PgBouncer instances are alive for HAProxy

unresponsive agent checks fallback to default tcp check #1

Closed · joe-at-startupmedia closed 9 months ago

joe-at-startupmedia commented 10 months ago

As the title says, if the agent check daemon becomes unresponsive, haproxy will fall back to performing a tcp check against the server address itself (not the agent-addr specified). In the case of happac this becomes critically problematic: a failover during agent-check downtime would result in haproxy not reporting the proper master replica, so write requests would be routed to standby replicas. While patroni wouldn't allow split-brain writes, those write requests would be lost.

Per the haproxy documentation: https://cbonte.github.io/haproxy-dconv/1.6/configuration.html#5.2-agent-check

> Failure to connect to the agent is not considered an error as connectivity is tested by the regular health check which is enabled by the "check" parameter.
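
For reference, this is roughly the shape of a server line where that fallback occurs. A minimal sketch, not taken from this repository's actual configuration: the addresses and the agent port 5555 are placeholders, while 6432 is the pgbouncer port referenced in the scenarios below.

```
backend master_pgbouncer
    # "check" enables the regular tcp health check against port 6432 itself;
    # "agent-check" adds the secondary check against the happac daemon.
    # If the agent at agent-addr:agent-port stops answering, HAProxy keeps
    # deciding the server's state from the tcp check alone, per the quote above.
    server pg1 10.0.0.1:6432 check agent-check agent-addr 10.0.0.1 agent-port 5555 agent-inter 2s
```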

Replication:

This is based on a configuration matching Scenario C, which uses the acl use_backend conditional

Scenario 1

  1. turn off happac on the master node
  2. turn off patroni on the same master node
  3. a failover will occur, but haproxy will now report two available primaries because:
    1. happac is running on the new primary and reports it as available
    2. happac isn't running on the old primary, so the check reverts to a tcp check of port 6432, which returns an up response
  4. The issue of duplicate primaries is resolved once the happac service resumes on the old primary.

Scenario 2 (unlikely)

  1. turn off happac on all 3 nodes (master and 2 standbys)
  2. turn off patroni on the master node
  3. a failover will occur, but the primary_pgbouncer backend will still report the old primary, while the secondary primary_patroni backend will report the correct server. This is because:
    1. happac isn't running on the old primary, so the check reverts to a tcp check of port 6432, which returns an up response
    2. happac isn't running on the new primary, so the check reverts to a tcp check of port 6432, which returns an up response
  4. The issue of the old primary being reported as the current primary is resolved once the happac processes resume on all servers.

Practicality

While Scenario 2 is so unlikely as to be almost impossible, Scenario 1 is more plausible because it only requires the happac daemon to go down, followed by a failover occurring while the pgbouncer process is still alive. Issues like resource failures do have the potential to make both happac and patroni unavailable, but pgbouncer would also become unavailable in such a predicament (pgbouncer being available is a requirement for reproducing Scenario 1). Additionally, in a heavily monitored environment where system administrators are notified of happac daemon outages, the likelihood of an outage lingering until a patroni failover occurs is also slim.

Solution

In spite of the unlikeliness of the scenarios above, the solution to all of this is to use pgsql-check as the health check, which allows a username parameter. The significance of this is that we can utilize usernames via pg_hba configurations to determine whether or not the server is truly in master or standby mode. The additional advantage is that it forgoes agent checks altogether, omitting the need for happac (this repository). More details about this here: https://www.percona.com/blog/configure-haproxy-with-postgresql-using-built-in-pgsql-check/
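
As a rough sketch of that approach (the backend name, addresses, and the haproxy_check user are placeholders, and the pg_hba side is as described in the linked Percona post, not reproduced here), the haproxy side would look something like:

```
backend master_pgsql
    # pgsql-check sends a PostgreSQL startup packet for the given user;
    # with pg_hba configured so this user is only accepted on the primary
    # (see the linked Percona post), only the true master passes the check.
    option pgsql-check user haproxy_check
    server pg1 10.0.0.1:5432 check
    server pg2 10.0.0.2:5432 check
    server pg3 10.0.0.3:5432 check
```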

joe-at-startupmedia commented 10 months ago

We can solve this by using 2 separate acl requirements:

  1. Fall back from the master frontend to the master_patroni backend if there is not exactly 1 master_pgbouncer server up. This accounts for scenarios where HAProxy reports multiple primaries (master_pgbouncers) due to the scenarios explained above, and for other unforeseen scenarios that would produce the same result.
  2. Fall back from the master frontend to the master_patroni backend if there are not exactly 3 happac backend servers up. For this we create another backend for our happac service which monitors the process running on all 3 servers. When one of them goes down, the acl requirement is not met, forcing master_patroni backend promotion (see the sketch after this list).
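
Roughly, both requirements can be expressed with nbsrv, which counts the usable servers in a backend. This is a sketch only: the bind port, addresses, and the happac_agents port 9999 are placeholders, not the actual configuration.

```
frontend master
    bind *:5000
    # exactly one server in master_pgbouncer may be up...
    acl one_primary nbsrv(master_pgbouncer) eq 1
    # ...and the happac daemon must be reachable on all 3 nodes
    acl all_happac_up nbsrv(happac_agents) eq 3
    # both acls must hold; otherwise fall back to master_patroni
    use_backend master_pgbouncer if one_primary all_happac_up
    default_backend master_patroni

backend happac_agents
    # plain tcp checks against the port the happac daemon listens on
    server pg1 10.0.0.1:9999 check
    server pg2 10.0.0.2:9999 check
    server pg3 10.0.0.3:9999 check
```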

Both of these have been implemented and tested with success.