hapostgres / pg_auto_failover

Postgres extension and service for automated failover and high-availability

Multi-node failure requires manual intervention #858

Open · rheaton opened this issue 2 years ago

rheaton commented 2 years ago

Hello, we have been testing a variety of disaster-recovery scenarios using pg_auto_failover.

If we have a 6-node setup and we destroy 3 nodes, including the primary, it takes manual intervention to get out of the report_lsn state and promote a new primary from the remaining healthy nodes. This was surprising to us.

Is there a reason that we don't have a timeout for this scenario?

Some output from our scenario follows, for your perusal.

Postgres version: 14.1; pg_auto_failover version: 1.6.3; OS: CentOS 7

After cutting network access to/from node_1, node_2, and node_3, we see the following:

-bash-4.2$ pg_autoctl show state
  Name |  Node |                               Host:Port |       TLI: LSN |   Connection |      Reported State |      Assigned State
-------+-------+-----------------------------------------+----------------+--------------+---------------------+--------------------
node_1 |     1 | hadr-node-a.c.data-pcf-db.internal:5432 |   1: 0/B002508 | read-write ! |             primary |            draining
node_2 |     2 | hadr-node-b.c.data-pcf-db.internal:5432 |   1: 0/B002508 |  read-only ! |           secondary |           secondary
node_3 |     3 | hadr-node-c.c.data-pcf-db.internal:5432 |   1: 0/B002508 |  read-only ! |           secondary |           secondary
node_4 |     4 | hadr-node-d.c.data-pcf-db.internal:5432 |   1: 0/B002580 |    read-only |          report_lsn |          report_lsn
node_5 |     5 | hadr-node-e.c.data-pcf-db.internal:5432 |   1: 0/B002580 |    read-only |          report_lsn |          report_lsn
node_6 |     6 | hadr-node-f.c.data-pcf-db.internal:5432 |   1: 0/B002580 |    read-only |          report_lsn |          report_lsn

and

-bash-4.2$ pg_autoctl show settings
  Context |    Name |                   Setting | Value
----------+---------+---------------------------+------
formation | default |      number_sync_standbys | 1
  primary |  node_1 | synchronous_standby_names | ''
     node |  node_1 |        candidate priority | 50
     node |  node_2 |        candidate priority | 50
     node |  node_3 |        candidate priority | 50
     node |  node_4 |        candidate priority | 10
     node |  node_5 |        candidate priority | 10
     node |  node_6 |        candidate priority | 0
     node |  node_1 |        replication quorum | true
     node |  node_2 |        replication quorum | true
     node |  node_3 |        replication quorum | true
     node |  node_4 |        replication quorum | true
     node |  node_5 |        replication quorum | true
     node |  node_6 |        replication quorum | true
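
(For reference, a configuration like this is typically put in place with the pg_autoctl set commands; the lines below are illustrative only, to be run against the relevant node, and exact options can vary with the pg_autoctl version.)

pg_autoctl set formation number-sync-standbys 1
pg_autoctl set node candidate-priority 10     # on node_4 and node_5
pg_autoctl set node candidate-priority 0      # on node_6
pg_autoctl set node replication-quorum true   # the default for every node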

After waiting a long time (10+ min), we still see:

00:27:36 2066 INFO  Failover still in progress after 3 nodes reported their LSN and we are waiting for 2 nodes to report, activeNode is node 6 "node_6" (hadr-node-f.c.data-pcf-db.internal:5432) and reported state "report_lsn"

When we dropped node_2 and node_3 on the monitor, we were able to recover:

pg_autoctl drop node node_2
pg_autoctl drop node node_3

Final State, after a short bit of waiting (from pg_autoctl watch on the monitor):

  Name  Node Quorum Priority       TLI: LSN   Check   Connection  Report      Reported State      Assigned State
node_1     1    yes       50   1: 0/B002508     17s read-write !  32m27s             primary             demoted
node_2     2    yes       50   1: 0/B002508     17s  read-only !  32m28s           secondary             dropped
node_3     3    yes       50   1: 0/B002508     17s  read-only !  32m28s           secondary             dropped
node_4     4    yes       10   2: 0/B002830     17s   read-write      1s             primary             primary
node_5     5    yes       10   2: 0/B002830     17s    read-only      1s           secondary           secondary
node_6     6    yes        0   2: 0/B002830     17s    read-only      1s           secondary           secondary

Other interesting state: node_1 properly demoted itself (for our testing purposes, we left it up, but had cut ingress/egress network traffic).


So that you understand why we are working on these types of tests: we are attempting to find a good multi-site disaster recovery plan. Ideally, we could have asynchronous replication across data centers, and then cascading replication within each DC to save on network costs and bandwidth. Having the monitor node live outside of those two datacenters is one possibility, but cascading asynchronous replication from a secondary is not currently a pg_auto_failover feature (as far as we have seen).

We were also playing with the idea of having two monitors, where one does not require a 'primary' but keeps one of its nodes following a primary outside its management (call it a 'secondary-leader' or 'standby-leader'). The idea is that you could "promote" this entire pg_auto_failover cluster in the case of a disaster at the primary's site, and in the meantime the second site would make sure one of its nodes is always following the primary (in case its secondary-leader fails).

We'd love to hear your input on this scenario, and these ideas, as well as the issue at hand.

@rheaton & @swati-nair

JelteF commented 2 years ago

So the reason is that pg_auto_failover doesn't want to do an automatic failover unless it can guarantee no loss of data that was reported as committed. With your settings it is not able to guarantee that.

The reason is that you are using number_sync_standbys = 1 while having all nodes be part of the replication quorum. So all pg_auto_failover knows for sure is that, once the primary failed, at least one of the 5 standbys had the most recent changes. But it does not know which one that is (the LSN reported to the monitor might be out of date). That's why it says it is still waiting for 2 nodes to report: until every standby has reported its current LSN, the most advanced copy might still be on one of the unreachable nodes.

If you increase number_sync_standbys to 3, then the nodes should be able to get out of report_lsn. At that point, even with 3 nodes lost, the monitor knows that every commit was acknowledged by number_sync_standbys + 1 = 4 nodes (the primary plus 3 standbys), so at least one of the surviving nodes received the last update. It can then safely promote the most advanced of the 3 nodes that are up.
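
For reference, that change is made with the pg_autoctl set command, something like this (run against the formation, which here is the default one):

pg_autoctl set formation number-sync-standbys 3
pg_autoctl show settings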

However, I just tried this locally to confirm, and it doesn't fail over with number_sync_standbys=3 either. So in addition to your settings preventing failover, there also seems to be a bug.

swati-nair commented 2 years ago

@JelteF Yeah, we also tried the settings that you recommended, i.e. setting number-sync-standbys to 3 with all the standbys participating in the replication quorum, but it didn't work.

pg_autoctl show settings
  Context |        Name |                   Setting | Value
----------+-------------+---------------------------+------
formation |     default |      number_sync_standbys | 3
  primary | hadr-node-b | synchronous_standby_names | ''
     node | hadr-node-a |        candidate priority | 50
     node | hadr-node-b |        candidate priority | 50
     node | hadr-node-c |        candidate priority | 50
     node | hadr-node-d |        candidate priority | 10
     node | hadr-node-e |        candidate priority | 10
     node | hadr-node-f |        candidate priority | 0
     node | hadr-node-a |        replication quorum | true
     node | hadr-node-b |        replication quorum | true
     node | hadr-node-c |        replication quorum | true
     node | hadr-node-d |        replication quorum | true
     node | hadr-node-e |        replication quorum | true
     node | hadr-node-f |        replication quorum | true

pg_autoctl watch output

Formation: default - Sync Standbys: 3                                                                                        19:31:28

       Name  Node Quorum Priority       TLI: LSN   Check   Connection  Report      Reported State      Assigned State
hadr-node-a     1    yes       50   2: 0/B0043B8     18s  read-only !   3m27s           secondary           secondary
hadr-node-b     2    yes       50   2: 0/B0043B8     18s read-write !   3m26s             primary            draining
hadr-node-c     3    yes       50   2: 0/B0043B8     18s  read-only !   3m26s           secondary          report_lsn
hadr-node-d     4    yes       10   2: 0/B005E70     18s    read-only      1s          report_lsn          report_lsn
hadr-node-e     5    yes       10   2: 0/B005E70     18s    read-only      2s          report_lsn          report_lsn
hadr-node-f     6    yes        0   2: 0/B005E70     18s    read-only      1s          report_lsn          report_lsn

Also, do you have any thoughts on the multi-site disaster recovery plan using pg_auto_failover that we described at the end of the issue description?

@swati-nair & @rheaton

DimCitus commented 2 years ago

Hi,

I will have to take a deep look at the issue here when I am back from vacation in a couple of weeks. I already think it's a bug though, so if you can dive into it and beat me to a fix, please consider doing so.


About the larger idea of running multiple regions, the plan I have in mind consists of introducing the notion of a “region” in the pgautofailover.node table on the monitor. The primary node would then by definition live in the primary region. We could then maintain a local “leader” in each region and implement cascading replication for the other nodes in the same region, where the leader is the upstream for the other nodes. The region leader would again be a dynamic role that may change at any time, so I suppose this would need to be another state in our FSMs.

With that idea, here is a first list of questions that should be clarified for the design stage, or during the development. If you have opinions, please share them, so that we can discuss the best way forward!

  1. do we need something like the candidate priority for electing region leaders?
  2. the first use-case that comes to mind is that every node is in a different region (say we have 2 nodes in 2 regions, or 3 nodes in 3 regions, which seems common enough to worry about); so we might want to allow cross-region failover by default, trusting the candidate priority the same as always
  3. is there a use-case for having automated failover always target the “primary region”, and if yes, what kind of tracking and UI do we need to make it happen? More importantly, can this use-case be left aside for a second stage of development?
  4. I would also like to implement “cascading regions”, where any region can be set as the upstream of any other region, allowing full trees to be built, I suppose following geographic boundaries in most cases... we might want to materialise the list of regions on the monitor then (something simple like create table pgautofailover.region(id bigserial, name text, upstream bigint references region(id), position point) maybe, with the position being longitude, latitude so that we can imagine production maps someday); see the SQL sketch after this list
  5. we should disallow nodes that are not directly connected to the primary from having replication-quorum set to true; but is there a case to be made where the user can ask for it to be true when the node is the region leader, and have it set to false if another node is later assigned this role? I suppose we should make that behaviour the default; after having written it out, it seems to make sense... and I'm not sure we'd want another one?
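
Here is a cleaned-up SQL sketch of the region table from point 4, purely illustrative and not a committed design (column names are just a possibility):

create table pgautofailover.region
 (
   id        bigserial primary key,
   name      text not null unique,
   upstream  bigint references pgautofailover.region(id),
   position  point                -- (longitude, latitude)
 );

-- nodes would then point to their region, for instance:
-- alter table pgautofailover.node add column regionid bigint references pgautofailover.region(id);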

When implementing multi-region support (or cascading replication) that way, the question of where to put the monitor will arise. We should have proper architecture documentation around that decision. I like the idea of a “primary region” and of preventing auto-failover from changing regions: it makes it obvious that you want your monitor there, in the “primary region”. The case where a user has 3 regions and deploys one Postgres node per region is not that easily answered, of course.

rheaton commented 2 years ago

I don't think we will get to looking at a fix for this until January (we will be on vacation starting Thursday/Friday).

These ideas for multi-region support are exciting! Going through the list:

  1. I don't have an opinion at this time -- I need to think on it more.
  2. I think permitting cross-region failover is a good idea, although I think the networking between the nodes will get expensive.
  3. I do think there is a good case for keeping failover within a region, if possible -- mostly around speed and networking costs, so I think we should definitely focus on 4. and 5. Having a leader per region that other "local" nodes follow would be optimal: it reduces networking costs and latency, and (if that leader is synchronously replicating) improves transaction speed. Allowing the user to select any node as a primary could be good for really serious disasters, but should probably come with a warning about potential data loss -- depending on the scenario, we let the user decide where their priority lies.

I agree on the monitor question; ideally it would be simpler than it is today to attach a running cluster to a new monitor. That would make this decision less important, since one could spin up a new monitor in the case of a disaster. We've been experimenting with this and it's non-trivial to attach a new monitor to existing pgaf nodes (see https://github.com/citusdata/pg_auto_failover/issues/14). If, for example, our main datacenter went down and we wanted a monitor in our disaster recovery location, how would we do this if the monitor was in the main DC? I'm wondering if we should start with making this all work for two regions first, and then see how complicated going beyond that would be. Re-attaching the nodes in the main datacenter once it comes back up also requires more hands-on changes than feels optimal (e.g. hba edits, hidden cache directory removal, and maybe some other things we haven't figured out).
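
For reference, the basic re-attachment step itself looks roughly like the following on each node. This is only a sketch: it assumes a pg_autoctl release that ships the disable/enable monitor sub-commands, and the monitor URI below is a placeholder; the hba edits and cache cleanup mentioned above are on top of this.

pg_autoctl disable monitor --force
pg_autoctl enable monitor 'postgres://autoctl_node@new-monitor.example.com:5432/pg_auto_failover?sslmode=require'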

Also, Happy Holidays! 🎄