rheaton opened this issue 2 years ago
So the reason is that pg_auto_failover doesn't want to do an automatic failover unless it can guarantee no loss of reported committed data. With your settings it is not able to guarantee that.
Specifically, you are using number_sync_standbys = 1 while having all nodes be part of the replication quorum. So all pg_auto_failover knows for sure is that, once the primary failed, one of the 5 standbys has the most recent changes; it does not know which one (the LSN reported to the monitor might be out of date). That's why it says it's still waiting for 2 nodes to report.
If you increase number_sync_standbys to 3, then the nodes should be able to get out of report_lsn: at that point, even with 3 nodes lost, the monitor knows that a 4th node (number_sync_standbys + 1) received the last update. So it can safely promote the most advanced of the 3 nodes that are still up.
However, I just tried this locally to confirm, and it doesn't fail over with number_sync_standbys = 3 either. So in addition to your settings preventing failover, there also seems to be a bug.
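A minimal sketch of applying that suggestion with pg_autoctl, assuming the default formation and an illustrative PGDATA path (and keeping in mind that, per the above, this alone did not unblock things here):

```shell
# Illustrative values only; run where pg_autoctl manages the node.
export PGDATA=/var/lib/pgsql/14/data

# With number_sync_standbys = 3, every reported commit is on the primary
# plus 3 standbys, so losing any 3 nodes still leaves one standby that is
# known to hold the latest reported-committed data.
pg_autoctl set formation number-sync-standbys 3

# Check what the monitor now reports.
pg_autoctl show settings
```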
@JelteF Yeah, we also tried the settings you recommended, i.e. setting number-sync-standbys to 3 with all the standbys participating in the replication quorum, but it didn't work.
`pg_autoctl show settings` output:
```
  Context |        Name |                   Setting | Value
----------+-------------+---------------------------+------
formation |     default |     number_sync_standbys | 3
  primary | hadr-node-b | synchronous_standby_names | ''
     node | hadr-node-a |        candidate priority | 50
     node | hadr-node-b |        candidate priority | 50
     node | hadr-node-c |        candidate priority | 50
     node | hadr-node-d |        candidate priority | 10
     node | hadr-node-e |        candidate priority | 10
     node | hadr-node-f |        candidate priority | 0
     node | hadr-node-a |        replication quorum | true
     node | hadr-node-b |        replication quorum | true
     node | hadr-node-c |        replication quorum | true
     node | hadr-node-d |        replication quorum | true
     node | hadr-node-e |        replication quorum | true
     node | hadr-node-f |        replication quorum | true
```
`pg_autoctl watch` output:
```
Formation: default - Sync Standbys: 3                                                               19:31:28

Name         Node  Quorum  Priority  TLI: LSN      Check  Connection    Report  Reported State  Assigned State
hadr-node-a     1     yes        50  2: 0/B0043B8    18s  read-only !    3m27s  secondary       secondary
hadr-node-b     2     yes        50  2: 0/B0043B8    18s  read-write !   3m26s  primary         draining
hadr-node-c     3     yes        50  2: 0/B0043B8    18s  read-only !    3m26s  secondary       report_lsn
hadr-node-d     4     yes        10  2: 0/B005E70    18s  read-only         1s  report_lsn      report_lsn
hadr-node-e     5     yes        10  2: 0/B005E70    18s  read-only         2s  report_lsn      report_lsn
hadr-node-f     6     yes         0  2: 0/B005E70    18s  read-only         1s  report_lsn      report_lsn
```
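For context on the per-node values in the settings table above (an illustration, not output from this cluster), they are the kind of thing one normally configures with commands along these lines, run on each node; the PGDATA path is illustrative:

```shell
# Illustrative only; run on the node whose settings you want to change.
export PGDATA=/var/lib/pgsql/14/data

pg_autoctl set node candidate-priority 50   # 0 = never promoted automatically
pg_autoctl set node replication-quorum true # counts toward the sync rep quorum
```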
Also, do you have any thoughts on the multi-site disaster recovery plan using pg_auto_failover that we describe at the end of the issue description?
@swati-nair & @rheaton
Hi,
I will have to take a deep look at the issue here when I'm back from vacation in a couple of weeks. I already think it's a bug, though, so if you can dive into it and beat me to a fix, please consider doing so.
About the larger idea of running multiple regions, the plan I have in mind consists of introducing the notion of a “region” in the pgautofailover.node table on the monitor. The primary node would then, by definition, live in the primary region. We could then maintain a local “leader” in each region and implement cascading replication for the other nodes in the same region, where the leader is the upstream for the other nodes. The region leader would be a dynamic role again; it may change at any time, so I suppose this would need to be another state in our FSMs.
With that idea, here is a first list of questions that should be clarified for the design stage, or during the development. If you have opinions, please share them, so that we can discuss the best way forward!
```sql
create table pgautofailover.region(id bigserial primary key, name text, upstream bigint references pgautofailover.region(id), position point);
```
(maybe with the position being longitude, latitude so that we can imagine production maps someday)

When implementing multi-region support (or cascading replication) that way, the question will arise of where to put the monitor. We should have proper architecture documentation around that decision. I like the idea of a “primary region” and preventing auto-failover from changing regions; it makes it obvious that you want your monitor there, in the “primary region”. The case where a user has 3 regions and deploys one Postgres node per region is not that easily answered, of course.
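To make the node/region link above a little more concrete, here is a purely hypothetical follow-up sketch, not an existing schema change (the regionid column is invented), applied on the monitor's pg_auto_failover database:

```shell
# Hypothetical illustration only: ties each monitor-side node entry to a
# region, following the region table sketched above. Not shipped by
# pg_auto_failover; the column name "regionid" is invented.
psql --dbname pg_auto_failover <<'SQL'
alter table pgautofailover.node
  add column regionid bigint references pgautofailover.region(id);

-- nodes that have not been assigned to a region yet
select * from pgautofailover.node where regionid is null;
SQL
```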
I don't think we will get to looking at a fix for this until January (we will be on vacation starting Thursday/Friday).
These ideas for multi-region support are exciting! Going through the list:
I agree on the monitor question, and in the end it would be good if attaching a running cluster to a new monitor were simpler than it is today. That would make this decision less important, since one could spin up a new monitor in the case of a disaster. We've been experimenting with this and it's non-trivial to attach a new monitor to existing pgaf nodes (see https://github.com/citusdata/pg_auto_failover/issues/14). If, for example, our main datacenter went down and we wanted a monitor in our disaster recovery location, how would we do this if the monitor was in the main DC?

I'm wondering if we should start by making this all work for two regions first, and then see how complicated going beyond that would be. Re-attaching the nodes in the main datacenter once it comes back up also requires more hands-on changes than feels optimal (e.g. hba edits, hidden cache directory removal, and maybe some other things we haven't figured out).
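As a rough illustration of one piece involved in repointing a node at a replacement monitor (the hostname, URI, and path below are made up, and as noted above this alone is not sufficient):

```shell
# Illustrative only: shows the documented config interface, not a complete
# or verified "attach to a new monitor" procedure (pg_hba.conf edits and
# cached state also need attention, as described above).
export PGDATA=/var/lib/pgsql/14/data

# What the node currently believes its monitor is.
pg_autoctl config get pg_autoctl.monitor

# Point it at a replacement monitor (URI is illustrative).
pg_autoctl config set pg_autoctl.monitor \
    'postgres://autoctl_node@new-monitor.example.com:5432/pg_auto_failover'
```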
Also, Happy Holidays! 🎄
Hello, we have been testing a variety of disaster recovery scenarios using pg_auto_failover.

If we have a 6-node setup and we destroy 3 nodes, including the primary, it takes manual intervention to get out of the `report_lsn` state and promote a primary from the remaining healthy nodes. This was surprising to us. Is there a reason that we don't have a timeout for this scenario?
Some output from our scenario follows, for your perusal.

Postgres version: 14.1, pg_auto_failover version: 1.6.3, OS: CentOS 7
After cutting network access to/from node_1, node_2, and node_3, we see the following:
After waiting a long time (10+ min), we still see:
When we dropped node_2 and node_3 on the monitor, we were able to recover:
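(For readers following along, the drop step referred to here would look something like the following, run from the monitor node; hostnames and ports are illustrative, not taken from this report.)

```shell
# Illustrative sketch: remove the unreachable nodes from the monitor so the
# surviving nodes can finish electing a new primary. Hostnames/ports are made up.
pg_autoctl drop node --hostname node2.example.com --pgport 5432
pg_autoctl drop node --hostname node3.example.com --pgport 5432

# Then watch the remaining nodes converge.
pg_autoctl show state
```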
Final state, after a short bit of waiting (from `pg_autoctl watch` on the monitor):

Other interesting state: node_1 properly demoted itself (for our testing purposes, we left it up, but had cut ingress/egress network traffic).
So you understand why we are working on these types of tests: we are attempting to find a good multi-site disaster recovery plan. Ideally, we could have asynchronous replication across datacenters, and then cascading replication within each DC to save on network costs and bandwidth. Having a monitor node live outside of those two datacenters is one possibility, but cascading asynchronous replication from a secondary is not currently a pg_auto_failover feature (as far as we have seen). We were also playing with the idea of having two monitors, where one does not require a 'primary' but keeps one of its nodes following a primary outside its management (call it a 'secondary-leader' or 'standby-leader'). The idea is that you could "promote" this entire pg_auto_failover cluster in the case of a disaster at the primary's site, and the second site would make sure one of its nodes is always following the primary (in case its secondary-leader fails).

We'd love to hear your input on this scenario and these ideas, as well as the issue at hand.
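To illustrate the cascading idea above with plain Postgres (outside pg_auto_failover's management, since as noted it is not a pg_auto_failover feature), a downstream standby can follow a local leader standby by pointing primary_conninfo at it; the hostnames and replication role below are invented:

```shell
# Hypothetical sketch of plain-Postgres cascading replication within a DC:
# run on a downstream standby so it streams from the local "leader" standby
# instead of the remote primary. Names are illustrative.
psql --dbname postgres <<'SQL'
ALTER SYSTEM SET primary_conninfo =
  'host=leader-standby.dc2.example.com port=5432 user=replicator application_name=dc2_follower';
SQL

# primary_conninfo is reloadable on Postgres 13+ (14.1 here).
pg_ctl reload -D /var/lib/pgsql/14/data
```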
@rheaton & @swati-nair