hapostgres / pg_auto_failover

Postgres extension and service for automated failover and high-availability

question: Multiple Standbys Architecture with 3 standby nodes, one async #966

Closed bagafoot closed 1 year ago

bagafoot commented 1 year ago

https://pg-auto-failover.readthedocs.io/en/main/_images/arch-three-standby-one-async.svg

This architecture would fit a situation where nodes A, B, and C are deployed in the same data center or availability zone, and node D in another.

What if I lose the first data center and every node except node D becomes unreachable? How can I promote node D if the monitor is unreachable? Is it possible to set up a second monitor in a different data center? And would the cluster be able to recover after such a disaster?

gazinur87 commented 1 year ago

up

DimCitus commented 1 year ago

In that case I believe the first thing is to decide if the situation is going to be the new default, or if the former main datacenter is going to be back soon.

One approach that's possible to implement, if you decide that the datacenter where node D is found is now the main one:

  1. provision a new monitor node
  2. replace the monitor for node D to point to the new monitor
  3. promote node D as usual

See https://pg-auto-failover.readthedocs.io/en/main/operations.html#replacing-the-monitor-online for step 2.
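The node-side part of these steps can be sketched as a small dry-run script. This is only an illustration under stated assumptions, not a tested recovery procedure: it assumes the new monitor from step 1 has already been provisioned, that `pg_autoctl disable monitor --force`, `pg_autoctl enable monitor` and `pg_autoctl perform promotion` are run on node D, and that the monitor URI and data directory shown are placeholders. The helper names `recovery_plan` and `run_plan` are hypothetical. Whether the new monitor accepts the registration depends on the state node D is in at that point.

```python
import shlex
import subprocess


def recovery_plan(pgdata, new_monitor_uri):
    """Return the commands for steps 2 and 3, in order, as argument lists.

    Assumption: step 1 (provisioning the new monitor) is already done and
    new_monitor_uri points at it.
    """
    return [
        # Step 2a: detach node D from the old, unreachable monitor.
        ["pg_autoctl", "disable", "monitor", "--force", "--pgdata", pgdata],
        # Step 2b: register node D with the new monitor.
        ["pg_autoctl", "enable", "monitor", new_monitor_uri, "--pgdata", pgdata],
        # Step 3: ask for node D to be promoted.
        ["pg_autoctl", "perform", "promotion", "--pgdata", pgdata],
    ]


def run_plan(plan, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in plan:
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops at the first failing command.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder values: adjust to the real data directory and monitor host.
    plan = recovery_plan(
        "/var/lib/pgsql/13/data",
        "postgres://autoctl_node@new-monitor.example:5432/pg_auto_failover",
    )
    run_plan(plan, dry_run=True)
```

Keeping the default at `dry_run=True` lets the commands be reviewed before anything touches the node; the script stops at the first failure when actually executing, so a rejected registration in step 2 does not lead to a promotion attempt.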

Closing for triage; of course, consider re-opening if needed.

bagafoot commented 1 year ago

@DimCitus thanks for the reply. I tried this, but I am getting this error:

  pg_autoctl enable monitor postgres://autoctl_node@PAF06S04:5432/pg_auto_failover --pgdata /var/lib/pgsql/13/data
  15:35:55 1712634 ERROR Monitor ERROR: node PAF06S03:5433 can not be registered in state wait_standby, it should be in state single
  15:35:55 1712634 ERROR SQL query: SELECT * FROM pgautofailover.register_node($1, $2, $3, $4, $5, $6, $7, $8, $9::pgautofailover.replication_state, $10, $11, $12, $13)
  15:35:55 1712634 ERROR SQL params: 'default', 'PAF06S03', '5433', 'postgres', 'node_3', '7170990576855375595', '3', '0', 'wait_standby', 'standalone', '50', 'true', 'default'
  15:35:55 1712634 ERROR Failed to register node PAF06S03:5433 in group 0 of formation "default" with initial state "wait_standby", see previous lines for details
  15:35:55 1712634 ERROR Failed to register to the monitor

This makes no sense when the main data center is down, including the monitor and the single (master) node. We need a way to promote the async Postgres node in the second data center in case of a catastrophe where we lose the first, main data center. It would be very nice if you added this feature, thanks.