hapostgres / pg_auto_failover

Postgres extension and service for automated failover and high-availability

2 nodes + witness = 3 data centers (problem case detected) #997

Open sgrinko opened 1 year ago

sgrinko commented 1 year ago

Hi, Thank you developers for your work!

Now about the problem :)

There are 3 DCs: two data nodes and a witness, each in its own DC.

In the current configuration, we have synchronous replication between nodes.

If the connection between the two nodes breaks while the witness can still reach each node (in the other DCs), synchronous replication is not disabled automatically. As a result, requests hang on the COMMIT command. We cannot afford to wait until enough lag accumulates for the witness to react.

Is it possible to detect and respond to this kind of network failure?

xinferum commented 1 year ago

Good afternoon, developers.

This issue is critical for us, as it currently prevents us from placing the monitor in the third data center.

Perhaps, as an option, the keeper on the primary data node could check whether any replicas are connected to it and, if there are none, report this to the monitor and switch replication from synchronous to asynchronous mode.

dimitri commented 9 months ago

It should be possible to see a missing row in pg_stat_replication on the primary node and assign wait_primary from there.
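To illustrate the idea, here is a minimal sketch of that check, not pg_auto_failover's actual implementation. It assumes the rows of `pg_stat_replication` have already been fetched from the primary as dictionaries. The function names `has_streaming_standby` and `suggested_primary_state` are hypothetical, and treating only `state = 'streaming'` as "connected" is an assumption made for this example.

```python
from typing import Iterable, Mapping

# Standard PostgreSQL view; a keeper-like process could run this on the primary.
REPLICATION_QUERY = """
SELECT application_name, state, sync_state
  FROM pg_stat_replication
"""


def has_streaming_standby(rows: Iterable[Mapping[str, str]]) -> bool:
    """Return True if at least one standby is currently streaming.

    Each row is expected to carry the 'state' column of pg_stat_replication.
    Counting only 'streaming' as connected is an assumption of this sketch.
    """
    return any(row.get("state") == "streaming" for row in rows)


def suggested_primary_state(rows: Iterable[Mapping[str, str]]) -> str:
    """Suggest a state for the primary based on pg_stat_replication rows.

    Hypothetical helper: with no streaming standby, suggest 'wait_primary'
    so that commits would no longer wait for a missing synchronous standby.
    """
    return "primary" if has_streaming_standby(rows) else "wait_primary"


# No row in pg_stat_replication: the standby is gone from the primary's view.
print(suggested_primary_state([]))

# One streaming synchronous standby: keep the regular primary state.
print(suggested_primary_state(
    [{"application_name": "node_2", "state": "streaming", "sync_state": "sync"}]
))
```

In a real deployment, the delay before a row disappears from `pg_stat_replication` depends on settings such as `wal_sender_timeout`, so detection would not be instantaneous.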