2 nodes + witness = 3 data centers (problem case detected)

sgrinko commented 1 year ago

Hi, and thank you to the developers for your work!

Now about the problem :)

There are 3 DCs:

Two of them host the data nodes (primary and secondary).
The third DC hosts the witness.

In the current configuration, we have synchronous replication between nodes.

If the connection between the two nodes breaks while the witness can still reach each node (in the other DCs), synchronous replication is not disabled automatically. As a result, requests hang on the commit command. We cannot afford to wait until enough lag accumulates for the witness to react.

Is it possible to react to this kind of network failure?

xinferum commented 1 year ago

Good afternoon, developers.

This issue is critical for us: it currently prevents us from placing the monitor in the third data center.

One option would be for the keeper on the primary data node to check whether any replicas are connected to it and, if there are none, report this to the monitor and switch replication from synchronous to asynchronous mode.

dimitri commented 9 months ago

It should be possible to see a missing row in pg_stat_replication on the primary node and assign wait_primary from there.
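
To make the idea concrete, here is a minimal sketch of the decision in Python. It is not the pg_auto_failover implementation. The helper names (has_streaming_standby, suggest_primary_state) and the dict-per-row input are hypothetical, chosen only for illustration; the column names come from the pg_stat_replication view, and primary / wait_primary are the node states discussed above.

```python
from typing import Iterable, Mapping

# Query a keeper-like process could run on the primary (illustrative only).
REPLICATION_QUERY = (
    "SELECT application_name, state, sync_state FROM pg_stat_replication"
)


def has_streaming_standby(rows: Iterable[Mapping[str, str]]) -> bool:
    """Return True if at least one row reports a standby in 'streaming' state."""
    return any(row.get("state") == "streaming" for row in rows)


def suggest_primary_state(
    current_state: str, rows: Iterable[Mapping[str, str]]
) -> str:
    """Hypothetical helper: suggest the state to assign to the primary.

    Assumption: a primary with no streaming standby in pg_stat_replication
    should move to wait_primary, so that commits stop waiting for a standby
    that cannot acknowledge them.
    """
    if current_state == "primary" and not has_streaming_standby(rows):
        return "wait_primary"
    return current_state
```

The rows would come from running REPLICATION_QUERY on the primary with any PostgreSQL client. A real check would probably also need a grace period, so that a brief reconnect of the standby does not immediately drop synchronous replication.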

hapostgres / pg_auto_failover

2 nodes + witness = 3 data centers (problem case detected) #997