jberkus closed this issue 2 months ago
I have done some thinking along these lines. My notes on this:

- `synchronous_quorum`
- `synchronous_candidates + [leader]`
- it can get the xlog position of all but `synchronous_quorum` nodes
- the admin must force a failover with `patronictl failover` to acknowledge the possibility of data loss

A couple of examples to illustrate how it would work:

Simple 3 node cluster:

Same as earlier, but p2 can not be reached:

5 node cluster:
Ants,
That makes sense. It's also pointless to promote a node if there is no candidate sync rep, since that node will still not be able to accept synchronous writes, so that all works.
What I'm still concerned about is maintaining the list of candidate nodes, particularly in a "rolling blackout" situation. While you've eliminated a race condition by saying only the master can edit the list of nodes, there's still an inherent danger if the master is in the process of adding and removing nodes from the list and then goes dark. (This is not a new problem; Cassandra and similar databases have a lot of logic devoted to this issue, and it's why Raft requires a fixed cluster size.)
A possible way to ameliorate the uncertainty caused by adding and removing nodes would be simply to make it slow: for example, make the timeout on either adding or dropping a node from the list 5 minutes. The drawback is that it would take a lot longer for a cluster to right itself when the nodes come back.
The synchronous list maintenance is different from maintaining cluster consensus in Cassandra et al., as Patroni delegates the consensus problem to the DCS. Externalizing the consensus allows us to be quite flexible in what constitutes a consensus; we just have to follow ordering constraints to ensure the state stored in the DCS is always conservative/pessimistic. If increasing redundancy (going from 1 of 2 quorum to 1 of 3 quorum), replicate first and publish later; if increasing the number of candidates (from 1 of 2 to 2 of 3 quorum), publish first and replicate later.
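This ordering rule can be sanity-checked by brute force: at every intermediate step, every possible set of nodes that could have acknowledged a commit must share a node with every possible failover quorum. A throwaway sketch (the node names `m`, `a`, `b` and the concrete transition steps are illustrative assumptions, not Patroni code):

```python
from itertools import combinations

def safe(sync_set, repl_factor, candidate_set, quorum_size):
    # A state is safe if every minimal ack set (repl_factor nodes out of
    # SyncSet) shares a node with every possible quorum (quorum_size nodes
    # out of CandidateSet); otherwise a failover could lose an acked commit.
    return all(set(ack) & set(quorum)
               for ack in combinations(sorted(sync_set), repl_factor)
               for quorum in combinations(sorted(candidate_set), quorum_size))

m, a, b = "m", "a", "b"

# Increasing redundancy (1 of 2 quorum -> 1 of 3 quorum): replicate first.
assert safe({m, a}, 2, {m, a}, 1)         # start: ANY 1 (a), quorum 1 of {m, a}
assert safe({m, a, b}, 3, {m, a}, 1)      # step 1: replicate to b first
assert safe({m, a, b}, 3, {m, a, b}, 1)   # step 2: then publish b to the DCS
assert not safe({m, a}, 2, {m, a, b}, 1)  # publishing first would be unsafe

# Increasing candidates (1 of 2 -> 2 of 3 quorum): publish first.
assert safe({m, a}, 2, {m, a, b}, 2)      # step 1: publish b to the DCS first
assert safe({m, a, b}, 2, {m, a, b}, 2)   # step 2: then add b as sync candidate
assert not safe({m, a, b}, 2, {m, a}, 1)  # replicating first would be unsafe
```

The two `assert not` lines show why the order matters: doing the Postgres reconfiguration and the DCS publish in the wrong sequence leaves an intermediate state where a quorum could elect a node that never saw an acknowledged commit.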
To put it more formally: given a master with `synchronous_standby_names = ANY k (SyncStandbys)`, `SyncSet = SyncStandbys ∪ {master}`, `replication_factor = k + 1`, and a DCS state of any `quorum_size` of `CandidateSet`, then at any point the following invariant must hold:

    |CandidateSet ∪ SyncSet| < replication_factor + quorum_size

That is, any replication set and any quorum set of the specified sizes will have at least one overlap.

The user interface for this would generalize nicely over what we have now. The user would have to pick 2 things:
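The invariant is just the pigeonhole condition for that overlap, and the equivalence can be verified by enumerating subsets over a small universe of nodes. A quick self-contained check (not Patroni code; node names are arbitrary):

```python
from itertools import combinations

def overlap_guaranteed(sync_set, candidate_set, repl_factor, quorum_size):
    # Brute force: every repl_factor-subset of SyncSet must intersect
    # every quorum_size-subset of CandidateSet.
    return all(set(r) & set(q)
               for r in combinations(sorted(sync_set), repl_factor)
               for q in combinations(sorted(candidate_set), quorum_size))

def invariant(sync_set, candidate_set, repl_factor, quorum_size):
    # |CandidateSet ∪ SyncSet| < replication_factor + quorum_size
    return len(candidate_set | sync_set) < repl_factor + quorum_size

# The counting invariant and the brute-force overlap check agree for all
# feasible set and subset sizes over a 4-node universe.
universe = ["m", "a", "b", "c"]
for ns in range(1, 5):
    for nc in range(1, 5):
        sync_set, candidate_set = set(universe[:ns]), set(universe[-nc:])
        for r in range(1, ns + 1):
            for q in range(1, nc + 1):
                assert (invariant(sync_set, candidate_set, r, q)
                        == overlap_guaranteed(sync_set, candidate_set, r, q))
```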
How to migrate current configs over is still unclear, but the following mapping from current settings seems to make sense:
Ants,
Am I understanding this right? It seems like that design would result in increasing k to match the number of replicas, if max_fail isn't also increased.
In steady state the parameter values would be:

    SyncStandbys = CandidateSet \ {master}
    k = clamp(max_fail, min=min_replication_factor, max=|SyncStandbys|)
    quorum_size = |CandidateSet| - k
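Transcribed literally into Python (a sketch only; the `max_fail` and `min_replication_factor` knobs are taken from the formulas above, and `clamp` is the usual min/max bound):

```python
def clamp(value, lo, hi):
    # Bound value to the inclusive range [lo, hi].
    return max(lo, min(value, hi))

def steady_state(candidate_set, master, max_fail, min_replication_factor):
    # SyncStandbys = CandidateSet \ {master}
    sync_standbys = set(candidate_set) - {master}
    # k = clamp(max_fail, min=min_replication_factor, max=|SyncStandbys|)
    k = clamp(max_fail, min_replication_factor, len(sync_standbys))
    # quorum_size = |CandidateSet| - k
    quorum_size = len(candidate_set) - k
    return sync_standbys, k, quorum_size

# 5-node cluster that should tolerate one failed synchronous standby:
standbys, k, quorum_size = steady_state(
    {"p1", "p2", "p3", "p4", "p5"}, master="p1",
    max_fail=1, min_replication_factor=1)
# -> k = 1, quorum_size = 4
```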
I pushed a prototype of this as #672
Why do you need a list of all nodes, instead of setting `synchronous_standby_names = 'any 1 ( * )'`?
Postgres 10 adds support for synchronous quorum commit, which makes using synchronous replication to reduce data loss during failover much more practical for Patroni. Here's a draft of how this could work:
    any $synchronous_quorum ( list, of, all, nodes )
For example, if synchronous_quorum=1 and there are 4 nodes, the setting on each Postgres would be:
Questions: