RFC: design for supporting synchronous quorum commit in Patroni

jberkus commented 6 years ago

With Postgres 10, support for synchronous quorum commit is a feature, and makes using synchrounous replication to reduce data loss during failover much more practical for Patroni. Here's a draft of how this would potentially work:

Clusters would get a new setting, synchronous_quorum(int), defaulting to 0.
If turned on (> 0), all Patroni nodes would set synchronous_standby_names to a quorum of any $synchronous_quorum ( list, of, all, nodes )
The polling cycle would add an extra check to see if new nodes have been added, and if so, update synchronous_standby_names.

For example, if synchronous_quorum=1 and there are 4 nodes, the setting on each Postgres would be:

synchronous_standby_names = 'any 1 ( patroni-0, patroni-1, patroni-2, patroni-3 )'

Questions:

We don't have cluster settings, patroni settings are per-node. If some nodes had synchronous_quorum turned on, and other nodes did not, it could result in unexpected behavior. Is this a real problem?
In the kubernetes-native version, we no longer maintain a list of all nodes in the DCS. Presumably we would need to add one to the ConfigMap for this to work? Or is it sufficient to estimate based on the statefulset size?
How do we want to handle nodes dying or being removed? Do we want to update the list of replicas? If so, how do we want to handle this in order to not create a data-losing race condition? If we don't update automatically, how do we want to enable administrators to do a cleanup?
Will admins be able to push a config change of synchronous_quorum=0 to a single remaining node, in the event that they're down to one node and rescuing the cluster?

ants commented 6 years ago

I have done some thinking along these lines. My notes on this:

The list of nodes currently considered synchronous should be stored in SyncState in DCS. As would be the value of synchronous_quorum.
A node can consider itself healhy if:
1. It is considered a synchronous candidate by SyncState
2. From list of synchronous_candidates + [leader] it can get the xlog position of all but synchronous_quorum nodes.
Leader manages the list of synchronous candidates. Compare-and-swap semantics avoid issues with races.
If master can not be elected due to synchronous_quorum the admin must force a failover with patronictl failover to acknowledge the possibility of data loss.

A couple of examples to illustrate how it would work:

Simple 3 node cluster:

leader=p0 candidates=p1,p2 quorum=1
p0 fails
p1 attempts to contact p0 and p2.
- p0 times out
- p2 can be reached.
p1 can promote itself if xlog position is not behind of p2.

Same as earlier, but p2 can not be reached:

p1 can not promote itself, because p2 might have latest transaction.

5 node cluster:

leader=p0 candidate=p1,p2,p3,p4 quorum=2
p0 and p2 fail
p1 attempts to contact p0,p2,p3,p4
- Gets response from p3 and p4
Because atleast 2 of {p1,p2,p3,p4} have latest commit and we know xlog positions of {p1,p3,p4} we can consider ourselves able to promote.

jberkus commented 6 years ago

Ants,

That makes sense. It's also pointless to promote a node if there is no candidate sync rep, since that node will still not be able to accept synchronous writes, so that all works.

What I'm still concerned about is maintaining the list of candidate nodes, particually in a "rolling blackout" situation. While you've eliminated a race condition by saying only the master can edit the list of nodes, there's still an inherent danger if the master is in the process of adding and removing nodes from the list and then goes dark (this is not a new problem, Cassandra and similar databases have a lot of logic devoted to this issue, and it's why RAFT requires a fixed cluster size).

A possible amelioration of uncertainty caused by adding and removing nodes would be just to make it slow; make the timeout on either adding or dropping a node from the list 5 minutes, for example. The drawback to that is that it would take a lot longer for a cluster to right itself when the nodes come back.

ants commented 6 years ago

The synchronous list maintenance is different from maintaining cluster consensus in Cassandra, et al. as Patroni delegates the consensus problem to DCS. Externalizing the consensus allows us to be quite flexible in what constitutes a consensus. We just have to follow ordering constraints to ensure state stored in DCS is always conservative/pessimistic. If increasing redundancy (going from 1 of 2 quorum to 1 of 3 quorum) then replicate first, publish later, if increasing number of candidates (from 1 of 2 to 2 of 3 quorum) then publish first, replicate later.

To put it more formally, given a master with synchronous_stanby_names = ANY k (SyncStandbys), SyncSet = SyncStandbys ∪ {master}, replication_factor=k+1, and DCS state of any quorum_size of CandidateSet, then at any point the following invariant must hold:

|CandidateSet ∪ SyncSet| < replication_factor + quorum_size => any replication set and any quorum set of specified sizes will have at least one overlap.

The user interface for this would generalize nicely over what we have now. The user would have to pick 2 things:

Maximum number of nodes that can fail at once before quorum is lost.
Minimum replication factor.

I haven't yet figured out how to migrate current configs over is still unclear, but the following mapping from current settings seems to make sense:

synchronous_mode=true, synchronous_mode_strict=false -> max_fail=1, min_replication_factor=0
synchronous_mode=true, synchronous_mode_strict=true -> max_fail=1, min_replication_factor=1

jberkus commented 6 years ago

Ants,

Am I understanding this right? It seems like that design would result in increasing k to match the number of replicas, if max_fail isn't also increased.

ants commented 6 years ago

In steady state parameter values would be:

SyncStandbys = CandidateSet \ {master}
k = clamp(max_fail, min=min_replication_factor, max=|SyncStandbys|)
quorum_size = |CandidateSet| - k

ants commented 6 years ago

I pushed a prototype of this as #672

haslersn commented 2 years ago

Why do you need a list of all nodes, instead of setting

synchronous_standby_names = 'any 1 ( * )'

?

patroni / patroni

RFC: design for supporting synchronous quorum commit in Patroni #664