patroni / patroni

A template for PostgreSQL High Availability with Etcd, Consul, ZooKeeper, or Kubernetes

RFC: design for supporting synchronous quorum commit in Patroni #664

Closed · jberkus closed this issue 2 months ago

jberkus commented 6 years ago

Postgres 10 adds support for synchronous quorum commit, which makes using synchronous replication to reduce data loss during failover much more practical for Patroni. Here's a draft of how this could potentially work:

  1. Clusters would get a new setting, synchronous_quorum(int), defaulting to 0.
  2. If turned on (> 0), all Patroni nodes would set synchronous_standby_names to a quorum entry of the form any $synchronous_quorum ( list, of, all, nodes ).
  3. The polling cycle would add an extra check to see if new nodes have been added, and if so, update synchronous_standby_names.

For example, if synchronous_quorum=1 and there are 4 nodes, the setting on each Postgres would be:

synchronous_standby_names = 'any 1 ( patroni-0, patroni-1, patroni-2, patroni-3 )'
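As a rough sketch of steps 2 and 3 above (the function name is made up for illustration, not Patroni's actual code), rendering that value from the member list could look like the following; step 3 would simply re-run it whenever the member list changes:

def build_sync_standby_names(members, synchronous_quorum):
    # The proposal puts every node in the list, so the rendered value is
    # identical on all members, as in the 4-node example above.
    if synchronous_quorum <= 0:
        return ''  # feature disabled
    return 'any {0} ( {1} )'.format(synchronous_quorum, ', '.join(sorted(members)))

# build_sync_standby_names(['patroni-0', 'patroni-1', 'patroni-2', 'patroni-3'], 1)
# returns: 'any 1 ( patroni-0, patroni-1, patroni-2, patroni-3 )'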

Questions:

ants commented 6 years ago

I have done some thinking along these lines. My notes on this:

A couple of examples to illustrate how it would work:

Simple 3 node cluster:

Same as earlier, but p2 cannot be reached:

5 node cluster:

jberkus commented 6 years ago

Ants,

That makes sense. It's also pointless to promote a node if there is no candidate sync rep, since that node will still not be able to accept synchronous writes, so that all works.

What I'm still concerned about is maintaining the list of candidate nodes, particularly in a "rolling blackout" situation. While you've eliminated a race condition by saying that only the master can edit the list of nodes, there's still an inherent danger if the master is in the middle of adding and removing nodes from the list and then goes dark. (This is not a new problem; Cassandra and similar databases devote a lot of logic to this issue, and it's why Raft requires a fixed cluster size.)

One way to reduce the uncertainty caused by adding and removing nodes would be simply to make it slow: set the timeout for adding or dropping a node from the list to 5 minutes, for example. The drawback is that it would take much longer for a cluster to right itself when nodes come back.
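A minimal sketch of that slow-down idea, assuming a simple in-memory map of pending changes (none of these names exist in Patroni), might look like:

import time

MEMBERSHIP_CHANGE_DELAY = 300  # the 5-minute example above, in seconds

def apply_aged_changes(candidates, pending, now=None):
    # candidates: current set of candidate node names
    # pending: dict of node name -> ('add' | 'remove', timestamp of the request)
    # Only changes older than MEMBERSHIP_CHANGE_DELAY are applied; younger ones stay pending.
    now = time.monotonic() if now is None else now
    result = set(candidates)
    for node, (action, requested_at) in list(pending.items()):
        if now - requested_at < MEMBERSHIP_CHANGE_DELAY:
            continue
        if action == 'add':
            result.add(node)
        else:
            result.discard(node)
        del pending[node]
    return result

The drawback mentioned above applies directly: a genuinely recovered node still waits out the full delay before it counts again.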

ants commented 6 years ago

Maintaining the synchronous list is different from maintaining cluster consensus in Cassandra et al., because Patroni delegates the consensus problem to the DCS. Externalizing the consensus allows us to be quite flexible in what constitutes a consensus. We just have to follow ordering constraints that ensure the state stored in the DCS is always conservative/pessimistic: if increasing redundancy (going from a 1-of-2 quorum to a 1-of-3 quorum), replicate first and publish later; if increasing the number of candidates (from 1 of 2 to 2 of 3), publish first and replicate later.
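To make the ordering concrete, here is a schematic of the two transitions; set_synchronous_standby_names and publish_quorum are placeholders for "change Postgres" and "change DCS", not real Patroni calls, and p0 is assumed to be the master:

def increase_redundancy(postgres, dcs):
    # DCS goes from "any 1 of (p0, p1)" to "any 1 of (p0, p1, p2)", which is only safe
    # once Postgres already requires two sync standbys: replicate first, publish later.
    postgres.set_synchronous_standby_names('ANY 2 (p1, p2)')
    dcs.publish_quorum(quorum_size=1, candidates=['p0', 'p1', 'p2'])

def increase_candidates(postgres, dcs):
    # DCS goes from "any 1 of (p0, p1)" to "any 2 of (p0, p1, p2)"; the stricter failover
    # quorum must be visible before Postgres starts counting p2: publish first, replicate later.
    dcs.publish_quorum(quorum_size=2, candidates=['p0', 'p1', 'p2'])
    postgres.set_synchronous_standby_names('ANY 1 (p1, p2)')

Either way, the state stored in the DCS stays on the conservative side for the whole duration of the transition.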

To put it more formally: given a master with synchronous_standby_names = ANY k (SyncStandbys), SyncSet = SyncStandbys ∪ {master}, replication_factor = k+1, and a DCS state of any quorum_size of CandidateSet, the following invariant must hold at all times:
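The invariant itself did not make it into the text above. One plausible reading, consistent with the steady-state values given later in the thread, is an overlap requirement: every quorum_size-sized subset of CandidateSet must share at least one node with every replication_factor-sized subset of SyncSet, so that any failover quorum is guaranteed to contain at least one node holding every synchronously acknowledged transaction. A brute-force checker for that assumed property (purely illustrative, not Patroni code):

from itertools import combinations

def overlap_invariant_holds(candidate_set, sync_set, quorum_size, replication_factor):
    # True iff every quorum_size-subset of candidate_set intersects
    # every replication_factor-subset of sync_set.
    for quorum in combinations(sorted(candidate_set), quorum_size):
        for holders in combinations(sorted(sync_set), replication_factor):
            if not set(quorum) & set(holders):
                return False
    return True

# 4-node example with k=1: replication_factor = 2, quorum_size = 4 - 1 = 3
# nodes = {'patroni-0', 'patroni-1', 'patroni-2', 'patroni-3'}
# overlap_invariant_holds(nodes, nodes, 3, 2)  -> True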

The user interface for this would generalize nicely over what we have now. The user would have to pick 2 things:

How to migrate current configs over is still unclear to me, but the following mapping from current settings seems to make sense:

jberkus commented 6 years ago

Ants,

Am I understanding this right? It seems like that design would result in increasing k to match the number of replicas, if max_fail isn't also increased.

ants commented 6 years ago

In steady state parameter values would be:

SyncStandbys = CandidateSet \ {master}
k = clamp(max_fail, min=min_replication_factor, max=|SyncStandbys|)
quorum_size = |CandidateSet| - k
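
A small worked example of those formulas (clamp is spelled out since Python has no builtin; max_fail and min_replication_factor are the knobs referenced earlier in the thread):

def steady_state(candidate_set, master, max_fail, min_replication_factor):
    sync_standbys = set(candidate_set) - {master}
    # clamp(max_fail, min=min_replication_factor, max=|SyncStandbys|)
    k = max(min_replication_factor, min(max_fail, len(sync_standbys)))
    quorum_size = len(candidate_set) - k
    return sync_standbys, k, quorum_size

# 4-node cluster, max_fail=1, min_replication_factor=1:
# steady_state({'patroni-0', 'patroni-1', 'patroni-2', 'patroni-3'}, 'patroni-0', 1, 1)
# -> ({'patroni-1', 'patroni-2', 'patroni-3'}, 1, 3), i.e. ANY 1 on Postgres and a failover quorum of 3
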
ants commented 6 years ago

I pushed a prototype of this as #672

haslersn commented 2 years ago

Why do you need a list of all nodes, instead of setting

synchronous_standby_names = 'any 1 ( * )'

?