sorintlab / stolon

PostgreSQL cloud native High Availability and more.

cluster: support cascading replication #17

sgotti opened this issue 9 years ago (Open)

lawrencejones commented 5 years ago

Hey @sgotti! Hope you don't mind me reviving a four-year-old issue.

I've drafted an approach for supporting cascading replication here: https://github.com/gocardless/stolon/pull/13. The mechanics are to have each synchronous standby replicate from the primary, then spread the async standbys evenly across the available syncs.
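As a rough sketch of that assignment step, something like the following round-robin spread is what I have in mind; the names here are illustrative only, not the actual code in the PR:

```go
package main

import "fmt"

// assignFollowUpstreams spreads async keepers round-robin across the
// available sync standbys so no single sync carries all of the async
// replication load. Function and keeper names are made up for this example.
func assignFollowUpstreams(syncs, asyncs []string) map[string]string {
	followUpstream := make(map[string]string, len(asyncs))
	if len(syncs) == 0 {
		return followUpstream // nothing to cascade from
	}
	for i, async := range asyncs {
		followUpstream[async] = syncs[i%len(syncs)]
	}
	return followUpstream
}

func main() {
	syncs := []string{"keeper-sync-0", "keeper-sync-1"}
	asyncs := []string{"keeper-async-0", "keeper-async-1", "keeper-async-2"}
	for async, sync := range assignFollowUpstreams(syncs, asyncs) {
		fmt.Printf("%s replicates from %s\n", async, sync)
	}
}
```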

I'll follow up below with our (GoCardless') motivation for this feature, but my key question is: how did you envision this working? The draft PR is an opinionated cascading replication setup that works for us, but perhaps we want to provide a more general interface? Something like:

CascadingReplicationLimits []struct{Count, MaxFollowers int}

Where [(1, 1), (1, 2), (0, 3)] would be interpreted as:

This is significantly more complex than the opinionated version, but it can generically define a cascading topology that caters for even the largest clusters (tens of nodes) while sharing the load fairly. The PR I've linked would be expressed as [(1, 1), (0, 0)] in this scheme, while the existing stolon behaviour would be [(1, 0)].
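To make that interface concrete, here's a rough Go sketch of one possible reading of those tuples. It assumes each entry describes one tier of the replication tree, that Count is how many nodes sit at that tier (0 meaning "all remaining nodes"), and that MaxFollowers caps how many standbys each of them may serve (0 meaning unlimited). This is just one interpretation of the examples above, not a settled design:

```go
package main

import "fmt"

// CascadingReplicationLimit is the per-tier limit from the proposal above.
// The semantics assumed here (0 == unlimited / "all remaining nodes") are
// only one possible interpretation.
type CascadingReplicationLimit struct {
	Count        int // how many nodes sit at this tier; 0 = all remaining nodes
	MaxFollowers int // how many standbys each of them may serve; 0 = unlimited
}

func main() {
	examples := []struct {
		name   string
		limits []CascadingReplicationLimit
	}{
		// Existing stolon behaviour: the primary feeds every standby directly.
		{"existing behaviour [(1, 0)]", []CascadingReplicationLimit{{1, 0}}},
		// The linked PR: one standby off the primary, everything else cascades from it.
		{"linked PR [(1, 1), (0, 0)]", []CascadingReplicationLimit{{1, 1}, {0, 0}}},
		// A deeper tree for large (tens of nodes) clusters.
		{"deeper tree [(1, 1), (1, 2), (0, 3)]", []CascadingReplicationLimit{{1, 1}, {1, 2}, {0, 3}}},
	}
	for _, e := range examples {
		fmt.Printf("%-40s => %+v\n", e.name, e.limits)
	}
}
```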

I'm happy to do the work for whichever approach we decide on, but would love your feedback.


Reducing replication strain on our primary is an important motivation for this change: serving so many replication streams increases network latency on the primary, which in turn increases workload latency because sync acknowledgements return more slowly. More than that, though, there's a race when everything replicates from the primary that we need to avoid.

Right now, failing over the cluster risks prolonged downtime for us. We run with a single synchronous standby and one async, with stolon configuring both to replicate from the primary. It's possible for the master to write some WAL locally, successfully ship that WAL to the async and, while waiting for acknowledgement from the sync, crash, leading to a failover. This happens almost every time we issue a failkeeper in our staging environment, and the chance of it occurring increases with database activity.

Stolon will then promote the sync, but the original master and the async have now forked timelines. Both standbys now require a resync against the newly promoted master, and we'll be unavailable for writes until both are resynced. Hopefully pg_rewind works (https://github.com/sorintlab/stolon/pull/644), which keeps downtime to about 30s-1m, but if both fall back to a basebackup then we're down for 2hrs 😭

Cascading the replication so async standbys replicate from syncs prevents the async from ever running ahead of the timeline we expect to promote, effectively solving this problem. I've tested this on our staging cluster and can confirm it prevents us from having to resync.

sgotti commented 5 years ago

@lawrencejones This issue was opened a long time ago as a possible enhancement, but no one asked for it and we mainly focused on stolon's primary purpose (high availability) rather than on managing a complex topology of postgres instances.

Coming back to your use case/issue, I think it deserves a dedicated issue since it's not deeply related to cascading replication. I'll continue here (but it would be better if you open a new one and I'll continue there, copying this answer).

Your issue is known, and it's actually a postgres sync repl "issue" related to how sync replication is implemented. I hadn't thought much about it because we usually have many small (order of GiB) databases (since we try to avoid big databases as much as possible) that usually resync in less than a minute.

We already overcame another issue related to postgres sync repl in commit 87766c982c3fa8fc2ac899165dce690c18f5655f.

Here is my analysis and proposal:

Let's start with some possible sync repl configurations:

  1. 2 keepers with sync repl: this doesn't make sense today, since a single unavailable keeper will block everything. There was a proposal for a "soft" sync repl feature (disable sync repl when there aren't enough minSynchronousStandbys alive), but that's a separate case that should be covered on its own. I mention it here only to add context and to explain why people should implement sync repl with one of the two configurations below:

  2. 3 keepers with sync repl, minSynchronousStandbys == 1 and maxSynchronousStandbys == 1: this ends up with one sync standby and one async standby, and if I understand correctly this is your case. It is affected by this issue.

  3. 3 keepers with sync repl, minSynchronousStandbys == 1 and maxSynchronousStandbys == 2: another way to handle synchronous replication. It's also affected by this issue.

Your proposed solution can be summed up in one statement: "make all async standbys replicate from sync standbys instead of the primary". This will fix case 2 but not case 3, since in that case there are only synchronous standbys.

My proposed solution:

In this way, in both cases 2 and 3, the remaining standby (sync or async) will be at the same or a lower XLogPos and can become a standby of the newly promoted master without needing a rewind or a full resync.
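For reference, the "same or lower XLogPos" check this relies on is just a comparison of two postgres LSNs. A minimal sketch, assuming the textual pg_lsn format "X/Y" with both halves in hex (not stolon's actual implementation):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseLSN converts a textual pg_lsn value like "16/B374D848" into a
// single uint64 so positions can be compared numerically.
func parseLSN(s string) (uint64, error) {
	parts := strings.Split(s, "/")
	if len(parts) != 2 {
		return 0, fmt.Errorf("bad lsn %q", s)
	}
	hi, err := strconv.ParseUint(parts[0], 16, 32)
	if err != nil {
		return 0, err
	}
	lo, err := strconv.ParseUint(parts[1], 16, 32)
	if err != nil {
		return 0, err
	}
	return hi<<32 | lo, nil
}

func main() {
	promoted, _ := parseLSN("16/B374D848")  // XLogPos of the standby being promoted
	remaining, _ := parseLSN("16/B374D7F0") // XLogPos of the remaining standby

	if remaining <= promoted {
		fmt.Println("remaining standby can follow the new master without rewind/resync")
	} else {
		fmt.Println("remaining standby is ahead: pg_rewind or a full resync is needed")
	}
}
```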