rabbitmq / ra

A Raft implementation for Erlang and Elixir that strives to be efficient and make it easier to use multiple Raft clusters in a single system.

WIP: Implement passive members for a ra cluster #367

Closed luos closed 1 year ago

luos commented 1 year ago

Proposed Changes

Passive members do not vote in elections and are ignored for consensus purposes. Passive members replicate all traffic from the leader. Passive members can be forced to become active members of the cluster.

Not handled yet: demoting active members to passive, handling active/active collisions (a.k.a. split-brain), and some smaller changes marked as TODOs.
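To make the intended consensus behaviour concrete, here is a minimal sketch of how passive peers could be excluded from the quorum calculation. The module, record fields and function below are illustrative only and are not the actual ra internals or this diff:

```erlang
%% Illustrative sketch only: not the actual ra internals.
-module(passive_quorum_sketch).
-export([quorum_index/1]).

-record(peer, {match_index = 0 :: non_neg_integer(),
               status = active :: active | passive}).

%% Only active peers count towards the commit index / quorum; passive peers
%% still receive entries but their match index is ignored here.
%% Assumes at least one active peer (e.g. the leader itself) is present.
-spec quorum_index(#{term() => #peer{}}) -> non_neg_integer().
quorum_index(Peers) ->
    ActiveIdxs = lists:sort(
                   [P#peer.match_index
                    || P <- maps:values(Peers), P#peer.status =:= active]),
    N = length(ActiveIdxs),
    Majority = N div 2 + 1,
    %% the highest index replicated on at least a majority of active peers
    lists:nth(N - Majority + 1, ActiveIdxs).
```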

Working towards issue #44.

Types of Changes

What types of changes does your code introduce to this project? Put an x in the boxes that apply

Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask on the mailing list. We're here to help! This is simply a reminder of what we are going to look for before merging your code.

luos commented 1 year ago

Hi,

Yes, definitely, I can see both approaches have some advantages. We did not want to change the cluster map because that is more ephemeral state; this way only the consensus part needs changes and nothing else. TBH, we started with that approach but gave up on it for ease of implementation. :-) We also don't want (at least as the first round of changes) passive peers to be any less of a peer than active peers - except that they don't participate in elections.

We decided against introducing a new statem state to avoid touching the state machine too deeply and risking missed code paths.

In this change, passive peers are never automatically promoted to leaders or to active followers. It is not possible to automate this with any kind of safety: if passive peers were promoted automatically during a network partition, it would jeopardize the safety and consistency of the Raft cluster.

Our change targets a disaster recovery scenario in which recovering the majority is not possible. In a DR scenario, though, it can technically happen that the former leader tries to rejoin the cluster, which would lead to a split-brain scenario.

In such a split brain, there is no guarantee that the old leader's term is lower than the newly promoted leader's. For example, in a 6-node cluster with a 3 active + 3 passive setup, a network partition could trigger a disaster recovery scenario in which the 3 passive nodes are promoted and elect a leader at term + 1. If we cannot shut down the three former active nodes, they could go through two leader elections while separated in their own data center, increasing their terms by 2, so when they try to rejoin they would form the preferred partition, which is not what the operator intended during the DR procedure.
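A quick worked example of the term arithmetic above (the numbers are only for illustration):

```erlang
StartTerm    = 5,
DrTerm       = StartTerm + 1, %% passive side: DR promotion + one election -> 6
IsolatedTerm = StartTerm + 2, %% former actives: two elections while cut off -> 7
%% on rejoin the higher term wins, so the old partition would be preferred
true = IsolatedTerm > DrTerm.
```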

Any kind of auto-promotion would have to handle this scenario, which cannot be done safely; RabbitMQ's classic queue mirroring suffers from the same problem.

In an ideal world, whenever an active node is demoted to passive, that node would have to be deleted, which may or may not be possible during network partitions.

Keeping a disaster recovery scenario in mind, there might not be any active leader present, hence the force_change_passive_members function. It must be handled by the local node, even if it is a passive follower, to unblock the election procedure and elect a new leader.
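As an illustration, an operator-side invocation could look roughly like the snippet below. The module, name and arity of the forced call are assumptions based on the function described above, and the server IDs are made up:

```erlang
%% Hypothetical operator snippet for a DR promotion; names are illustrative.
LocalId   = {my_cluster, node()},          %% surviving passive member (local)
NewActive = [{my_cluster, 'ra@dr1'},
             {my_cluster, 'ra@dr2'},
             {my_cluster, 'ra@dr3'}],      %% passive members to promote
%% handled locally even without a leader; as described above, it also
%% unblocks the election so one of the promoted members becomes leader
ok = ra:force_change_passive_members(LocalId, NewActive).
```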

If the cluster is up and healthy, we can introduce a non-forced version of the command, which goes through the leader/consensus. We can include that in a future commit.

illotum commented 1 year ago

It is not possible to automate this with any kind of safety: if passive peers were promoted automatically during a network partition, it would jeopardize the safety and consistency of the Raft cluster.

Perhaps I explained my use case poorly. I want auto-promotion to avoid joining slow members to the quorum. A node would go through a "catch up" period as a passive follower and be promoted to full member once its log is close to the leader's. If we do it via a configuration change identical to adding a new cluster member -- the log append, confirms, and the rest -- promotion should be safe. And of course, if there is no majority, there should be no promotion.

As a corollary, I think forcing commands through a minority might be better served by a separate PR (and could be more generic). It would be nice to be able to force arbitrary commands.

Overall, I'm no @kjnilsson, but with Ra being a fundamental library, I'd value versatility and correctness over diff size.


Not to discount your use case BTW, disaster recovery is a very welcome feature for us too! LMK if I can somehow help.

Giving it some thought: is it really necessary to grow the cluster before recovering from disaster? Why not force a configuration change to a cluster of one, including an election and a term tick, and then add fresh members to it? Looking at prior art, that's what etcd does, to my understanding.
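For comparison, that flow might look roughly like the sketch below. Here force_shrink_to_local/1 is a placeholder for whatever the forced configuration change ends up being (it is not an existing ra function), while ra:add_member/2 is an existing ra API:

```erlang
%% Rough sketch of the "shrink to one, then regrow" recovery discussed above.
%% force_shrink_to_local/1 is a placeholder, not an existing ra function.
recover(SurvivorId, FreshMemberIds) ->
    %% force the survivor into a cluster of one, bumping its term
    ok = force_shrink_to_local(SurvivorId),
    %% regrow through normal, consensus-backed membership changes; each fresh
    %% member must also be started (ra:start_server) before it is added
    lists:foreach(
      fun(NewId) ->
              {ok, _, _} = ra:add_member(SurvivorId, NewId)
      end, FreshMemberIds).
```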

kjnilsson commented 1 year ago

Perhaps I explained my use case poorly. I want auto-promotion to avoid joining slow members to the quorum. A node would go through a "catch up" period as a passive follower and be promoted to full member once its log is close to the leader's.

I think this is a very reasonable feature and something that just didn't get done originally. I would encourage a separate pull request for this only. It should just be a temporary state flag on the peer that is set by default. The tricky bit would be to determine what "close to the leader" means. I think it would be sufficient to record the current index at the time the new member is added, and when the member reaches that index it can be included in consensus calculations. The assumption is that this will help when adding members to systems with very long logs.

The index needs to be recorded in the cluster change command metadata so that we can ensure members aren't able to flip back and forth between being included in consensus and not (e.g. if a leader election takes place before they have reached the pivot index).
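A minimal sketch of that idea; the module, record fields and helper below are illustrative and not the actual ra internals:

```erlang
%% Illustrative only: not the actual ra peer state.
-module(nonvoter_sketch).
-export([maybe_promote/1]).

-record(peer, {match_index = 0 :: non_neg_integer(),
               %% log index recorded in the cluster change command when the
               %% member was added; the peer does not count for consensus
               %% until it has replicated up to this point
               voter_after :: non_neg_integer() | undefined}).

%% Called whenever the peer acknowledges appended entries.
maybe_promote(#peer{voter_after = Pivot, match_index = MatchIdx} = Peer)
  when is_integer(Pivot), MatchIdx >= Pivot ->
    %% the pivot lives in the replicated cluster change command, so a new
    %% leader recomputes the same decision and the peer cannot flip back
    Peer#peer{voter_after = undefined};
maybe_promote(Peer) ->
    Peer.
```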

kjnilsson commented 1 year ago

Giving it some thought: is it really necessary to grow the cluster before recovering from disaster? Why not force a configuration change to a cluster of one, including an election and a term tick, and then add fresh members to it? Looking at prior art, that's what etcd does, to my understanding.

This is the approach taken in https://github.com/rabbitmq/ra/pull/306

Requiring recovery from a single member simplifies things substantially and reduces the risk of ending up in incomprehensible cluster states.

luos commented 1 year ago

Hi @kjnilsson ,

After reading your comment, do you think there is any chance something like this could be implemented and accepted with multiple passive followers, as in this change?