rabbitmq / ra

A Raft implementation for Erlang and Elixir that strives to be efficient and make it easier to use multiple Raft clusters in a single system.

Recovery mode in a new force_restart_server call/3 #308

Closed erlmachinedev closed 4 months ago

erlmachinedev commented 2 years ago

Proposed Changes

The Raft protocol itself is built around the notion of a leader, which handles cluster change requests and is elected by a majority of the nodes that form the cluster.

There is a use case where the majority of nodes is gone (during a network split or an outage) and we still need the data back, especially in the scenario where we have inter-datacenter communication and our requirement is to keep the minority group running after the incident.

This PR introduces the following calls:

force_restart_server(System, ServerId, FilterNodes)
force_restart_server(System, ServerId, FilterNodes, AddConfig)

These calls indicate in the client codebase that the cluster is forcefully reduced to the members listed in the FilterNodes inclusion list. We tried to make minimal adjustments so as not to break the general approach behind the library code and the Ra protocol itself.
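For illustration, here is a minimal usage sketch of the proposed calls, assuming they are exposed from the ra module, take ra's usual {Name, Node} server ids and the default system, and are run against the surviving nodes; all node and server names below are made up:

```erlang
%% Hypothetical example: only the minority-side nodes survived a partition,
%% so each surviving server is force-restarted with an inclusion list
%% containing just those nodes.
FilterNodes = ['ra@dc2a', 'ra@dc2b'],
ok = ra:force_restart_server(default, {my_sm, 'ra@dc2a'}, FilterNodes),
ok = ra:force_restart_server(default, {my_sm, 'ra@dc2b'}, FilterNodes).
```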

The state machine remains deterministic, and the whole improvement sits behind a filter_nodes setting that is passed through the mutable config during the restart. The overhead is minimal, since validate_cluster/1 is only called in three places: during recovery, when receiving a snapshot, and when processing a cluster change command.
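To make the filtering idea concrete, here is a rough sketch of what a validate_cluster/1-style helper could do. This is not the PR's actual code; it assumes the in-memory cluster is a map keyed by {Name, Node} server ids, as in Ra:

```erlang
%% Sketch only: keep cluster members whose node appears in the inclusion
%% list; an unset filter (undefined) leaves the cluster untouched.
filter_cluster(Cluster, undefined) ->
    Cluster;
filter_cluster(Cluster, FilterNodes) ->
    maps:filter(fun({_Name, Node}, _PeerState) ->
                        lists:member(Node, FilterNodes)
                end, Cluster).
```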

This PR also includes an enq_drain_recovery test, in addition to the default enq_drain_basic one in the test SUITE, which plays the split-and-recovery scenario through the nemesis.

The respective documentation is added (plus small fixes), along with one more sanity test.

Types of Changes

What types of changes does your code introduce to this project? Put an x in the boxes that apply

Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask on the mailing list. We're here to help! This is simply a reminder of what we are going to look for before merging your code.

Further Comments

P.S. We see a quite fresh and interesting improvement from @kjnilsson, but we still need a multi-DC setup running, which would allow us to handle quorum queues in a multi-node setup as well.

We also took this issue into account.

pivotal-cla commented 2 years ago

@erlmachinedev Please sign the Contributor License Agreement!

Click here to manually synchronize the status of this Pull Request.

See the FAQ for frequently asked questions.

pivotal-cla commented 2 years ago

@erlmachinedev Thank you for signing the Contributor License Agreement!

kjnilsson commented 2 years ago

Thanks for this. I've taken a look and my main concern is that this allows you to change the effective membership on a temporary basis only (unless you do a membership change at which point it ends up persisting the filtered cluster).

Say in a cluster A, B, C, D, E, nodes D and E form a partitioned minority. You force-boot them with a filtered cluster of [D, E] while [A, B, C] keep making independent progress. They run for a while, connectivity is then restored, and then what happens? You now have two very different histories that cannot be merged (you cannot simply restart D and E without the filter), so you are going to have to manually pick one, delete any nodes with a different history, and remove and re-add them.

Now, the same applies to my PR #306, but in that case the membership change is made permanent in a manually changed member that you then use to seed the rest of the cluster from. This could also be used to restore availability on a minority side.

The point is that I don't see any benefit in keeping the membership change temporary, as there is no way to remove the filter safely once it is in place.

kjnilsson commented 2 years ago

If the goal is to list some nodes that the cluster won't contact on boot, we should rename the setting to something like "nodes expected to be unavailable".

@kjnilsson and I had a different approach in mind but never finished it. I wouldn't say it was drastically different so we can see if this PR can be polished to a mergeable state.

I am reluctant to merge this without further discussing the issues I outlined in my comment above.

luos commented 2 years ago

Thank you for reviewing these changes.

We will definitely change any naming to something better as needed. :)

Our primary goal is to recover service, with as many messages recovered as possible, after it has been made sure that the lost nodes cannot be recovered. That is, we have a side channel to verify that those nodes are gone for good. After the cluster is recovered from the minority, we would also like to grow the cluster back to its original size (with or without the original node names).

It would be good to be able to restore from any number of minority nodes; in a 5-node cluster this may mean 1 or 2 nodes. The reason we'd like to have two-node recovery is that the Followers may be behind the Leader by different amounts, so we'd like to give the cluster the best chance of recovering as much data as possible.

The filter_nodes parameter in this PR expects the list of nodes which are still considered part of the cluster. The thinking here is that from RabbitMQ we can provide the list of nodes we still consider cluster members. In conjunction with forget_cluster_node we can therefore forget the nodes which are lost. This way we can rely on which nodes mnesia considers cluster members.
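As a purely illustrative sketch of that idea (the glue below is hypothetical and not part of this PR or RabbitMQ): once forget_cluster_node has removed the lost members, mnesia's own view of its database nodes can serve as the inclusion list.

```erlang
%% Hypothetical glue: mnesia:system_info(db_nodes) returns the nodes mnesia
%% still considers cluster members, which we pass as the inclusion list when
%% force-restarting the local ra server.
FilterNodes = mnesia:system_info(db_nodes),
ok = ra:force_restart_server(default, {my_sm, node()}, FilterNodes).
```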

I checked out #306 as well, and I think I got what you are trying to do there. I think it could also be changed to accept a list of valid nodes, thereby electing the most up-to-date Follower, though as I understand it, for that we would need to keep the commit index and similar data instead of resetting it. The reason we did not really want to do this is that we would have to coordinate the changes with all "up" nodes somehow, and in this PR we rely on mnesia for that. I am guessing we would need a multi-step procedure of setting the new cluster members on all peers and then forcing an election.

What we were also going for is easy usage from RabbitMQ. As you can see, you can just call forget_cluster_node for each down node, then restart the nodes with this change, recovering the queues. This should (in theory) protect against "lost" nodes rejoining the cluster, though in our testing we ran into some issues with that, and nodes rejoined the mnesia cluster automatically.

You are correct that this change would "remember" these lost nodes forever. We could not come up with a good solution to that. Maybe we could have a command that forces a cluster membership change into the ra log? We would like to ask your opinion and would be happy to implement it. We'd prefer it if the lost nodes were forgotten.

In our testing, if a node which was "lost" came back, we definitely saw some weird behaviour with leaders being re-elected. I think the cluster will need to reject these peers in some way.

One drawback of our changes is that they require a restart, but we think that is acceptable in this scenario. Another is that the filter_nodes param is pre-set on restart, so it should only be set after the cluster is fully built.

luos commented 1 year ago

Hi @kjnilsson , do you have any suggestions on how to proceed on this?

Maybe we could adopt your strategy of setting the cluster nodes with a call, but instead of resetting the current state, we could keep it? That would make it possible to elect the most up-to-date member.

kjnilsson commented 4 months ago

Closing, as we already have a recovery mode that works ok.