rabbitmq / ra

A Multi-Raft implementation for Erlang and Elixir that strives to be efficient and make it easier to use multiple Raft clusters in a single system.
Other
830 stars 95 forks source link

Prevent non-voting (new and not-yet-caught-up) replicas from becoming cluster leader in certain scenarios #435

Closed sile closed 6 months ago

sile commented 7 months ago

Proposed Changes

In my understanding, ra doesn't assume that non-voters become the cluster leader. (If the understanding is not correct, self vote by non-voters should be fixed instead to prevent potential quorum violation.)

However, the current implementation sometimes allows non-voters to transition into pre_vote state. For instance, a non-voter can become the leader by calling ra:transfer_leadership/2. This PR addresses this problem by adding checks that prevent non-voters from transitioning into pre_vote state.

Types of Changes

What types of changes does your code introduce to this project? Put an x in the boxes that apply

Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask on the mailing list. We're here to help! This is simply a reminder of what we are going to look for before merging your code.

Further Comments

If this is a relatively large or complex change, kick off the discussion by explaining why you chose the solution you did and what alternatives you considered, etc.

michaelklishin commented 7 months ago

The OCI build failures are due to the fact that this is an external contribution, so Docker Hub secrets are not injected by Actions.

kjnilsson commented 7 months ago

@illotum - would you mind casting an eye at this?

illotum commented 7 months ago

@sile would you be able to provide reproduction steps that lead to self-election? I'd prefer to try fix it on the election side, right now self-vote is acceptable if it doesn't lead to promotion.

More context:

force_membership_change at al are not part of the log, so they may arrive before follower reaches its current state (during log replay). Additionally these out-of-band actions can be used by ops. For example we may want to promote witness to leader via manual force_shrink.

Overall, during the initial implementation, I tried and abandoned approach that relies on followers' knowledge of their state, for this and similar reasons.

sile commented 7 months ago

@illotum okay, I will share reproduction steps. But, would it be possible for you to review https://github.com/rabbitmq/ra/pull/427 beforehand?

I think that using non_voter is much easier than promotable to reproduce the self-voting problem. However, I cannot use non_voter for reproduction as it seems having another issue subtly relating to this problem.

If I can use non_voter as expected in #427, the reproduction steps would be as follows:

  1. Creates a single member cluster
  2. Adds a non_voter member (named A) to the cluster
  3. Calls ra:tansfer_leadership(A, A)
  4. A becomes the leader
sile commented 7 months ago

BTW, I'm bit confused about the phrase "right now self-vote is acceptable" because the following code in the ra main branch seems to prohibit that: https://github.com/rabbitmq/ra/blob/main/src/ra_server.erl#L1304-L1309

handle_follower(election_timeout,
                #{cfg := #cfg{log_id = LogId},
                  membership := Membership} = State) when Membership =/= voter ->
    ?DEBUG("~ts: follower ignored election_timeout, non-voter: ~p0",
           [LogId, Membership]),
    {follower, State, []};
illotum commented 7 months ago

My bad! We don't want non voters to call for election when they are up to date on state, you're correct.

What I meant by acceptable, is that it still happens when the member is not up to date. As I mentioned above, a voter may be demoted later in the log, but won't know it if cluster gets into election mid-sync. We rely on general Raft election mechanism to not accept those candidates.


Regarding your reproduction. In what situation you expect ra:tansfer_leadership(A, A) to happen without human intent? Membership status was deliberately left malleable to operators and developers, to allow its use in various scenarios.

For example non_voter is designed for permanent "witnesses" and those may be used as a passive backup. Backups need to be promoted back occasionally.

Or in other words, do you have a repro where system gets into undesirable state on its own?


I will have a look at #427 :)

sile commented 7 months ago

@illotum I see. Thank you for your explanation.

In what situation you expect ra:tansfer_leadership(A, A) to happen without human intent? ... Or in other words, do you have a repro where system gets into undesirable state on its own?

I think that the function will always be called with human intent. So, if it is intended behaviour (by ra developpers) that non_voter members could be the cluster leader, it's okay to me (the diff for try_become_leader and force_member_change should be removed).

If so, however, as I wrote in this PR's description, the self-vote by non-voters should be fixed instead I think. By this self-vote, non-voters always get +1 vote (by themselves) while they are excluded in the quorum. Thus, they can become the leader with fewer votes than what Raft algorithm requires. The following reproduction steps show this scenario (note that we need to use #427).

%% Run two Erlang nodes in separate terminals (foo@localhost is voter, and bar@localhost is non_voter)
$ rebar3 shell --sname foo@localhost
$ rebar3 shell --sname bar@localhost

%% Start ra
foo@localhost> ra:start().
bar@localhost> ra:start().

%% Start a single-member cluster.
foo@localhost> ClusterName = dyn_members.
foo@localhost> Machine = {simple, fun erlang:'+'/2, 0}.
foo@localhost> {ok, _, _} =  ra:start_cluster(default, ClusterName, Machine, [{dyn_members, 'foo@localhost'}]).

%% Add a non_voter member
foo@localhost> NonVoterServer = #{id => {dyn_members, 'bar@localhost'}, membership => non_voter}.
foo@localhost> ra:add_member({dyn_members, 'foo@localhost'}, NonVoterServer).
foo@localhost> ra:start_server(default, ClusterName, NonVoterServer, Machine, [{dyn_members,'foo@localhost'}]).

%% Stop foo@localhost node (=> only a non_voter member is running in the cluster)
foo@localhost> q().

%% At this time being, 'bar@localhost' considers 'foo@localhost' is the leader. 
bar@localhost> maps:get(leader_id, element(2,ra:member_overview(dyn_members))).
{dyn_members,foo@localhost}

%% Call `gen_statem:cast({dyn_members,bar@localhost},try_become_leader).`
%% 
%% [NOTE]
%% This is a bit tricky. 
%% But the same scenario could happen in the real world
%% if `ra:transfer_leadership({dyn_members,bar@localhost}, {dyn_members,bar@localhost})` is called 
%% when 'foo@localhost' node is alive, 
%% and the node aborted just after 
%% the leader ra server process called `gen_statem:cast({dyn_members,bar@localhost},try_become_leader)`.
bar@localhost> gen_statem:cast({dyn_members,bar@localhost},try_become_leader).

%% Now, 'bar@localhost' is elected as the leader even if there are no alive voters.
bar@localhost> maps:get(leader_id, element(2,ra:member_overview(dyn_members))).
{dyn_members,bar@localhost}
michaelklishin commented 7 months ago

It was never the intent to make non_voter leaders. If a replica is not allowed to participate in election, how much sense does it make to allow it to become a leader?

This is definitely not a scenario that this was introduced for in RabbitMQ :)

illotum commented 7 months ago

I can see the argument that a library should prohibit actions leading to bad states. In that case I'd recommend adding a "become_voter" API. My original idea was to complement non_voter with the ability to both promote and demote voters, to allow runtime control over membership state.


Regarding the repro, this seems to stem from the way we count required quorum. When you run one voter and one non_voter, you effectively run a Raft cluster of one. Raft cannot guarantee anything when you lose that one (voter) member. If there was no witness, you'd already have no data left.

This is a degraded state and by any means is not how Raft should be operated. As soon as you add another voter to your repro, non_voter loses its ability to self elect.

Edit. Just to reiterate. The only reason I can rely on follower's knowledge of cluster state in election_timeout is that election will not succeed unless follower is up to date on the log. In all other cases, you must also consider situations when a follower is behind and doesn't know its current state.

sile commented 6 months ago

@michaelklishin, @illotum

Thank you for your comments. However, I am a bit confused about what I should do next to proceed with this PR.

I feel that there is some controversy among reviewers about the best way to address this issue. So it may be better that ra team decides the approach to proceed this PR. (it is okay to close this PR in order to hand it over to you if it is desirable.)

Some notes about comments for the repro

Raft cannot guarantee anything when you lose that one (voter) member. If there was no witness, you'd already have no data left.

I think it is normal that the single voter member will restart without data loss.

As soon as you add another voter to your repro, non_voter loses its ability to self elect.

I might misunderstand something, but even if there is another voter member in the repro (unlike foo@localhost, this member does not have to be stopped), the non-voter member will still be able to become the leader. (In this case, bar@localhost will gain 2 votes (one from self-voting and another from the additional voter), and ra_server:required_quorum() will return 2.)

michaelklishin commented 6 months ago

@sile leave the change that is commented on as a legitimate improvement, back out the rest. Then I suspect this PR would have a chance of being accepted. Our team's focus is on chaos testing of Ra, so contributions with less-than-obvious value proposition do not receive much attention.

sile commented 6 months ago

@michaelklishin Thank you for your suggestion (fixed at https://github.com/rabbitmq/ra/pull/435/commits/01afd9b89182b96dcc45aa98d33a830255f20cf5).

Our team's focus is on chaos testing of Ra

Sounds interesting 👍

michaelklishin commented 6 months ago

The RabbitMQ OCI failures are not related to this change: for external contributions, Actions does not inject the necessary secrets.

michaelklishin commented 6 months ago

Merging per discussion with @kjnilsson.

michaelklishin commented 6 months ago

@sile thank you for your ongoing contributions to Ra!

sile commented 6 months ago

Thank you!