rabbitmq / ra

A Raft implementation for Erlang and Elixir that strives to be efficient and make it easier to use multiple Raft clusters in a single system.

Fix a bug where the leader is never elected even if the majority of the members are alive. #442

Closed sile closed 1 month ago

sile commented 1 month ago

Proposed Changes

This PR addresses the issue reported in #439.

To summarize #439: if one member is in the candidate state and another is in the pre_vote state, and the pre_vote member has a higher log index than the candidate, neither of them can ever be elected leader. (This holds even if there are up to N / 2 - 1 additional followers without election timers, where N is the cluster size.)

This PR adds a branch to ra_server:handle_candidate(#pre_vote_rpc{}, ...) to handle the case where the pre_vote member has a higher log index. With the new branch, a candidate that receives such a message transitions to the follower state.
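
For readers unfamiliar with the code, the decision made by the branch described above can be sketched as a standalone function. This is a minimal illustration and not the actual ra_server code: the module name pre_vote_sketch, the function candidate_next_state/2, and the use of bare integer log indexes are all assumptions made for this example.

```erlang
%% Hypothetical sketch; none of these names exist in ra.
-module(pre_vote_sketch).
-export([candidate_next_state/2]).

%% Decide which state a candidate should move to after receiving a
%% pre-vote request, given the sender's last log index and its own.
-spec candidate_next_state(non_neg_integer(), non_neg_integer()) ->
          follower | candidate.
candidate_next_state(SenderLastIndex, OwnLastIndex)
  when SenderLastIndex > OwnLastIndex ->
    %% The sender's log is ahead, so step down and let it proceed.
    follower;
candidate_next_state(_SenderLastIndex, _OwnLastIndex) ->
    %% Otherwise keep running the current election.
    candidate.
```

Under these assumptions, pre_vote_sketch:candidate_next_state(10, 7) returns follower, while equal or lower sender indexes leave the member in the candidate state.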

I think this is somewhat ad hoc. However, since I'm not very familiar with the ra code base (especially the role of the pre_vote state), I kept the patch small to limit its impact. Feel free to suggest better alternatives.

Closes #439.

FYI

Applying the reproduction patch from issue #439 to this PR branch produces the following output:

(foo@localhost)1> repro:run().
# create cluster
* [repro_a] init
* [repro_c] init
* [repro_b] init
* [repro_a] state_enter: recover
* [repro_c] state_enter: recover
* [repro_b] state_enter: recover
* [repro_a] state_enter: recovered
* [repro_c] state_enter: recovered
* [repro_b] state_enter: recovered
* [repro_a] state_enter: follower
* [repro_c] state_enter: follower
* [repro_b] state_enter: follower
* [repro_c] state_enter: pre_vote
* [repro_c] state_enter: candidate
* [repro_c] state_enter: leader
# Please wait 5 seconds...
# trigger election
ok
* [repro_a] state_enter: pre_vote
* [repro_a] state_enter: candidate
* [repro_c] state_enter: follower
* [repro_c] state_enter: pre_vote
* [repro_a] state_enter: follower
* [repro_c] state_enter: candidate
* [repro_c] state_enter: leader  # new leader is elected
* [repro_a] state_enter: await_condition
* [repro_a] state_enter: follower

michaelklishin commented 1 month ago

@sile thank you for your ongoing contributions! We will get to this PR in the next couple of weeks.

sile commented 1 month ago

@kjnilsson That sounds reasonable. Thank you for your comment!

I have incorporated the change in commit e6bbd1a. Here are the current results of the repro function:

> repro:run().
# create cluster
* [repro_b] init
* [repro_c] init
* [repro_a] init
* [repro_b] state_enter: recover
* [repro_c] state_enter: recover
* [repro_a] state_enter: recover
* [repro_b] state_enter: recovered
* [repro_c] state_enter: recovered
* [repro_a] state_enter: recovered
* [repro_b] state_enter: follower
* [repro_c] state_enter: follower
* [repro_a] state_enter: follower
* [repro_c] state_enter: pre_vote
* [repro_c] state_enter: candidate
* [repro_c] state_enter: leader
# Please wait 5 seconds...
# trigger election
ok
* [repro_a] state_enter: pre_vote
* [repro_a] state_enter: candidate
* [repro_c] state_enter: follower
* [repro_c] state_enter: pre_vote
* [repro_c] state_enter: candidate
* [repro_a] state_enter: follower  # Unlike before commit e6bbd1a, repro_a becomes a follower after repro_c becomes a candidate.
* [repro_c] state_enter: leader
* [repro_a] state_enter: await_condition
* [repro_a] state_enter: follower

sile commented 1 month ago

Thank you for reviewing and merging this PR!