There is a possibility that the leader is never elected even if the majority of the members are alive

sile commented 1 month ago

Describe the bug

Let me report an issue we encountered while operating our service that uses ra.

We were operating a 7-node cluster and stopped 3 of them for maintenance. After stopping the 3 nodes, our service became unavailable due to the absence of the Raft leader. It seemed that leader elections were executed periodically, but a new leader was never elected until we restarted member nodes.

I think that, as shown in the following reproduction steps section, this is a subtle bug relating to the pre_vate state (ra original state which is not defined in the Raft paper), and how to fix this is not immediately obvious. Therefore, I think it would be better to leave the resolution of this issue to the ra dev team. However, since this is a critical issue for us, I am willing to create a PR if ra team does not have enough resources to address this issue.

Reproduction steps

Simplified scenario where this issue could occur

I guess a scenario like the following occurred:

There is a ra cluster consists of 3 members named a, b, and c
- c is the leader with term N and log index M (where N and M are arbitrary integers)
- a and b are in follower state
For some reason, a transitions to pre_vote state:
1. a broadcasts #pre_vote_rpc{ term = N }
2. b replies #pre_vote_result{ term = N, vote_granted = true } to a
3. a transitions to candidate state with term N + 1
4. a broadcasts #request_vote_rpc{ term = N + 1 }
c processes a command:
1. c increases local log index to M + 1, and broadcasts #append_entries_rpc{ term = N }
2. b increases local log index to M + 1, and replies #append_entries_reply{ term = N, success = true } to c
3. a rejects the RPC as a has a greater term than c (i.e., the local log index of c does not increase here)
c and b receive #request_vote_rpc{ term = N + 1 } from a (this message was sent during step 2-4):
1. c transitions to follower state (as c has an smaller term)
2. c replies #request_vote_result{ vote_granded = false } as the local log index of c is higher than a
3. b replies #request_vote_result{ vote_granded = false } as the local log index of b is higher than a
  - => Repeatedly, a initiates new elections but is never chosen as the next leader because a has a smaller log index
For some reason, b is stopped
By election timeout, c transitions to pre_vote state:
1. c broadcasts #pre_vote_rpc{ term = N_ }
  - Where N_ is an integer larger than N
  - N_ is incremented by a each time a initiates a new election
2. a ignores #pre_vote_rpc{ term = N_ } as a is in candidate state and a's term is always equal to or larger than N_
3. c cannot transition to candidate state as there are not majority votes
4. After the election timeout period has elapsed, c repeats step 6.
There is no leader until humans take action (e.g., node restart)
- a remains in candidate state (with a shorter log index than c)
- c alternates between follower and pre_vote states (with a term equal to or smaller than a's term)

Commands and a patch for reproduction

Please execute the following commands to reproduce the scenario described above. (The reproduction rate is not 100%, but it is high in my environment.)

// Clone ra and checkut v2.10.1
$ git clone https://github.com/rabbitmq/ra.git
$ cd ra/
$ git checkout v2.10.1

// Apply the patch shown below
//
// [NOTE]
// This patch modifies the ra code, but only adjusts the execution and communication timing
// to make it easier to reproduce the issue.
// For example, it introduces a communication delay between two members.
$ git apply /path/to/ra.patch

// Start Erlang shell
$ rebar3 shell --sname foo@localhost

// Run a function to reproduce the above scenario
//
// [NOTE]
// The member names are `repro_a`, `repro_b`, and `repro_c`.
// They are respectively associated with `a`, `b`, and `c` in the scenario.
(foo@localhost)1> repro:run().
# create cluster
* [repro_c] init
* [repro_b] init
* [repro_a] init
* [repro_c] state_enter: recover
* [repro_a] state_enter: recover
* [repro_b] state_enter: recover
* [repro_c] state_enter: recovered
* [repro_a] state_enter: recovered
* [repro_b] state_enter: recovered
* [repro_c] state_enter: follower
* [repro_a] state_enter: follower
* [repro_b] state_enter: follower
* [repro_c] state_enter: pre_vote
* [repro_c] state_enter: candidate
* [repro_c] state_enter: leader
# Please wait 5 seconds...
# trigger election
ok
* [repro_a] state_enter: pre_vote
* [repro_a] state_enter: candidate
* [repro_c] state_enter: follower
* [repro_c] state_enter: pre_vote
* [repro_c] state_enter: follower
* [repro_c] state_enter: pre_vote
* [repro_c] state_enter: follower
... repeat forever ...

ra.patch

diff --git a/src/ra_server.erl b/src/ra_server.erl
index 7fb5931..1d0fd57 100644
--- a/src/ra_server.erl
+++ b/src/ra_server.erl
@@ -984,7 +984,7 @@ handle_pre_vote(#pre_vote_result{term = Term, vote_granted = true,
                                  token = Token},
                 #{current_term := Term,
                   votes := Votes,
-                  cfg := #cfg{log_id = LogId},
+                  cfg := #cfg{id = Id, log_id = LogId},
                   pre_vote_token := Token,
                   cluster := Nodes} = State0) ->
     ?DEBUG("~ts: pre_vote granted ~w for term ~b votes ~b",
@@ -993,6 +993,24 @@ handle_pre_vote(#pre_vote_result{term = Term, vote_granted = true,
     State = update_term(Term, State0),
     case required_quorum(Nodes) of
         NewVotes ->
+            case Id of
+                {repro_a, _} ->
+                    %% Ensure that the log lengths of `repro_c` and `repro_b` are greater than that of `repro_a`
+                    %% to prevent `repro_a` from becoming the new leader.
+                    %% (NOTE: `#append_entries_rpc{}` from `repro_c` to `repro_a` is delayed.)
+                    {ok, ok, _} = ra:process_command({repro_c, node()}, hello),
+
+                    %% NOTE:
+                    %% `repro_c` will transition from leader to follower at some point later
+                    %% because `repro_a` will become `candidate` and increment the term.
+
+                    %% Stop `repro_b`.
+                    %% This ensure that it's mandatory for `repro_c` to gain a vote from
+                    %% `repro_a` to be re-elected as the leader.
+                    ok = ra:stop_server(default, {repro_b, node()});
+                _ ->
+                    ok
+            end,
             call_for_election(candidate, State);
         _ ->
             {pre_vote, State#{votes => NewVotes}, []}
diff --git a/src/ra_server_proc.erl b/src/ra_server_proc.erl
index 789d3cf..75394fc 100644
--- a/src/ra_server_proc.erl
+++ b/src/ra_server_proc.erl
@@ -342,11 +342,31 @@ do_init(#{id := Id,
                    low_priority_commands = ra_ets_queue:new(),
                    server_state = ServerState},
     ok = net_kernel:monitor_nodes(true, [nodedown_reason]),
+    put(delayed_sender, spawn_link(fun() -> delayed_send(queue:new()) end)),
     State.

 %% callback mode
 callback_mode() -> [state_functions, state_enter].

+delayed_send(Queue0) ->
+    Now = erlang:monotonic_time(millisecond),
+    Timeout =
+        case queue:peek(Queue0) of
+            empty ->
+                infinity;
+            {value, {SendTime, _, _}} ->
+                max(0, SendTime - Now)
+        end,
+    receive
+        {send, Delay, To, Msg} ->
+            Queue1 = queue:in({Now + Delay, To, Msg}, Queue0),
+            delayed_send(Queue1)
+    after Timeout ->
+            {{value, {_, To, Msg}}, Queue1} = queue:out(Queue0),
+            To ! Msg,
+            delayed_send(Queue1)
+    end.
+
 %%%===================================================================
 %%% State functions
 %%%===================================================================
@@ -1630,6 +1650,20 @@ reject_command(Pid, Corr, #state{leader_monitor = _Mon} = State) ->
 maybe_persist_last_applied(#state{server_state = NS} = State) ->
      State#state{server_state = ra_server:persist_last_applied(NS)}.

+send({repro_a, _} = To, Msg, _Conf) ->
+    %% Add a delay when sending messages to `repro_a` server.
+    %%
+    %% Without this delay, the leader will prevent `repro_a` from transitioning to `candidate` state
+    %% by promptly sending an empty `#append_entries_rpc{}` upon receiving `#pre_vote_rpc{}` from `repro_a`.
+    Delay = 
+        case get(ra_state) of
+            leader ->
+                1000;
+            _ ->
+                0
+        end,
+    get(delayed_sender) ! {send, Delay, To, Msg},
+    ok;
 send(To, Msg, Conf) ->
     % we do not want to block the ra server whilst attempting to set up
     % a TCP connection to a potentially down node or when the distribution
diff --git a/src/repro.erl b/src/repro.erl
new file mode 100644
index 0000000..b1bda56
--- /dev/null
+++ b/src/repro.erl
@@ -0,0 +1,51 @@
+-module(repro).
+
+-behaviour(ra_machine).
+
+-export([run/0]).
+
+-export([init/1, apply/3, state_enter/2]).
+
+
+run() ->
+    _ = file:del_dir_r("foo@localhost"),
+    ok = ra:start(),
+
+    %% Create a cluster with 3 members.
+    io:format("# create cluster~n"),
+    Module = ?MODULE,
+    Machine = {module, ?MODULE, #{}},
+    Node = node(),
+    ServerIds = [{repro_a, Node}, {repro_b, Node}, {repro_c, Node}],
+    {ok, _ServersStarted, []} = ra:start_cluster(default, Module, Machine, ServerIds),
+
+    io:format("# Please wait 5 seconds...~n"),
+    ok = timer:sleep(5000),
+
+    %% Assumes repro_c is the leader.
+    {repro_c, Node} = maps:get(leader_id, element(2, ra:member_overview(repro_a))),
+
+    %% Trigger an election that will cause the problem described in this issue.
+    io:format("# trigger election~n"),
+    ok = ra:trigger_election({repro_a, node()}),
+
+    ok.
+
+
+init(_) ->
+    io:format("* [~p] init~n", [name()]),
+    #{}.
+
+
+apply(_Metadata, _Command, State) ->
+    {State, ok}.
+
+
+state_enter(RaState, _State) ->
+    io:format("* [~p] state_enter: ~p~n", [name(), RaState]),
+    put(ra_state, RaState),
+    [].
+
+
+name() ->
+    element(2, erlang:process_info(self(), registered_name)).

Expected behavior

A leader should eventually be elected if the majority of members are alive.

Additional context

No response

lukebakken commented 1 month ago

@sile thank you for the detailed report.

Please open a pull request with your patch. If your repro.erl code can be turned into a test for this issue, that would be great. Thank you.

sile commented 1 month ago

@lukebakken Thank you for your response.

I will work on fixing this issue and submit a pull request. Writing a unit test seems challenging, but I will give it a try as well.

By the way, if there is a design document about the pre_vote state, please let me know. I am interested in understanding why this new state needed to be introduced into ra, what properties or invariants this state should maintain, and any other relevant details. This information would be very helpful as I consider the best approach to address this issue.

illotum commented 1 month ago

Just passing by, @sile, it is explained in §4.2.3 of the original paper.

sile commented 1 month ago

@illotum I wasn't aware of the paper. Thank you for the information!

rabbitmq / ra