sile closed this issue 1 month ago
@sile thank you for the detailed report.
Please open a pull request with your patch. If your `repro.erl` code can be turned into a test for this issue, that would be great. Thank you.
@lukebakken Thank you for your response.
I will work on fixing this issue and submit a pull request. Writing a unit test seems challenging, but I will give it a try as well.
By the way, if there is a design document about the `pre_vote` state, please let me know. I am interested in understanding why this new state needed to be introduced into `ra`, what properties or invariants this state should maintain, and any other relevant details. This information would be very helpful as I consider the best approach to address this issue.
Just passing by, @sile, it is explained in §4.2.3 of the original paper.
@illotum I wasn't aware of the paper. Thank you for the information!
Describe the bug
Let me report an issue we encountered while operating our service that uses `ra`.
We were operating a 7-node cluster and stopped 3 of them for maintenance. After stopping the 3 nodes, our service became unavailable due to the absence of the Raft leader. It seemed that leader elections were executed periodically, but a new leader was never elected until we restarted member nodes.
I think that, as shown in the following reproduction steps section, this is a subtle bug relating to the `pre_vote` state (a state original to `ra` that is not defined in the Raft paper), and how to fix it is not immediately obvious. Therefore, I think it would be better to leave the resolution of this issue to the `ra` dev team. However, since this is a critical issue for us, I am willing to create a PR if the `ra` team does not have enough resources to address it.
Reproduction steps
Simplified scenario where this issue could occur
I guess a scenario like the following occurred:
1. `ra` cluster consists of 3 members named `a`, `b`, and `c`
   - `c` is the leader with term `N` and log index `M` (where `N` and `M` are arbitrary integers)
   - `a` and `b` are in `follower` state
2. `a` transitions to `pre_vote` state:
   1. `a` broadcasts `#pre_vote_rpc{ term = N }`
   2. `b` replies `#pre_vote_result{ term = N, vote_granted = true }` to `a`
   3. `a` transitions to `candidate` state with term `N + 1`
   4. `a` broadcasts `#request_vote_rpc{ term = N + 1 }`
3. `c` processes a command:
   1. `c` increases local log index to `M + 1`, and broadcasts `#append_entries_rpc{ term = N }`
   2. `b` increases local log index to `M + 1`, and replies `#append_entries_reply{ term = N, success = true }` to `c`
   3. `a` rejects the RPC as `a` has a greater term than `c` (i.e., the local log index of `a` does not increase here)
4. `c` and `b` receive `#request_vote_rpc{ term = N + 1 }` from `a` (this message was sent during step 2-4):
   1. `c` transitions to `follower` state (as `c` has a smaller term)
   2. `c` replies `#request_vote_result{ vote_granted = false }` as the local log index of `c` is higher than that of `a`
   3. `b` replies `#request_vote_result{ vote_granted = false }` as the local log index of `b` is higher than that of `a`
   4. `a` initiates new elections but is never chosen as the next leader because `a` has a smaller log index
5. `b` is stopped
6. `c` transitions to `pre_vote` state:
   1. `c` broadcasts `#pre_vote_rpc{ term = N_ }`
      - `N_` is an integer larger than `N`
      - `N_` is incremented by `a` each time `a` initiates a new election
   2. `a` ignores `#pre_vote_rpc{ term = N_ }` as `a` is in `candidate` state and `a`'s term is always equal to or larger than `N_`
   3. `c` cannot transition to `candidate` state as it does not receive a majority of votes
7. `c` repeats step 6.
   - `a` remains in `candidate` state (with a shorter log than `c`)
   - `c` alternates between `follower` and `pre_vote` states (with a term equal to or smaller than `a`'s term)

Commands and a patch for reproduction
Please execute the following commands to reproduce the scenario described above. (The reproduction rate is not 100%, but it is high in my environment.)
ra.patch
Expected behavior
A leader should eventually be elected if the majority of members are alive.
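To make the gap between this expectation and the simplified scenario concrete, here is a minimal Python model of steps 6 and 7. It is only a sketch and not `ra` code: the `Node` class, the function names, and the two rules (a vote is granted only when the requester's log is not behind, and a member in `candidate` state ignores pre-vote requests) are assumptions taken from the scenario description, not from the `ra` source.

```python
from dataclasses import dataclass

MAJORITY = 2  # 2 of 3 members (a, b, c); b is stopped and never answers


@dataclass
class Node:
    name: str
    state: str       # "follower", "pre_vote", "candidate", or "leader"
    term: int
    last_index: int


def candidate_timeout(cand: Node, peer: Node) -> None:
    """The candidate starts a new election and asks `peer` for a vote."""
    cand.term += 1
    if peer.term < cand.term:
        # The peer adopts the higher term and steps down to follower.
        peer.term = cand.term
        peer.state = "follower"
    # Assumed rule: a vote is granted only if the candidate's log is not behind.
    granted = cand.last_index >= peer.last_index
    votes = 1 + int(granted)  # own vote plus the peer's vote, if any
    if votes >= MAJORITY:
        cand.state = "leader"


def pre_vote_timeout(node: Node, peer: Node) -> None:
    """A follower enters pre_vote and asks `peer` for a pre-vote."""
    node.state = "pre_vote"
    # Assumed rule (step 6): a peer in candidate state ignores the request.
    replied = peer.state != "candidate"
    granted = replied and node.last_index >= peer.last_index
    votes = 1 + int(granted)
    if votes >= MAJORITY:
        node.state = "candidate"
        node.term += 1
    else:
        node.state = "follower"  # pre-vote failed; wait for the next timeout


def simulate(rounds: int, n: int = 5, m: int = 10):
    """Run `rounds` election timeouts on each of the two live members."""
    a = Node("a", "candidate", term=n + 1, last_index=m)
    c = Node("c", "follower", term=n + 1, last_index=m + 1)
    history = []
    for _ in range(rounds):
        candidate_timeout(a, c)   # a retries its election (step 4-4)
        pre_vote_timeout(c, a)    # c retries its pre-vote (steps 6 and 7)
        history.append((a.state, a.term, c.state, c.term))
    return history


if __name__ == "__main__":
    for row in simulate(3):
        print(row)
```

Under these assumed rules neither member ever reaches `leader`, even though 2 of the 3 members are alive: `a` keeps raising the term but has the shorter log, and `c` has the longer log but can never collect a pre-vote.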
Additional context
No response