Open colin-scott opened 9 years ago
Excellent work Colin, looks really good. I need to find the time to go over your discovery!
I was able to minimize this test case further, by decreasing the cluster size.
Repro steps for smaller case:
$ git clone -b raft-shrunk-leader-safety git@github.com:NetSys/sts2-applications.git
$ cd sts2-applications
$ git clone git@github.com:NetSys/sts2-experiments.git experiments
$ sbt assembly
$ java -d64 -Xmx15g -cp target/scala-2.11/randomSearch-assembly-0.1.jar Main 2>&1 | tee console.out
Console output.
Messages delivered:
MsgEvent(deadLetters,raft-member-8,ChangeConfiguration(StableRaftConfiguration(Set(raft-member-3, raft-member-6, raft-member-7, raft-member-8))))
MsgEvent(deadLetters,raft-member-7,ChangeConfiguration(StableRaftConfiguration(Set(raft-member-3, raft-member-6, raft-member-7, raft-member-8))))
MsgEvent(deadLetters,raft-member-3,ChangeConfiguration(StableRaftConfiguration(Set(raft-member-3, raft-member-6, raft-member-7, raft-member-8))))
MsgEvent(raft-member-8,raft-member-8,BasicFingerprint(BeginElection))
MsgEvent(raft-member-7,raft-member-7,BasicFingerprint(BeginElection))
MsgEvent(raft-member-8,raft-member-3,BasicFingerprint((RequestVote,Term(1),raft-member-8,Term(0),0)))
MsgEvent(raft-member-7,raft-member-7,BasicFingerprint(BeginElection))
MsgEvent(raft-member-8,raft-member-8,BasicFingerprint(BeginElection))
MsgEvent(deadLetters,raft-member-6,ChangeConfiguration(StableRaftConfiguration(Set(raft-member-3, raft-member-6, raft-member-7, raft-member-8))))
MsgEvent(raft-member-7,raft-member-3,BasicFingerprint((RequestVote,Term(2),raft-member-7,Term(0),0)))
MsgEvent(raft-member-3,raft-member-8,BasicFingerprint(VoteCandidate(Term(1))))
MsgEvent(raft-member-3,raft-member-3,BasicFingerprint(BeginElection))
MsgEvent(raft-member-8,raft-member-6,BasicFingerprint((RequestVote,Term(1),raft-member-8,Term(0),0)))
MsgEvent(raft-member-6,raft-member-8,BasicFingerprint(VoteCandidate(Term(0))))
MsgEvent(raft-member-3,raft-member-7,BasicFingerprint(VoteCandidate(Term(2))))
MsgEvent(raft-member-7,raft-member-8,BasicFingerprint((RequestVote,Term(2),raft-member-7,Term(0),0)))
At the end of this, raft-member-7 and raft-member-8 are both elected as leader in the same term.
@colin-scott Hi! I tried to run the test case that you provided but it appears that the code is changed and after running
git subtree pull --prefix=interposition interposition master
I get a lot of merge conflicts and I'm unable to run the test. I want to try fixing this issue and probably others too as I'm planning to use this library in my own project.
Hey @dmitraver,
Thanks for pointing out the issue. I corrected the repro steps above; if you start the repro steps from scratch it should work.
(The raft-shrunk-leader-safety
branch already has the correct dependencies, so there shouldn't be a need to pull in the subtree and merge)
Thanks! -Colin
@dmitraver for what it's worth, I think the root cause of this bug is described in this issue:
@colin-scott Thx Colin! I will look into that.
Thanks guys! Still wasn't able to resume work on it, a lot of stuff on Akka itself needs shipping soon so focused my efforts there :)
One note:
I'm planning to use this library in my own project.
Please don't if it's any real-life / production project – this implementation has been proven to not yet be correct so it would be a bad idea to use in a real project. Safety first!
@ktoso Akka is awesome tool! I just want to contribute to this project :) I've seen a lot of issues was opened previously so will try to fix as much as I can and make a PR. This is not intended to be used in production(maybe later) but more for my own akka studies.
Awesome! For learning things on distributed systems and Akka + fixing things this project is ideal I think :-) Looking forward to hear from you then – feel free to open/ask on issues if you have any trouble,
@ktoso Thanks for support :)
Calling this a "test case" is a bit of a misnomer. Think of it as a replayable execution. It replays by trying to deliver the same (equivalent) messages it observed in the original execution. If some of those messages become missing, e.g. after you modified the akka-raft code, it fails to replay properly. Note though that that doesn't necessarily mean that you fixed the bug.
I can show you how I ran the initial fuzz test if that would be easier. Or I can show you how to run the replayer in a mode that doesn't crash whenever it diverges.
On Thu, Jul 23, 2015 at 5:46 AM Dmitry Avershin notifications@github.com wrote:
@colin-scott https://github.com/colin-scott Hi Colin! I did some research recently and found that candidate doesn't recreate ElectionMeta when it receives a BeginElection message and therefore at the beginning of new election it has some votes that were received previously(possibly in the other terms too). I tried to change the following line https://github.com/ktoso/akka-raft/blob/master/src/main/scala/pl/project13/scala/akka/raft/Candidate.scala#L31 to
m.forNewElection.incVote
which should solve the issue but then your test starts failing with
Exception in thread "main" akka.dispatch.verification.ReplayException: Expected event (raft-member-9,raft-member-3,BasicFingerprint((RequestVote,Term(2),raft-member-9,Term(0),0))) at akka.dispatch.verification.ReplayScheduler.replay(ReplayScheduler.scala:124) at akka.dispatch.verification.RunnerUtils$.replayExperiment(RunnerUtils.scala:139)
I tried to find what causes this issue but stuck. Could you give me some insight what is going wrong?
— Reply to this email directly or view it on GitHub https://github.com/ktoso/akka-raft/issues/47#issuecomment-124083867.
I think I fixed that issue, now your test shows no errors and passes successfully. The issue was with stale term number in VoteCandidate message. I will make a PR soon. Thanks for your support. Your tests were really helpful :)
2015-07-23 20:35 GMT+02:00 Colin Scott notifications@github.com:
Calling this a "test case" is a bit of a misnomer. Think of it as a replayable execution. It replays by trying to deliver the same (equivalent) messages it observed in the original execution. If some of those messages become missing, e.g. after you modified the akka-raft code, it fails to replay properly. Note though that that doesn't necessarily mean that you fixed the bug.
I can show you how I ran the initial fuzz test if that would be easier.
On Thu, Jul 23, 2015 at 5:46 AM Dmitry Avershin notifications@github.com wrote:
@colin-scott https://github.com/colin-scott Hi Colin! I did some research recently and found that candidate doesn't recreate ElectionMeta when it receives a BeginElection message and therefore at the beginning of new election it has some votes that were received previously(possibly in the other terms too). I tried to change the following line < https://github.com/ktoso/akka-raft/blob/master/src/main/scala/pl/project13/scala/akka/raft/Candidate.scala#L31
to
m.forNewElection.incVote
which should solve the issue but then your test starts failing with
Exception in thread "main" akka.dispatch.verification.ReplayException: Expected event (raft-member-9,raft-member-3,BasicFingerprint((RequestVote,Term(2),raft-member-9,Term(0),0))) at akka.dispatch.verification.ReplayScheduler.replay(ReplayScheduler.scala:124) at akka.dispatch.verification.RunnerUtils$.replayExperiment(RunnerUtils.scala:139)
I tried to find what causes this issue but stuck. Could you give me some insight what is going wrong?
— Reply to this email directly or view it on GitHub https://github.com/ktoso/akka-raft/issues/47#issuecomment-124083867.
— Reply to this email directly or view it on GitHub https://github.com/ktoso/akka-raft/issues/47#issuecomment-124202917.
Best wishes, Dmitry Avershin.
Hello,
I have a test case that, as far as I can tell, causes akka-raft to violate Raft's "Election Safety" property (see Figure 3 from the paper), i.e. it appears that two leaders are elected for the same term.
The test case consists of the following external events:
ChangeConfiguration
messages containing ActorRefs for all 9 RaftActors. raft-member-8 does not receive aChangeConfiguration
message.Upon running the test, I see the following in the console output:
For what it's worth, rather than inspecting the console output to detect this bug, we took a distributed snapshot of all RaftActor's states and found that in the same snapshot raft-member-9 is in state
LeaderMeta(Actor[akka://new-system-0/user/raft-member-9],Term(2))
while raft-member-7 is in stateLeaderMeta(Actor[akka://new-system-0/user/raft-member-7],Term(2))
.In our failing execution, we have the akka runtime deliver 27 total messages, including the 8
ChangeConfiguration
messages. The delivery order is as follows (format is sender,receiver,message):Based on that delivery order, it appears that raft-member-9 receives votes from raft-member-{1,6,3,2,9}, and raft-member-7 receives votes from raft-member-{1,5,4,7}. A few things are strange about this: the votes received by raft-member-9 appear to be from different Terms; and raft-member-7 does not actually receive a quorum of votes (regardless of Term). I'm not exactly sure what the root cause is here.
We made akka's message scheduler deterministic so that you can easily reproduce the bug for yourself. Steps to reproduce:
From there you should be able to add logging statements and continue replaying as many times as needed.
We made a few small changes to akka-raft to generate this test case:
electionDeadline.isOverdue()
. This is again to make the execution deterministic (sinceelectionDeadline.isOverdue()
callsgettimeofday
), but that shouldn't affect correctness as far as I can tell.Let me know if you have any questions about how we ran this test.
Thanks! -Colin