ktoso / akka-raft

A toy project implementing RAFT on top of Akka Cluster (not prod ready)
http://blog.project13.pl
Apache License 2.0
280 stars 42 forks source link

Any exception can cause two leaders to become elected in the same Term #62

Closed colin-scott closed 8 years ago

colin-scott commented 9 years ago

The Raft protocol assumes a crash-recovery failure model -- that is, it allows for the possibility that crashed nodes will come back (with non-volatile state intact).

Currently, akka-raft does not write anything to disk. That's actually fine -- it just means that akka-raft currently assumes a crash-stop failure model, where crashed nodes are never allowed to come back.

However, akka-raft should enforce crash-stop! It currently does not, and this can lead to problems.

More specifically:

By default, akka's supervision strategy restarts any actor that throws an exception. When it restarts, all its state is reinitialized.

In the Raft paper, the votedFor variable is listed as persistent (non-volatile) state on each server. If votedFor is forgotten, a node may end up voting for two different candidates in the same term.

If RaftActor throws an exception for any reason, it may end up in the situation where it eventually votes for two different candidates in the same, possibly causing two different leaders to be elected.

I see two solutions:

I discovered this bug because one of my recent fixes accidentally triggered an exception -- and later in my fuzz run I saw that two leaders became elected in the same term.

ktoso commented 9 years ago

Good catch (as always), initially I was planning to go for the persistent route.

The persistence layer could be done via akka-persistence, here's an old ticket for it https://github.com/ktoso/akka-raft/issues/7

dmitraver commented 9 years ago

I'm working on it. Also I think it makes sense to update the libraries to the current akka version, at least the persistence module is no longer marked as experimental in 2.4.

ktoso commented 9 years ago

Thanks @dmitraver! Yeah, it makes sense to upgrade to 2.4, not only because of that, but also because a lot of cluster things have improved stability now.

dmitraver commented 8 years ago

@ktoso I actually implemented the desired behaviour that ensures that non-volatile state such as votedFor and currentTerm is persisted. I used persistent FSM for that. However, it looks like TestFSMRef class that is used almost everywhere in tests doesn't support the FSM persistent actors cause I'm getting the following error: Error:(18, 29) Cannot prove that pl.project13.scala.akka.raft.SnapshottingWordConcatRaftActor with pl.project13.scala.akka.raft.EventStreamAllMessages <:< akka.actor.FSM[S,D]. val candidate = TestFSMRef(new SnapshottingWordConcatRaftActor with EventStreamAllMessages) I checked the apply method of the class and found that its signature requires having a FSM trait def apply[S, D, T <: Actor: ClassTag](factory: ⇒ T)(implicit ev: T <:< FSM[S, D], system: ActorSystem) which won't work with PersistentFSMBase trait that is used in persistence fsm.

I'm not sure whether its possible to overcome this without changing the TestFSMRef internals. I can actually make a PR with this change but I'm not sure that its a good idea without having working tests.

ktoso commented 8 years ago

Sounds great, thanks for your work. I think the project should move away from test actor ref in any case by the way, so that would be another reason to do so.

Konrad Malawski

From: Dmitry Avershin notifications@github.com notifications@github.com Reply: ktoso/akka-raft reply@reply.github.com reply@reply.github.com Date: 7 May 2016 at 17:11:35 To: ktoso/akka-raft akka-raft@noreply.github.com akka-raft@noreply.github.com CC: Konrad Malawski konrad.malawski@project13.pl konrad.malawski@project13.pl, Mention mention@noreply.github.com mention@noreply.github.com Subject: Re: [ktoso/akka-raft] Any exception can cause two leaders to become elected in the same Term (#62)

@ktoso https://github.com/ktoso

I actually implemented the desired behaviour that ensures that non-volatile state such as votedFor and currentTerm is persisted. I used persistent FSM for that. However, it looks like TestFSMRef class that is used almost everywhere in tests doesn't support the FSM persistent actors cause I'm getting the following error:

Error:(18, 29) Cannot prove that pl.project13.scala.akka.raft.SnapshottingWordConcatRaftActor with pl.project13.scala.akka.raft.EventStreamAllMessages <:< akka.actor.FSM[S,D]. val candidate = TestFSMRef(new SnapshottingWordConcatRaftActor with EventStreamAllMessages)

I checked the apply method of the class and found that its signature requires having a FSM trait def apply[S, D, T <: Actor: ClassTag](factory: ⇒ T)(implicit ev: T <:< FSM[S, D], system: ActorSystem) which won't work with PersistentFSMBase trait that is used in persistence fsm.

I'm not sure whether its possible to overcome this without changing the TestFSMRef internals. I can actually make a PR with this change but I'm not sure that its a good idea without having working tests.

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/ktoso/akka-raft/issues/62#issuecomment-217643371

dmitraver commented 8 years ago

Sounds good to me but what are other alternatives? I haven't had a lot of experience with akka testing so my knowledge is a bit limited here :) I can fix the tests and make a PR request once its ready .

ktoso commented 8 years ago

Just normal messaging using actors and test probes. I'll likely lose internet soon now (airborne), but will try to carve out some time to give you hints if more needed.

Thanks for hakking on the project :)

Konrad Malawski

From: Dmitry Avershin notifications@github.com notifications@github.com Reply: ktoso/akka-raft reply@reply.github.com reply@reply.github.com Date: 7 May 2016 at 17:18:53 To: ktoso/akka-raft akka-raft@noreply.github.com akka-raft@noreply.github.com CC: Konrad Malawski konrad.malawski@project13.pl konrad.malawski@project13.pl, Mention mention@noreply.github.com mention@noreply.github.com Subject: Re: [ktoso/akka-raft] Any exception can cause two leaders to become elected in the same Term (#62)

Sounds good to me but what are other alternatives? I haven't had a lot of

experience with akka testing so my knowledge is a bit limited here :) I can fix the tests and make a PR request once its ready .

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/ktoso/akka-raft/issues/62#issuecomment-217643712

dmitraver commented 8 years ago

Thanks, its real fun :) I will try to fix the tests then and hopefully make a PR soon.

dmitraver commented 8 years ago

I investigated the test logic a bit more and found that actually the whole test suit relies on TestActorRef and TestFsmRef. There are three "unit" tests that use TestFsmRef - Leader/Follower/Candidate Tests. Other "integration" tests use TestActorRef/TestFsmRef indirectly in RaftSpec. The whole test suite and almost all of the tests should be rewritten to use plain actors and I think the only way to simulate the old test behavior is to subscribe to certain messages and then wait for them to be received. The hardest part in my opinion is to rewrite "unit" tests due to non deterministic behavior of integration testing.