uwplse / verdi-raft

An implementation of the Raft distributed consensus protocol, verified in Coq using the Verdi framework
BSD 2-Clause "Simplified" License

Node in singleton cluster never becomes leader #40

Open tbg opened 7 years ago

tbg commented 7 years ago

I'm trying to run the benchmarks against a single-node system:

$ ./vard.native -dbpath /tmp/vard-8000 -port 8000 -me 0 -node 0,127.0.0.1:9000 -debug
unordered shim running setup for VarD
unordered shim ready for action
client 115512982 connected on 127.0.0.1:49446
client 115512982 disconnected: client closed socket

The client logged above is the following invocation:

python2 bench/setup.py --service vard --keys 50 --cluster 127.0.0.1:8000
Traceback (most recent call last):
  File "bench/setup.py", line 34, in <module>
    main()
  File "bench/setup.py", line 27, in main
    host, port = Client.find_leader(args.cluster)
  File "/Users/tschottdorf/tla/verdi-raft/extraction/vard/bench/vard.py", line 27, in find_leader
    raise cls.NoLeader
vard.NoLeader

I haven't dug deeper but I did verify that I can run the benchmarks against a three-node cluster (everything running on the same machine). So, perhaps I'm silly or there is a problem with the edge case of a single-node system.

palmskog commented 7 years ago

I'm pretty sure this is a liveness bug (and thus an issue outside the scope of election safety, which is guaranteed). What happens is that the singleton node never manages to elect itself leader - it waits forever for a RequestVoteReply message.

The tryToBecomeLeader function in raft/Raft.v is called when a timeout occurs. However, tryToBecomeLeader does not immediately check whether the candidate wins the vote. This is only done once a RequestVoteReply message is received, using a call to wonElection.

The original Go implementation of Raft uses a general loop for the Candidate state that first sends all necessary RequestVote messages and then immediately checks whether it has enough votes (and becomes leader if possible). The bug could be fixed by adding a similar check to tryToBecomeLeader, but I'm not sure how much that would mess with the proofs. Arguably, there is no point in running Raft in a singleton node cluster anyway - it's enough to run a system that directly uses the underlying state machine (VarD).
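For intuition, here is a small Python sketch of why the immediate check matters. The function names mirror the Coq identifiers mentioned above (wonElection, tryToBecomeLeader), but the code itself is a hypothetical illustration, not the extracted implementation: in a singleton cluster the candidate's self-vote is already a strict majority, so a node that only checks for victory upon receiving a RequestVoteReply waits forever.

```python
def won_election(votes_received, cluster_size):
    """True when votes_received is a strict majority of the cluster."""
    return 2 * len(votes_received) > cluster_size

def try_to_become_leader(me, cluster_size, check_immediately):
    votes = {me}  # a candidate always votes for itself
    if check_immediately and won_election(votes, cluster_size):
        return "Leader"    # singleton cluster: the self-vote already wins
    return "Candidate"     # otherwise, wait for RequestVoteReply messages

# Without the immediate check, a singleton node stays Candidate forever:
print(try_to_become_leader(0, cluster_size=1, check_immediately=False))  # Candidate
print(try_to_become_leader(0, cluster_size=1, check_immediately=True))   # Leader
print(try_to_become_leader(0, cluster_size=3, check_immediately=True))   # Candidate
```

In a three-node cluster the immediate check is harmless (one vote is not a majority), which is why the bug only surfaces in the single-node case.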

tbg commented 7 years ago

@palmskog thanks, that sounds like the problem. Seeing that the code isn't going to be used in production any time soon (ever?), I agree that it's probably not worth fixing unless it's easy. But I think it's worth pointing out the limitation in the README so that others don't run into it (running the most trivial case is a natural first thing to try).

As a hack, I tried making the node send a vote request to itself instead of recording its self-vote internally, but that broke a proof:

File "./raft/RefinementSpecLemmas.v", line 617, characters 2-6:
Error: Attempt to save an incomplete proof (in proof update_elections_data_timeout_votesWithLog_votesReceived)

I then tried something even hackier: let the node record its self-vote immediately but not record that it had voted, so that it would still message itself and trigger the receiver. That failed similarly:

File "./raft-proofs/CandidatesVoteForSelvesProof.v", line 46, characters 2-6:
Error: Attempt to save an incomplete proof (in proof candidates_vote_for_selves_timeout)

At that point I gave up (but haven't tried your suggestion).

For the hypothetical case of this being production-ready code, it would really have to work though. Systems built on top of the database can't run certain smoke or acceptance tests without a running cluster, and one shouldn't have to orchestrate multiple nodes just to get something to talk to.

palmskog commented 7 years ago

@tschottdorf I can't speak for the original Verdi team, but production-ready code is generally not a goal in academic projects, nor encouraged to any significant degree by stakeholders (funding agencies, thesis committees, etc.). With that said, Verdi Raft has accumulated more software engineering effort than most similar projects (possibly excluding Microsoft's IronFleet), for example (1) extensive use of continuous integration, (2) automated dependency management via OPAM, (3) unit tests and integration tests for unverified code, (4) cluster deployment and management automation via Capistrano (deploy branch).

My feeling is that changing tryToBecomeLeader to directly set type := Leader (and properly updating electoralVictories, etc.) when winning an election "by default" is the way to go. Note that there is nothing to stop you from changing the handler code in raft/Raft.v and then extracting OCaml code without checking proofs - either run make vard-quick in the root directory, or just make in extraction/vard.

In the meantime, I created a tentative branch where singleton clusters are prevented from starting up at the command-line level.

tbg commented 7 years ago

@palmskog I agree, that's why I pointed it out in the first place - wouldn't have bothered if this didn't look like it'd been pushed fairly far out of the bounds of pure academia already

I'll close this issue, but I'll keep your above suggestion for a proper fix in mind. Could be a good exercise unless it spirals out of control.

palmskog commented 7 years ago

I'm going to re-open this issue since the current solution is unsatisfactory in the long term. Nobody has any cycles to fix this properly right now, but we might get around to it in the future and will happily review any PRs with proposed solutions. Thanks to @tschottdorf for reporting this.