goraft / raft

UNMAINTAINED: A Go implementation of the Raft distributed consensus protocol.
MIT License

more safety #167

Closed xiang90 closed 10 years ago

xiang90 commented 10 years ago

The leader must stop all the heartbeat routines before stepping down to ensure safety.
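A minimal sketch of the ordering being described, using hypothetical names (stepDown, per-peer stop channels, a WaitGroup) rather than goraft's actual API: every per-peer heartbeat goroutine is signalled and joined before the term is bumped, so no stale AppendEntries can be sent with the old leadership term.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Hypothetical types for illustration; goraft's real structs differ.
type peer struct{ stop chan struct{} }

type server struct {
	term  uint64
	state string
	peers []*peer
	hbWG  sync.WaitGroup // tracks running heartbeat goroutines
}

func (s *server) startHeartbeat(p *peer) {
	s.hbWG.Add(1)
	go func() {
		defer s.hbWG.Done()
		for {
			select {
			case <-p.stop:
				return
			case <-time.After(50 * time.Millisecond):
				// send AppendEntries carrying s.term here
			}
		}
	}()
}

// stepDown stops every heartbeat goroutine *before* adopting the higher
// term, so no stale AppendEntries with the old leader term can go out
// after the server has already become a follower.
func (s *server) stepDown(newTerm uint64) {
	for _, p := range s.peers {
		close(p.stop) // ask each heartbeat loop to exit
	}
	s.hbWG.Wait() // block until they have all returned

	s.term = newTerm
	s.state = "follower"
}

func main() {
	s := &server{term: 3, state: "leader"}
	p := &peer{stop: make(chan struct{})}
	s.peers = append(s.peers, p)
	s.startHeartbeat(p)
	s.stepDown(4)
	fmt.Println(s.state, s.term) // follower 4
}
```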

xiang90 commented 10 years ago

@benbjohnson @philips Here is my simple bench tool: https://github.com/xiangli-cmu/raft-bench

philips commented 10 years ago

@xiangli-cmu Can you please make the commit message more helpful? It is better to have information in the git history rather than on GitHub.

philips commented 10 years ago

This makes sense to me logically. I don't see what the hack is in it though.

xiang90 commented 10 years ago

@philips The root problem is that the heartbeat can still be running after the server steps down to a follower/candidate and increases its term. The best approach is to stop all the heartbeats before increasing the term.

I am not doing it now since there are some more changes I want to make together.

We have a separate goroutine for each peer's heartbeat. I want to avoid these goroutines and try to keep most of the peers synchronized. That way we might find a way to reuse the encoding results, which would reduce the CPU usage by a factor of the number of followers. It would also make the code simpler.
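A hedged illustration of the encode-once idea (hypothetical names, JSON standing in for goraft's actual wire encoding): when every follower receives an identical payload, e.g. an empty heartbeat, the request can be marshalled a single time and the same bytes handed to each peer.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical request type; not goraft's actual AppendEntries struct.
type appendEntriesRequest struct {
	Term     uint64   `json:"term"`
	LeaderID string   `json:"leaderId"`
	Entries  []string `json:"entries"`
}

// broadcast encodes the request once and reuses the bytes for every
// follower, instead of re-encoding inside each peer's goroutine. With
// N followers this cuts encoding CPU by roughly a factor of N.
func broadcast(peers []string, req appendEntriesRequest, send func(peer string, body []byte)) error {
	encoded, err := json.Marshal(req)
	if err != nil {
		return err
	}
	for _, p := range peers {
		send(p, encoded) // same byte slice for every peer
	}
	return nil
}

func main() {
	req := appendEntriesRequest{Term: 7, LeaderID: "node0"}
	peers := []string{"peer1", "peer2", "peer3"}
	broadcast(peers, req, func(peer string, body []byte) {
		fmt.Printf("-> %s: %d bytes\n", peer, len(body))
	})
}
```

Note this only works when the payload is identical for all peers; once followers diverge (different nextIndex values), each one needs its own entry set.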

xiang90 commented 10 years ago

@philips More docs in the commit message.

benbjohnson commented 10 years ago

:+1:

zenazn commented 10 years ago

This commit introduces a deadlock over the server lock:
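The offending snippet is not preserved in this thread. As a hypothetical illustration of this class of bug: Go's sync.Mutex is not reentrant, so the classic shape of a server-lock deadlock is a method re-acquiring a lock its caller already holds.

```go
package main

import "sync"

// Illustrative only; not the actual deadlock from the commit.
type server struct {
	mu   sync.Mutex
	term uint64
}

func (s *server) Term() uint64 {
	s.mu.Lock() // blocks forever if the caller already holds s.mu
	defer s.mu.Unlock()
	return s.term
}

func (s *server) stepDown() {
	s.mu.Lock()
	defer s.mu.Unlock()
	_ = s.Term() // deadlock: re-acquires s.mu while holding it
}

func main() {
	s := &server{}
	s.stepDown() // fatal error: all goroutines are asleep - deadlock!
}
```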

xiang90 commented 10 years ago

@zenazn Fixing

xiang90 commented 10 years ago

@zenazn I think the dispatcher is holding its own lock. @benbjohnson All the other read locks are held for debugging; shall we just drop them? We added these locks to keep go -race happy.

mreiferson commented 10 years ago

It might be beneficial to take a first pass that focuses entirely on correctness (and simplicity) of the code in terms of memory safety, even if the cost is significant in terms of performance.

In many cases this may mean less granular locking, less work performed in dueling goroutines, etc. I think that's fine to start off with.

xiang90 commented 10 years ago

@mreiferson The locks come from our old codebase, from when the server was not in an event loop. We have largely refactored the codebase to put the server in an event loop. I am going to clean up all the locks soon.

benbjohnson commented 10 years ago

@mreiferson I think it makes sense to go through and clean up a lot of this. If we remove the buffered channel (#160) and move heartbeat initiation to the server then we could essentially have a completely single-threaded implementation. That would make everything a lot easier to reason about and we could probably drop many of the locks.
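A hedged sketch of that single-threaded shape (hypothetical names, not goraft's API): one goroutine owns all server state, commands arrive over a channel, and heartbeats are initiated by a ticker inside the same loop, so no locks are needed.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical command type; real RPCs would carry request/response data.
type command struct {
	name string
	done chan struct{}
}

type server struct {
	state    string
	commands chan command
}

// loop is the only goroutine that touches server state, so no locking
// is required anywhere.
func (s *server) loop() {
	heartbeat := time.NewTicker(50 * time.Millisecond)
	defer heartbeat.Stop()
	for {
		select {
		case c := <-s.commands:
			// RPCs and client commands are serialized through here.
			fmt.Println("apply:", c.name)
			close(c.done)
		case <-heartbeat.C:
			if s.state == "leader" {
				// initiate heartbeats to all peers from this loop
			}
		}
	}
}

func main() {
	s := &server{state: "leader", commands: make(chan command)}
	go s.loop()
	c := command{name: "set x=1", done: make(chan struct{})}
	s.commands <- c
	<-c.done
}
```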

I agree with @xiangli-cmu that a lot of the quirkiness comes from the rewrite from the non-event loop design. But it's time to get it cleaned up now.

I've also been thinking about splitting off the individual states into their own types and having the Server essentially be a wrapper that only holds one state at a time and just handles the handoff between states. It's essentially just a state machine. I haven't quite got the whole design figured out yet though (and I'm not entirely sure it's the best plan either). :)
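A rough sketch of that idea (names are hypothetical and the transitions are hard-coded for the demo): each Raft role is its own type, and the Server just holds the current one and runs the handoff.

```go
package main

import "fmt"

// state is one Raft role; run executes it until it decides to
// transition, returning the next state (or nil to stop).
type state interface {
	run(s *Server) state
}

type follower struct{}
type candidate struct{}
type leader struct{}

func (follower) run(s *Server) state {
	fmt.Println("follower: waiting for heartbeats / election timeout")
	return candidate{} // e.g. election timeout fired
}

func (candidate) run(s *Server) state {
	fmt.Println("candidate: requesting votes")
	return leader{} // e.g. won the election
}

func (leader) run(s *Server) state {
	fmt.Println("leader: sending heartbeats")
	return nil // e.g. saw a higher term; stop for this demo
}

// Server is just a wrapper that owns one state at a time.
type Server struct{ current state }

func (s *Server) loop() {
	for s.current != nil {
		s.current = s.current.run(s)
	}
}

func main() {
	s := &Server{current: follower{}}
	s.loop()
}
```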

xiang90 commented 10 years ago

@benbjohnson I have finished making things single-threaded. I will send you a pull request. But before that, I think I need to send you guys a design doc to clear up all of these things.

mreiferson commented 10 years ago

:+1: to state machine