Closed xiang90 closed 10 years ago
@benbjohnson @philips Here is my simple bench tool: https://github.com/xiangli-cmu/raft-bench
@xiangli-cmu Can you please make the commit message more helpful? It is better to have information in the git history rather than on github.
This makes sense to me logically. I don't see what the hack is in it though.
@philips The root problem is that the heartbeat can still be running after the server steps-down to a follower/candidate and increases its term. The best approach is to stop all the heartbeat before increase its term.
I do not do it now since there are some more changes I want to make together.
We have separate go-routine for each peer's heartbeat. I want to avoid theses go-routines and try to make most of the peer synchronized. Thus, we might find a way to reuse the encoding results, which reduce the CPU usage by a factor of number of followers. And this can also make the codes simpler.
@philips More docs in the commit message.
:+1:
This commit introduces a deadlock over the server lock:
flush()
data to Node2.flush()
to Node2, knowing nothing of the above, attempts to send an AppendEntries request. It calls p.server.Transporter()
, which attempts to get a read lock on the server. (there are many other calls, like ones to DispatchEvent()
and Term()
, which similarly grab read locks)flush()
goroutine nor the setCurrentTerm()
goroutine can proceed.@zenazn Fixing
@zenazn I think dispatcher is holding its own lock. @benbjohnson All the other read locks are holding for debugging, shall we just drop them? We added theses locks to let go -race happy.
It might be beneficial to take a first pass that focusses entirely on correctness (and simplicity) of the code in terms of memory safety, even if the cost is significant in terms of performance.
In many cases this may mean less granular locking, less work performed in dueling goroutines, etc. I think that's fine to start off with.
@mreiferson The locks come from our old codebase, when the server is not in a event loop. We largely refactor the codebase to make server in a event loop. I am going to clean up all locks soon.
@mreiferson I think it makes sense to go through and clean up a lot of this. If we remove the buffered channel (#160) and move heartbeating initiation to the server then we could essentially have a completely single threaded implementation. That would make everything a lot easier to reason about and we could probably drop many of the locks.
I agree with @xiangli-cmu that a lot of the quirkiness comes from the rewrite from the non-event loop design. But it's time to get it cleaned up now.
I've also been thinking about splitting off the individual states into their own types and have the Server
essentially be a wrapper that only holds one state at a time and just handles the handoff between states. It's essentially just a state machine. I haven't quite got the whole design figured out yet though (and I'm not entirely sure it's the best plan either). :)
@benbjohnson I have finished making things in a single thread. I will send you a pull request. But before this, I think I need to send you guys a design doc to clear all these sort of thing.
:+1: to state machine
Leader must stop all the heartbeat routines before step-down to ensure safety.