goraft / raft

UNMAINTAINED: A Go implementation of the Raft distributed consensus protocol.
MIT License
2.43k stars 479 forks source link

fix(server.go) fix potential deadlock when removing nodes #195

Closed xiang90 closed 10 years ago

xiang90 commented 10 years ago

When server entries removePeer, it is holding log write lock. peer.stopHeartbeat might also need to acquire log read lock to finish. We need to make peer.stopHeartbeat non-blocking to fix the deadlock. Also actually there is no need to wait for the peer go-routine to stop.

xiang90 commented 10 years ago

@unihorn Can you test this against etcd? Thanks.

xiang90 commented 10 years ago

@benbjohnson Our addPeer and removePeer are not safe if the leader dies before all the nodes commit the command. But I am not worrying about this right now.

yichengq commented 10 years ago

@xiangli-cmu I cherry-pick it into etcd, and etcd passes all its tests.

xiang90 commented 10 years ago

@unihorn Fix the race in remove test?

yichengq commented 10 years ago

@xiangli-cmu Yes, I think so. It is great!! :)

benbjohnson commented 10 years ago

@xiangli-cmu lgtm

philips commented 10 years ago

lgtm too. We need to document/refactor these locks.

xiang90 commented 10 years ago

@philips I was trying to refactor the heartbeat and log lock. But the result was not that great. I am planning to do it more carefully next week.

philips commented 10 years ago

@xiangli-cmu Sounds good. Thanks for taking it on.