schuster opened 9 years ago
Relatedly, beginElection sets a timeout, but Candidate does not listen for StateTimeout messages, so the timeout would never be caught (if I'm reading things correctly). Even if that is fixed, the timeout would still need to be reset on a Candidate -> Candidate transition in case multiple elections fail.
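The reset described above can be sketched roughly as follows. This is plain Python rather than the project's actor code, and Candidate, begin_election, on_timeout and timer_generation are hypothetical stand-ins. Each election bumps a generation counter, and a timeout only counts if it carries the current generation. A failed election then transitions Candidate -> Candidate, which re-arms the timer so that later failed elections are also caught.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    """Toy model of a Raft candidate's election timer (hypothetical names).

    A real actor would schedule a timeout message carrying timer_generation;
    here the caller delivers it by calling on_timeout.
    """
    node_id: str
    term: int = 0
    timer_generation: int = 0  # bumped each time the election timer is re-armed
    votes: set = field(default_factory=set)

    def begin_election(self) -> int:
        """Start an election: move to a new term, vote for ourselves, re-arm the timer."""
        self.term += 1
        self.votes = {self.node_id}
        self.timer_generation += 1
        return self.timer_generation

    def on_timeout(self, generation: int) -> bool:
        """Handle an election-timeout message; return True if a new election started."""
        if generation != self.timer_generation:
            # Stale timeout armed by an earlier election: ignore it.
            return False
        # The current election failed: Candidate -> Candidate, which re-arms the timer.
        self.begin_election()
        return True


c = Candidate("n1")
first = c.begin_election()  # term 1, timer generation 1
c.on_timeout(first)         # election failed: term 2, timer re-armed (generation 2)
c.on_timeout(first)         # stale timeout from term 1 is ignored
print(c.term, c.timer_generation)  # prints "2 2"
```

Tagging each timeout with a generation is just one way to make the reset explicit. An actor framework's named or per-state timers may give the same effect, but that depends on how the timer is scheduled in the actual code.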
I think Candidate did previously catch timeouts correctly [here], but you're right that it did not reset them properly on the Candidate -> Candidate transition after multiple elections fail.
For what it's worth, I have a (non-pull-request-worthy) fix for this issue here:
https://github.com/NetSys/sts2-applications/commit/55e6077e6c9294b797e597c6701ff2f0e8907f6c
In the unlikely event that no candidate receives a majority of votes for a given term, the candidates should time out and start a new election. Currently there is no timeout, so eventually all servers in the cluster would transition to the Candidate state for the same term and get stuck.
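Starting a new election only helps if the candidates do not all time out together again. Raft handles this by choosing each election timeout at random from a fixed interval (the Raft paper gives 150-300 ms as an example). The toy simulation below contrasts randomized timeouts with identical ones. The function names are hypothetical, and the rtt_ms threshold is a simplifying assumption standing in for "this candidate had time to collect votes before anyone else started competing". It is not code from this repository.

```python
import random


def election_order(servers, rng, low_ms, high_ms):
    """Draw a fresh election timeout for each server; earliest first."""
    timeouts = {s: rng.uniform(low_ms, high_ms) for s in servers}
    return sorted(timeouts.items(), key=lambda kv: kv[1])


def rounds_until_single_candidate(servers, rng, low_ms=150.0, high_ms=300.0,
                                  rtt_ms=10.0, max_rounds=100):
    """Repeat failed elections until one server times out alone.

    Simplifying assumption: a candidate whose timeout fires at least rtt_ms
    before the next server's can collect its votes before anyone competes.
    Returns (first_candidate, rounds), or (None, max_rounds) if every round split.
    """
    for round_no in range(1, max_rounds + 1):
        order = election_order(servers, rng, low_ms, high_ms)
        (first, t_first), (_, t_second) = order[0], order[1]
        if t_second - t_first >= rtt_ms:
            return first, round_no
    return None, max_rounds


servers = ["s1", "s2", "s3"]
# Randomized timeouts: a lone candidate usually emerges within a few rounds.
print(rounds_until_single_candidate(servers, random.Random(1)))
# Identical timeouts: every round starts all candidates at once, so it never resolves.
print(rounds_until_single_candidate(servers, random.Random(1),
                                    low_ms=150.0, high_ms=150.0))  # prints "(None, 100)"
```

With identical timeouts the gap between the first two candidates is always zero, which mirrors the stuck state described above. Randomization makes a clear first candidate likely in each new term.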