Open benbuzbee opened 1 year ago
Hi @benbuzbee,
I'm not persuaded this is something that ought to be handled in the raft library itself. Moreover, the log you cite ("error waiting for Raft index") doesn't look like something from the library, but from Nomad, so it may be that what you're experiencing isn't purely a raft issue. I suggest you file this proposal as an issue on the https://github.com/hashicorp/nomad repo, and the maintainers of that project can decide whether it's better addressed in Nomad or here in the raft library.
Sure, if that is where you think this best lives. I raised it here mainly because this library is where leader health heartbeating lives.
The failure to load snapshots comes from raft's file_snapshot.go. Offhand I am not sure where the retry loop lives, but I suspect it is in raft as well.
Does Nomad actually have what it needs to detect that raft is failing to load a snapshot, abort the retries, and modify the cluster?
Hi @benbuzbee,
I retract what I said earlier. I agree with your original statement:
> If the leader is broken because it cannot load the snapshots [...] the other server should realize the leader is useless and usurp him
Possible fix: in replicateTo, if we can't load a snapshot, we should step down as leader. The current code specifically doesn't stop replication for this error; it probably should, but there are likely other details we need to consider here.
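To make the idea concrete, here is a minimal sketch in Go. It is a toy model under stated assumptions, not the library's code: the `node` type, the `openSnapshot` hook, and `errSnapshotLoad` are hypothetical names made up for illustration, and only the name `replicateTo` comes from the discussion above. It shows the proposed behaviour only: a snapshot load failure stops replication and drops leadership.

```go
package main

import (
	"errors"
	"fmt"
)

// State is a simplified node role; it is not the raft library's type.
type State int

const (
	Follower State = iota
	Leader
)

// errSnapshotLoad stands in for whatever error the snapshot store returns.
var errSnapshotLoad = errors.New("failed to load snapshot")

// node is a hypothetical stand-in for a raft server.
type node struct {
	state State
	// openSnapshot is injected so a load failure can be simulated.
	openSnapshot func() error
}

// replicateTo models one replication attempt that needs to send a snapshot.
// It returns true when the replication loop should stop.
func (n *node) replicateTo(follower string) (stop bool) {
	if n.state != Leader {
		return true
	}
	if err := n.openSnapshot(); err != nil {
		// Proposed behaviour: a leader that cannot load its snapshot steps
		// down instead of retrying forever, so an election can happen.
		fmt.Printf("cannot send snapshot to %s: %v; stepping down\n", follower, err)
		n.state = Follower
		return true
	}
	return false
}

func main() {
	n := &node{state: Leader, openSnapshot: func() error { return errSnapshotLoad }}
	stop := n.replicateTo("server-2")
	fmt.Println("stop replication:", stop, "still leader:", n.state == Leader)
}
```

A real change would also have to decide whether a single failure is enough to step down or whether a few retries are warranted, which is one of the details still to be considered.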
Hello folks! I have a pretty lazy bug report here, so apologies for not going deeper, but I wanted to float a stance by you and see if I can get away with it.
We had a cluster of Nomad servers that lost quorum and would not elect a new leader.
Looking at the logs, the leader at the time was logging this
And other servers were logging this
So here is my stance: if the leader is broken because it cannot load the snapshots (I have no idea how we got into this situation, but let's ignore that for now), the other servers should realize the leader is useless and usurp him, perhaps by invoking the Praetorian Guard.
Or, more down to earth: this state should cause a heartbeat failure in some way so that we can move past it and elect a new leader.
What do you think?