Open achamayou opened 5 years ago
We've noticed a similar issue in one of our tests. We suspend the leader node for some time to force an election. The other nodes choose a new leader and happily make progress, but when the original leader is unsuspended it gets an unusually large tick (covering the entire span of its suspension), and this triggers an election. We explored some mitigations for this, but fundamentally it falls into the same category: Raft requires regular, accurate time updates from the host, and without them it is possible to trigger spurious elections.
The only fix is some form of trusted time within the enclave, perhaps from node-gossip channels or perhaps from spinning to spend time within the enclave, but we have no firm plan for this yet.
We think implementing the PreVote extension to Raft (https://web.stanford.edu/~ouster/cgi-bin/papers/OngaroPhD.pdf, Section 9.6, ticketed in #2577) will mitigate this problem without requiring an expensive busy-wait.
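To illustrate why PreVote helps here, below is a minimal sketch of the pre-vote check, not the actual implementation tracked in #2577. All names (`Node`, `PreVoteRequest`, `grant_pre_vote`, `should_start_election`) are hypothetical. The idea is that a node only increments its term and starts a real election if a majority indicates its log is up to date enough to win, so a resumed node with a stale log cannot disrupt the cluster.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PreVoteRequest:
    # Term the candidate *would* campaign in; its real term is not yet incremented.
    term: int
    last_log_term: int
    last_log_index: int


@dataclass
class Node:
    # Hypothetical, simplified node state for illustration only.
    current_term: int
    last_log_term: int
    last_log_index: int

    def grant_pre_vote(self, req: PreVoteRequest) -> bool:
        # Reject candidates campaigning for a term older than ours.
        if req.term < self.current_term:
            return False
        # Standard Raft up-to-date check: compare last log term, then last index.
        return (req.last_log_term, req.last_log_index) >= (
            self.last_log_term,
            self.last_log_index,
        )


def should_start_election(candidate: Node, peers: List[Node]) -> bool:
    """Return True only if a majority (including the candidate) would vote for it."""
    req = PreVoteRequest(
        term=candidate.current_term + 1,
        last_log_term=candidate.last_log_term,
        last_log_index=candidate.last_log_index,
    )
    grants = 1 + sum(1 for p in peers if p.grant_pre_vote(req))
    cluster_size = len(peers) + 1
    return grants > cluster_size // 2


# A node resumed after suspension: its log is behind the rest of the cluster.
stale = Node(current_term=1, last_log_term=1, last_log_index=5)
healthy_peers = [
    Node(current_term=2, last_log_term=2, last_log_index=10),
    Node(current_term=2, last_log_term=2, last_log_index=10),
]
stale_can_campaign = should_start_election(stale, healthy_peers)

# A node whose log matches its peers can still campaign normally.
fresh = Node(current_term=2, last_log_term=2, last_log_index=10)
fresh_can_campaign = should_start_election(fresh, healthy_peers)
```

In this sketch the stale node fails the pre-vote, so it never bumps its term and the new leader is left undisturbed, regardless of how large a tick the host delivers.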
Reported by @dantengsky in #86
The most straightforward fix for this is to pick the random election timeout inside the enclave, so that the enclave can make sure it isn't shorter than a lower bound.
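A rough sketch of that idea, assuming hypothetical bounds and function names (`MIN_ELECTION_TIMEOUT_MS`, `MAX_ELECTION_TIMEOUT_MS`, `pick_election_timeout_ms`) that are not taken from the codebase: the trusted code draws the timeout itself and validates the range, so the host cannot supply a value below the lower bound.

```python
import random

# Hypothetical bounds, chosen for illustration only.
MIN_ELECTION_TIMEOUT_MS = 150
MAX_ELECTION_TIMEOUT_MS = 300


def pick_election_timeout_ms(
    rng: random.Random,
    lower: int = MIN_ELECTION_TIMEOUT_MS,
    upper: int = MAX_ELECTION_TIMEOUT_MS,
) -> int:
    """Draw a randomized election timeout in [lower, upper] milliseconds.

    Meant to run inside the trusted code, with `rng` standing in for an
    entropy source the host does not control.
    """
    if lower <= 0 or upper < lower:
        raise ValueError("expected 0 < lower <= upper")
    # randint is inclusive on both ends, so the result is never below `lower`.
    return rng.randint(lower, upper)


# random.SystemRandom is used here only as a stand-in for in-enclave entropy.
timeout_ms = pick_election_timeout_ms(random.SystemRandom())
```

Note that this only bounds the timeout value itself. It does not address the host delivering inaccurate ticks, which is the separate problem discussed above.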