Handle stale metadata version when starting a CC leader
Summary:
In some situations, when a Cluster Controller (CC) leader starts again,
some or all of the nodes in the cluster can hold an outdated version of
the metadata, one that does not yet contain the new leader's
generational id.
This causes some of the "get state" requests to fail, so the CC
concludes that those nodes are dead and unnecessarily reschedules PPs.
To avoid this, we introduce a third node state, "Suspect": a node that
is treated as both dead and alive until we can be sure.
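As a rough illustration of the three-state idea described above, here is a minimal sketch. All names (`NodeLiveness`, `GetStateOutcome`, `classify`, `should_reschedule`) are hypothetical and not taken from this PR. The sketch also assumes the CC can tell a failure that may be due to stale metadata apart from a confirmed failure; how that is actually decided is up to the real implementation.

```rust
/// Liveness of a node as seen by the CC (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NodeLiveness {
    Alive,
    Dead,
    /// Neither confirmed dead nor confirmed alive yet.
    Suspect,
}

/// Outcome of a single "get state" request (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum GetStateOutcome {
    /// The node answered the request.
    Responded,
    /// The request failed, but the node may only be lagging on metadata.
    FailedMaybeStale,
    /// The request failed and stale metadata has been ruled out (assumption).
    FailedConfirmed,
}

/// Map a request outcome to a liveness state.
fn classify(outcome: GetStateOutcome) -> NodeLiveness {
    match outcome {
        GetStateOutcome::Responded => NodeLiveness::Alive,
        GetStateOutcome::FailedMaybeStale => NodeLiveness::Suspect,
        GetStateOutcome::FailedConfirmed => NodeLiveness::Dead,
    }
}

/// Only a confirmed-dead node triggers a reschedule of its PPs;
/// a suspect node is left alone until its state is resolved.
fn should_reschedule(liveness: NodeLiveness) -> bool {
    liveness == NodeLiveness::Dead
}
```

The point of the sketch is that a failed request on its own no longer implies a reschedule: a suspect node keeps its PPs until it is confirmed dead or alive.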
NOTE: This PR should eventually be folded into #2252, since it fixes an issue in #2252. I opened it as its own PR because it is a big change and I wanted to make sure the approach is agreed upon before folding.