Handle stale metadata version when starting a CC leader

restatedev / restate

Restate is the platform for building resilient applications that tolerate all infrastructure faults w/o the need for a PhD.

Summary: In some situations, when a Cluster Controller (CC) leader starts again, some or all of the nodes in the cluster may still hold an outdated version of the metadata and therefore do not yet know the new leader's generational ID.

As a result, some of the "get state" requests fail, so the CC concludes that those nodes are dead and unnecessarily reschedules partition processors (PPs).

To avoid this, we introduce a third node state, "Suspect". A suspect node is treated as both possibly dead and possibly alive until we can be sure.
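
A minimal sketch of the idea, not the code in this PR: every name here (`NodeLiveness`, `GetStateOutcome`, `classify`, `should_reschedule`, and the `may_have_stale_metadata` flag) is hypothetical. It assumes the CC can tell whether a failed "get state" request might be explained by stale metadata on the node, and it only reschedules PPs for nodes that are known to be dead.

```rust
/// Liveness of a node as seen by the CC (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NodeLiveness {
    Alive,
    Dead,
    /// Neither confirmed dead nor confirmed alive yet.
    Suspect,
}

/// Outcome of a single "get state" request (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum GetStateOutcome {
    Ok,
    Failed,
}

/// Classify a node after one "get state" request.
///
/// `may_have_stale_metadata` is an assumed input: true while the node might
/// not yet have observed the metadata version that contains the new leader.
fn classify(outcome: GetStateOutcome, may_have_stale_metadata: bool) -> NodeLiveness {
    match (outcome, may_have_stale_metadata) {
        (GetStateOutcome::Ok, _) => NodeLiveness::Alive,
        // The failure may be caused by stale metadata, so do not declare the node dead yet.
        (GetStateOutcome::Failed, true) => NodeLiveness::Suspect,
        (GetStateOutcome::Failed, false) => NodeLiveness::Dead,
    }
}

/// Only nodes known to be dead trigger a reschedule; suspect nodes are left alone.
fn should_reschedule(liveness: NodeLiveness) -> bool {
    liveness == NodeLiveness::Dead
}
```

With this shape, a failed request from a node that may still be catching up on metadata yields `Suspect`, and `should_reschedule` returns `false` for it, which avoids the unnecessary reschedule described above.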

Handle stale metadata version when starting a CC leader #2325