restatedev / restate

Restate is the platform for building resilient applications that tolerate all infrastructure faults w/o the need for a PhD.
https://docs.restate.dev

Handle stale metadata version when starting a CC leader #2325

Closed muhamadazmy closed 1 week ago

muhamadazmy commented 1 week ago

Handle stale metadata version when starting a CC leader

Summary: In some situations, when a Cluster Controller (CC) leader starts again, some or all of the nodes in the cluster can still hold an outdated version of the metadata, older than the version that contains the new leader's generational ID.

This causes some of the "get state" requests to fail, so the CC concludes that those nodes are dead and unnecessarily reschedules partition processors (PPs).

To avoid this, we introduce a third node state, "Suspect". A Suspect node could be either dead or alive, and stays in that state until we can be sure which.
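
As a rough illustration of the idea above, here is a minimal Rust sketch of a three-state liveness tracker. It is not the code from this PR: the names `NodeState`, `NodeTracker`, `observe`, `should_reschedule`, and the `MAX_SUSPECT_FAILURES` threshold are all hypothetical, and promoting a node from Suspect to Dead after a fixed number of consecutive failures is an assumption made only for this example.

```rust
/// Hypothetical liveness states as seen by the cluster controller.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NodeState {
    Alive,
    /// A "get state" request failed, but we cannot yet tell whether the node is dead.
    Suspect,
    Dead,
}

/// Assumed threshold: consecutive failures before a Suspect node is declared Dead.
const MAX_SUSPECT_FAILURES: u32 = 3;

/// Hypothetical per-node tracker kept by the cluster controller.
#[derive(Debug)]
struct NodeTracker {
    state: NodeState,
    consecutive_failures: u32,
}

impl NodeTracker {
    fn new() -> Self {
        Self {
            state: NodeState::Alive,
            consecutive_failures: 0,
        }
    }

    /// Record the outcome of one "get state" request and return the new state.
    fn observe(&mut self, get_state_succeeded: bool) -> NodeState {
        if get_state_succeeded {
            // Any successful response clears the suspicion.
            self.consecutive_failures = 0;
            self.state = NodeState::Alive;
        } else {
            self.consecutive_failures += 1;
            self.state = if self.consecutive_failures >= MAX_SUSPECT_FAILURES {
                NodeState::Dead
            } else {
                NodeState::Suspect
            };
        }
        self.state
    }

    /// Only nodes known to be dead should trigger a reschedule of their PPs.
    fn should_reschedule(&self) -> bool {
        self.state == NodeState::Dead
    }
}
```

In this sketch, a single failed "get state" request only marks the node as Suspect, so no reschedule happens unless the failures persist.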

muhamadazmy commented 1 week ago

NOTE: This PR should eventually be folded into #2252, since it fixes an issue in #2252. I opened it as a separate PR because it's a big change and I wanted to make sure the approach is agreed upon before folding.