Open c-kruse opened 3 weeks ago
Options to mitigate:
- **pessimistic:** continue to try to respect the lock, running status sync and the controller only while the lock is held, but attempt to re-acquire the lock when it is lost.
- **optimistic:** risk multiple active status sync and controller instances by "ignoring" a lost lock. Definitely not ideal, and I do think we'd run into weird edge cases, especially around duplicate controller event sources coming and going, but I think it would mostly work okay aside from the extra noise.
For posterity: looks like, at least in v2, we're going to try a third option: restart when we lose the lock.
**Describe the bug**
I suspect that when a router holds the skupper-site-leader lock and some disruption causes it to lose that lock, that router will never attempt to reclaim the lock. This would mean the network status would stop getting updated until that router is restarted (and in v2 the vanflow controller event source would stop producing events).
**How To Reproduce**
Steps to reproduce the behavior: TBD
**Expected behavior**
I would expect the leader election mechanism to back off and retry.
**Environment details**
Observed in v2 after one of two cluster nodes was temporarily "not ready". The logic here does not obviously differ from v1, but I do not yet have a clever way to confirm that.
**EDIT** Encountered this again; this time the router container was restarted after failing its health check while config-sync was allowed to continue running. Somewhere in there, config-sync lost its lock.