Losing leader election leading to a restart that causes CI failure

kumahq / kuma

Losing leader election leading to a restart that causes CI failure #11090

Open slonka opened 3 months ago

slonka commented 3 months ago

Description

Some time ago we introduced a restart counter in our E2E tests to verify that the CP does not restart (in theory it shouldn't if everything is fine, and a restart could indicate a problem with the CP, such as an OOM). This works fine in general, but losing leader election on k8s kills the CP, and that restart causes the CI to "fail".

One idea (from @michaelbeaumont) is to distinguish these restarts by exit code and filter them out.

slonka commented 3 months ago

Triage: this assumes that the exit code is different when leader election is lost; needs checking.

michaelbeaumont commented 3 months ago

I think the best we can do here, if we want to be safe, is this: if restartCount: 1 and lastState.terminated has exitCode: <leader election lost code> (after my change to exit with 0 on leader lost), then we can ignore the restart. Unfortunately, because only the last termination is kept, if restartCount > 1 we can't know whether an earlier restart exited with an error.
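
A minimal sketch of that rule in Go, to make the condition concrete. The types `ContainerStatus` and `Terminated` are hypothetical stand-ins for the `restartCount` and `lastState.terminated.exitCode` fields of a pod's container status, and the function names are made up for illustration. The value `0` for the leader-lost exit code is an assumption based on the change mentioned above and still needs checking.

```go
package main

import "fmt"

// Terminated is a hypothetical stand-in for the part of
// lastState.terminated that this check needs.
type Terminated struct {
	ExitCode int32
}

// ContainerStatus is a hypothetical stand-in for one entry of the
// pod's container statuses.
type ContainerStatus struct {
	RestartCount   int32
	LastTerminated *Terminated // nil if there is no recorded termination
}

// leaderLostExitCode is an ASSUMPTION: the code the CP exits with after
// losing leader election (0 once the "exit with 0 on leader lost"
// change is in).
const leaderLostExitCode int32 = 0

// restartIsIgnorable reports whether the restart can be safely attributed
// to a lost leader election. It is only true for exactly one restart,
// because only the last termination is kept and earlier ones are unknown.
func restartIsIgnorable(s ContainerStatus, leaderLostCode int32) bool {
	if s.RestartCount != 1 || s.LastTerminated == nil {
		return false
	}
	return s.LastTerminated.ExitCode == leaderLostCode
}

// restartCountsAsFailure is what the E2E restart check would use:
// any restart fails the check unless it is ignorable.
func restartCountsAsFailure(s ContainerStatus, leaderLostCode int32) bool {
	return s.RestartCount > 0 && !restartIsIgnorable(s, leaderLostCode)
}

func main() {
	leaderLost := ContainerStatus{RestartCount: 1, LastTerminated: &Terminated{ExitCode: leaderLostExitCode}}
	// 137 = 128 + SIGKILL, the exit code typically seen after an OOM kill.
	oomKilled := ContainerStatus{RestartCount: 1, LastTerminated: &Terminated{ExitCode: 137}}
	// Two restarts: the first termination is no longer visible, so stay safe.
	twoRestarts := ContainerStatus{RestartCount: 2, LastTerminated: &Terminated{ExitCode: leaderLostExitCode}}

	fmt.Println("leader lost fails CI:", restartCountsAsFailure(leaderLost, leaderLostExitCode))
	fmt.Println("OOM kill fails CI:", restartCountsAsFailure(oomKilled, leaderLostExitCode))
	fmt.Println("two restarts fail CI:", restartCountsAsFailure(twoRestarts, leaderLostExitCode))
}
```

Note that with an exit code of 0 this cannot tell a leader-election exit apart from any other clean exit, so the check is only as precise as the exit code is unique.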

github-actions[bot] commented 1 week ago

This issue was inactive for 90 days. It will be reviewed in the next triage meeting and might be closed. If you think this issue is still relevant, please comment on it or attend the next triage meeting.