Closed: kevin-shelaga closed this issue 2 months ago
After a sync with @yuval-k @kevin-shelaga @nrjpoddar @EItanya, we've decided to do the following:
@sam-heilbron it looks like the leaseholder is incorrect and doesn't get updated during these crashes.
- Allow candidates who lose leadership to fall back to being a follower gracefully. Previously, we exited fatally to ensure that we never have multiple leaders at once. The downside is that leadership can be lost as a result of throttling or a network failure with the ApiServer, either of which may occur intermittently in a Gloo Edge installation. While there are other ways to reduce the chance of these happening, we will change our leadership code to revert to being a follower (i.e. stop writing statuses) instead of crashing.
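For illustration, here is a minimal client-go sketch of that fallback behavior. This is not Gloo's actual code; the lease name `gloo`, the `gloo-system` namespace, the timing values, and the `isLeader` gate are assumptions. The point is that `OnStoppedLeading` demotes the process to a follower and re-enters the election loop rather than calling `log.Fatal`:

```go
// Minimal leader-election sketch (not Gloo's actual code): lost leadership
// demotes the pod to a follower instead of exiting the process.
package main

import (
	"context"
	"log"
	"os"
	"sync/atomic"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatalf("building in-cluster config: %v", err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Followers keep translating config but skip status writes; this flag is
	// what a status syncer would consult before writing.
	var isLeader atomic.Bool

	id, _ := os.Hostname()
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "gloo", Namespace: "gloo-system"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	ctx := context.Background()
	for {
		// RunOrDie blocks until leadership is lost (or ctx is cancelled).
		// Looping re-enters the election as a candidate instead of crashing.
		leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
			Lock:            lock,
			LeaseDuration:   15 * time.Second,
			RenewDeadline:   10 * time.Second,
			RetryPeriod:     2 * time.Second,
			ReleaseOnCancel: true,
			Callbacks: leaderelection.LeaderCallbacks{
				OnStartedLeading: func(ctx context.Context) {
					isLeader.Store(true)
					<-ctx.Done() // hold leadership until it is revoked
				},
				OnStoppedLeading: func() {
					// Previously this path exited fatally; now we just demote.
					isLeader.Store(false)
					log.Println("lost leadership, falling back to follower (status writes disabled)")
				},
			},
		})
		time.Sleep(2 * time.Second) // brief pause before rejoining the election
	}
}
```

The design intent is that losing the lease only disables status writes; translation keeps running, and the pod can win the lease back on a later iteration of the loop.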
Part 1 is complete and released in 1.13 and 1.12 EE. The second part has yet to be done.
This error also appears when we install Gloo, following the documentation, as only an ingress controller.
The error is:
E1115 10:33:09.618789 1 leaderelection.go:330] error retrieving resource lock gloo-system/gloo: leases.coordination.k8s.io "gloo" is forbidden: User "system:serviceaccount:gloo-system:gloo" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "gloo-system"
This happens because the role that grants access to leases is only available when the gateway is enabled.
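For reference, here is a hedged sketch (expressed in Go with client-go, though applying an equivalent manifest works just as well) of the kind of Role and RoleBinding that would grant the `gloo` ServiceAccount access to the lease when the gateway is disabled. The object names, verb list, and `gloo-system` namespace are assumptions for illustration, not the names the Helm chart actually uses:

```go
// Hypothetical sketch: create a Role/RoleBinding so the "gloo" ServiceAccount
// can manage the coordination.k8s.io leases used for leader election.
package main

import (
	"context"
	"log"

	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatalf("loading kubeconfig: %v", err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ns := "gloo-system"

	role := &rbacv1.Role{
		ObjectMeta: metav1.ObjectMeta{Name: "gloo-leader-election", Namespace: ns},
		Rules: []rbacv1.PolicyRule{{
			APIGroups: []string{"coordination.k8s.io"},
			Resources: []string{"leases"},
			Verbs:     []string{"get", "list", "watch", "create", "update"},
		}},
	}
	binding := &rbacv1.RoleBinding{
		ObjectMeta: metav1.ObjectMeta{Name: "gloo-leader-election", Namespace: ns},
		Subjects:   []rbacv1.Subject{{Kind: "ServiceAccount", Name: "gloo", Namespace: ns}},
		RoleRef: rbacv1.RoleRef{
			APIGroup: "rbac.authorization.k8s.io",
			Kind:     "Role",
			Name:     "gloo-leader-election",
		},
	}

	ctx := context.Background()
	if _, err := client.RbacV1().Roles(ns).Create(ctx, role, metav1.CreateOptions{}); err != nil {
		log.Fatalf("creating Role: %v", err)
	}
	if _, err := client.RbacV1().RoleBindings(ns).Create(ctx, binding, metav1.CreateOptions{}); err != nil {
		log.Fatalf("creating RoleBinding: %v", err)
	}
	log.Println("granted lease access to serviceaccount gloo-system:gloo")
}
```

With permissions like these in place, the `leases.coordination.k8s.io "gloo" is forbidden` error above should go away, which matches the comment that the role only ships when the gateway is enabled.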
Just for transparency, HA for the Gloo pod (translation, serving translated configuration to the gateway, and admission validation for new resources) has been working since 1.12.32.
@sam-heilbron should this issue be closed now?
This will be fixed in 1.17.0
Gloo Edge Version
1.12.x (latest stable)
Kubernetes Version
1.21.x
Compromise: if Gloo has a good configuration, it should not crash when it cannot reach the ApiServer. Namely, we should do the following:
Allow candidates who lose leadership to fall back to being a follower gracefully. Previously, we exited fatally to ensure that we never have multiple leaders at once. The downside is that leadership can be lost as a result of throttling or a network failure with the ApiServer, either of which may occur intermittently in a Gloo Edge installation. While there are other ways to reduce the chance of these happening, we will change our leadership code to revert to being a follower (i.e. stop writing statuses) instead of crashing.
Describe the bug
At this moment we aren't clear on the root cause, but when leader election fails, gloo will crash. etcd and the masters were healthy, but there were resource limits on gloo during this time.
Steps to reproduce the bug
N/A
Expected Behavior
Leader election shouldn't cause a crash.
Additional Context
No response