solo-io / gloo

The Feature-rich, Kubernetes-native, Next-Generation API Gateway Built on Envoy
https://docs.solo.io/
Apache License 2.0

When leader election fails, gloo crashes #7346

Closed: kevin-shelaga closed this issue 2 months ago

kevin-shelaga commented 1 year ago

Gloo Edge Version

1.12.x (latest stable)

Kubernetes Version

1.21.x

Compromise: if gloo has a valid configuration, it should not crash when it cannot reach the apiserver. Namely, we should do the following: allow candidates who lose leadership to fall back to a follower gracefully. Previously, we fataled to guarantee that we never have multiple leaders at once. The downside is that leadership can be lost due to either throttling or a network failure with the ApiServer, which may occur intermittently in a Gloo Edge installation. While there are other ways to reduce the chance of these happening, we will change our leadership code to revert to a follower (i.e. stop writing statuses) instead of crashing.
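
For illustration, here is a minimal sketch of the "fall back to a follower instead of crashing" behavior using the standard client-go leader election API. This is not Gloo's actual implementation; the gloo-system/gloo lease names, the POD_NAME identity, and the timing values are assumptions:

// Sketch only: on losing leadership, become a passive follower and re-enter
// the election, rather than calling log.Fatal and crashing the pod.
package main

import (
	"context"
	"log"
	"os"
	"sync/atomic"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatalf("building config: %v", err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Tracks whether this replica currently believes it is the leader.
	// Status-writing components consult this flag instead of assuming the
	// process only runs while it holds the lease.
	var isLeader atomic.Bool

	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Namespace: "gloo-system", Name: "gloo"}, // assumed names
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: os.Getenv("POD_NAME")},
	}

	ctx := context.Background()
	for {
		// RunOrDie returns once leadership is lost. Instead of exiting (the
		// old behavior, which crashed the pod on apiserver throttling or
		// transient network failures), mark ourselves a follower and rejoin
		// the election after a short backoff.
		leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
			Lock:          lock,
			LeaseDuration: 15 * time.Second,
			RenewDeadline: 10 * time.Second,
			RetryPeriod:   2 * time.Second,
			Callbacks: leaderelection.LeaderCallbacks{
				OnStartedLeading: func(ctx context.Context) { isLeader.Store(true) },
				OnStoppedLeading: func() {
					isLeader.Store(false)
					log.Println("lost leadership, falling back to follower")
				},
			},
		})
		time.Sleep(2 * time.Second)
	}
}

The key difference from the behavior shown in the logs below is that OnStoppedLeading no longer terminates the process.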

Describe the bug

At this moment we aren't clear on the root cause, but when leader election fails, gloo crashes. etcd and the masters were healthy, but resource limits were set on the gloo pod at the time.

I1019 20:02:47.456285       1 leaderelection.go:248] attempting to acquire leader lease grp-gloo-system/gloo-ee...
I1019 20:03:04.435270       1 leaderelection.go:258] successfully acquired lease grp-gloo-system/gloo-ee
I1019 20:03:12.519136       1 trace.go:205] Trace[120689475]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.22.4/tools/cache/reflector.go:167 (19-Oct-2022 20:02:47.564) (total time: 24954ms):
Trace[120689475]: ---"Objects listed" 24906ms (20:03:12.470)
Trace[120689475]: [24.954889394s] [24.954889394s] END
{"level":"error","ts":"2022-10-19T20:10:24.668Z","logger":"gloo-ee.v1.event_loop.setup.v1.event_loop.syncer.kubernetes_eds","caller":"kubernetes/eds.go:209","msg":"upstream grp-pc-claims-capabilities-loss-report.v1-review-update-postman-54655: port 8080 not found for service v1-review-update-postman-54655","version":"1.12.28","stacktrace":"github.com/solo-io/gloo/projects/gloo/pkg/plugins/kubernetes.(*edsWatcher).List\n\t/go/pkg/mod/github.com/solo-io/gloo@v1.12.29/projects/gloo/pkg/plugins/kubernetes/eds.go:209\ngithub.com/solo-io/gloo/projects/gloo/pkg/plugins/kubernetes.(*edsWatcher).watch.func1\n\t/go/pkg/mod/github.com/solo-io/gloo@v1.12.29/projects/gloo/pkg/plugins/kubernetes/eds.go:230\ngithub.com/solo-io/gloo/projects/gloo/pkg/plugins/kubernetes.(*edsWatcher).watch.func2\n\t/go/pkg/mod/github.com/solo-io/gloo@v1.12.29/projects/gloo/pkg/plugins/kubernetes/eds.go:257"}
{"level":"error","ts":"2022-10-19T20:22:13.930Z","logger":"gloo-ee.v1.event_loop.setup.v1.event_loop.syncer.kubernetes_eds","caller":"kubernetes/eds.go:209","msg":"upstream grp-pc-auto-3p-reports.v1-review-wiremock-testing-76192: port 8080 not found for service v1-review-wiremock-testing-76192","version":"1.12.28","stacktrace":"github.com/solo-io/gloo/projects/gloo/pkg/plugins/kubernetes.(*edsWatcher).List\n\t/go/pkg/mod/github.com/solo-io/gloo@v1.12.29/projects/gloo/pkg/plugins/kubernetes/eds.go:209\ngithub.com/solo-io/gloo/projects/gloo/pkg/plugins/kubernetes.(*edsWatcher).watch.func1\n\t/go/pkg/mod/github.com/solo-io/gloo@v1.12.29/projects/gloo/pkg/plugins/kubernetes/eds.go:230\ngithub.com/solo-io/gloo/projects/gloo/pkg/plugins/kubernetes.(*edsWatcher).watch.func2\n\t/go/pkg/mod/github.com/solo-io/gloo@v1.12.29/projects/gloo/pkg/plugins/kubernetes/eds.go:257"}
E1019 20:22:41.912708       1 leaderelection.go:367] Failed to update lock: etcdserver: request timed out
E1019 20:22:44.902885       1 leaderelection.go:330] error retrieving resource lock grp-gloo-system/gloo-ee: Get "https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/grp-gloo-system/leases/gloo-ee": context deadline exceeded
I1019 20:22:44.902969       1 leaderelection.go:283] failed to renew lease grp-gloo-system/gloo-ee: timed out waiting for the condition
{"level":"error","ts":"2022-10-19T20:22:44.902Z","logger":"gloo-ee","caller":"kube/factory.go:61","msg":"Stopped Leading","version":"1.12.28","stacktrace":"github.com/solo-io/gloo/pkg/bootstrap/leaderelector/kube.(*kubeElectionFactory).StartElection.func2\n\t/go/pkg/mod/github.com/solo-io/gloo@v1.12.29/pkg/bootstrap/leaderelector/kube/factory.go:61\nk8s.io/client-go/tools/leaderelection.(*LeaderElector).Run.func1\n\t/go/pkg/mod/k8s.io/client-go@v0.22.4/tools/leaderelection/leaderelection.go:203\nk8s.io/client-go/tools/leaderelection.(*LeaderElector).Run\n\t/go/pkg/mod/k8s.io/client-go@v0.22.4/tools/leaderelection/leaderelection.go:213"}
{"level":"fatal","ts":"2022-10-19T20:22:44.903Z","caller":"setup/setup.go:47","msg":"lost leadership, quitting app","stacktrace":"github.com/solo-io/solo-projects/projects/gloo/pkg/setup.Main.func3\n\t/workspace/solo-projects/projects/gloo/pkg/setup/setup.go:47\ngithub.com/solo-io/gloo/pkg/bootstrap/leaderelector/kube.(*kubeElectionFactory).StartElection.func2\n\t/go/pkg/mod/github.com/solo-io/gloo@v1.12.29/pkg/bootstrap/leaderelector/kube/factory.go:62\nk8s.io/client-go/tools/leaderelection.(*LeaderElector).Run.func1\n\t/go/pkg/mod/k8s.io/client-go@v0.22.4/tools/leaderelection/leaderelection.go:203\nk8s.io/client-go/tools/leaderelection.(*LeaderElector).Run\n\t/go/pkg/mod/k8s.io/client-go@v0.22.4/tools/leaderelection/leaderelection.go:213"}

Steps to reproduce the bug

N/A

Expected Behavior

Leader election failure shouldn't cause a crash.

Additional Context

No response

sam-heilbron commented 1 year ago

After a sync with @yuval-k @kevin-shelaga @nrjpoddar @EItanya we've decided to do the following:

kevin-shelaga commented 1 year ago

@sam-heilbron it looks like the leaseholder is incorrect and doesn't get updated during these crashes

sam-heilbron commented 1 year ago
  • Allow candidates who lose leadership to fall back to a follower gracefully. Previously, we fataled to guarantee that we never have multiple leaders at once. The downside is that leadership can be lost due to either throttling or a network failure with the ApiServer, which may occur intermittently in a Gloo Edge installation. While there are other ways to reduce the chance of these happening, we will change our leadership code to revert to a follower (i.e. stop writing statuses) instead of crashing.

Part 1 is complete and released in 1.13 and 1.12 EE. The second part has yet to be done.
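
To make "revert to a follower (i.e. stop writing statuses)" concrete, here is a hypothetical sketch. Report, StatusWriter, and syncStatuses are not Gloo's real types; they only illustrate how a follower keeps translating and serving configuration while skipping status writes:

package syncer

import (
	"context"
	"sync/atomic"
)

// Report and StatusWriter are hypothetical stand-ins for the real reporting
// types; they exist only to keep the sketch self-contained.
type Report struct{ Resource, Status string }

type StatusWriter interface {
	WriteStatus(ctx context.Context, r Report) error
}

// syncStatuses persists statuses only on the current leader. Followers stay
// running and simply skip the write, instead of the process exiting when
// leadership is lost.
func syncStatuses(ctx context.Context, isLeader *atomic.Bool, reports []Report, w StatusWriter) error {
	if !isLeader.Load() {
		return nil // passive follower: no status writes, no crash
	}
	for _, r := range reports {
		if err := w.WriteStatus(ctx, r); err != nil {
			return err
		}
	}
	return nil
}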

davinkevin commented 1 year ago

This error also appears when we install gloo as an ingress controller only, following the documentation. The error is:

E1115 10:33:09.618789       1 leaderelection.go:330] error retrieving resource lock gloo-system/gloo: leases.coordination.k8s.io "gloo" is forbidden: User "system:serviceaccount:gloo-system:gloo" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "gloo-system"

This happens because the Role that grants access to leases is only created when the gateway is enabled.
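
For reference, the permission the error complains about boils down to a Role on leases in coordination.k8s.io. The sketch below creates such a Role with client-go purely for illustration; the gloo-leader-election name and gloo-system namespace are assumptions, a matching RoleBinding to the gloo service account is also required, and in a normal install the Helm chart renders the equivalent Role when the gateway is enabled:

package main

import (
	"context"
	"log"

	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatalf("building config: %v", err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// The forbidden request in the log was a GET on leases in
	// coordination.k8s.io; leader election also needs create and update.
	role := &rbacv1.Role{
		ObjectMeta: metav1.ObjectMeta{Name: "gloo-leader-election", Namespace: "gloo-system"}, // assumed names
		Rules: []rbacv1.PolicyRule{{
			APIGroups: []string{"coordination.k8s.io"},
			Resources: []string{"leases"},
			Verbs:     []string{"get", "create", "update"},
		}},
	}

	if _, err := client.RbacV1().Roles(role.Namespace).Create(context.Background(), role, metav1.CreateOptions{}); err != nil {
		log.Fatalf("creating role: %v", err)
	}
}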

SantoDE commented 1 year ago

Just for transparency: HA for the gloo pod (translation, serving translated configuration to the gateway, and admission validation for new resources) has been working since 1.12.32.

DoroNahari commented 1 year ago

@sam-heilbron should this issue be closed now?

davidjumani commented 2 months ago

This will be fixed in 1.17.0