Kube API unavailability results in a gloo container crash

Ati59 commented 1 year ago

Gloo Edge Version

1.13.x (latest stable)

Kubernetes Version

None

Describe the bug

A customer is experiencing regular Kubernetes API server outages (on all clouds: AWS, Azure, and GCP). When an outage happens, the gloo container in the gloo pod crashes because the leader election is unable to choose a leader. If the API server is unavailable during a scale-out event (an increase in load, for instance), the new gateway-proxy won't receive its configuration from gloo because of this election problem.

Steps to reproduce the bug

Install Gloo Edge (GE) using these Helm values:

global:
  glooMtls:
    enabled: true
  istioSDS:
    enabled: false

Make the API server temporarily unavailable (I used this iptables rule: iptables -A INPUT -p tcp --dport 6443 -j DROP)

Check the restart count and the logs of the last container:

2023-04-20T11:39:21.664432264Z stderr F E0420 11:39:21.663749       1 leaderelection.go:330] error retrieving resource lock gloo-system/gloo-ee: Get "https://10.6.0.1:443/api/v1/namespaces/gloo-system/configmaps/gloo-ee": context deadline exceeded
2023-04-20T11:39:21.664668472Z stderr F I0420 11:39:21.664461       1 leaderelection.go:283] failed to renew lease gloo-system/gloo-ee: timed out waiting for the condition
2023-04-20T11:39:21.666594222Z stderr F {"level":"error","ts":"2023-04-20T11:39:21.665Z","logger":"gloo-ee","caller":"kube/factory.go:61","msg":"Stopped Leading","version":"1.13.9","stacktrace":"github.com/solo-io/gloo/pkg/bootstrap/leaderelector/kube.(*kubeElectionFactory).StartElection.func2\n\t/go/pkg/mod/github.com/solo-io/gloo@v1.13.8/pkg/bootstrap/leaderelector/kube/factory.go:61\nk8s.io/client-go/tools/leaderelection.(*LeaderElector).Run.func1\n\t/go/pkg/mod/k8s.io/client-go@v0.22.4/tools/leaderelection/leaderelection.go:203\nk8s.io/client-go/tools/leaderelection.(*LeaderElector).Run\n\t/go/pkg/mod/k8s.io/client-go@v0.22.4/tools/leaderelection/leaderelection.go:213"}
2023-04-20T11:39:21.668590097Z stderr F {"level":"fatal","ts":"2023-04-20T11:39:21.667Z","caller":"setup/setup.go:49","msg":"lost leadership, quitting app","stacktrace":"github.com/solo-io/solo-projects/projects/gloo/pkg/setup.Main.func3\n\t/workspace/solo-projects/projects/gloo/pkg/setup/setup.go:49\ngithub.com/solo-io/gloo/pkg/bootstrap/leaderelector/kube.(*kubeElectionFactory).StartElection.func2\n\t/go/pkg/mod/github.com/solo-io/gloo@v1.13.8/pkg/bootstrap/leaderelector/kube/factory.go:62\nk8s.io/client-go/tools/leaderelection.(*LeaderElector).Run.func1\n\t/go/pkg/mod/k8s.io/client-go@v0.22.4/tools/leaderelection/leaderelection.go:203\nk8s.io/client-go/tools/leaderelection.(*LeaderElector).Run\n\t/go/pkg/mod/k8s.io/client-go@v0.22.4/tools/leaderelection/leaderelection.go:213"}

Expected Behavior

Gloo should be resilient to an API server outage, or at least should not crash.

Additional Context

In the customer's case, the following error message then appears in their logs when this happens: One or more envoy instances are not connected to the control plane for the last 1 minute
In the customer's case, federation is enabled.


kdorosh commented 1 year ago

This is by design. If we allowed gloo to continue to function as a leader during a kube apiserver outage, we would risk having two leaders in other failure modes. We should remove the panic and allow gloo to continue to serve the last-known xDS as a follower (effectively having two followers until the kube apiserver recovers). This idea is similar to the role xds relay could play for Gloo Edge.
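
To illustrate the idea of demoting instead of quitting, here is a minimal sketch in Go. It is not Gloo's actual code: the ControlPlane type, its methods, and the string-valued snapshot are hypothetical stand-ins, and it assumes that only the leader may publish new state while any replica may keep serving the last-known snapshot. The two callbacks mirror the "started leading" and "stopped leading" hooks seen in the stack trace above.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Role is the (hypothetical) election state of a control-plane replica.
type Role int

const (
	Follower Role = iota
	Leader
)

// ErrNotLeader is returned when a follower attempts a leader-only write.
var ErrNotLeader = errors.New("not leader: refusing to publish")

// ControlPlane is a hypothetical stand-in for the gloo container. It caches
// the last snapshot it published and serves it regardless of its role.
type ControlPlane struct {
	mu       sync.RWMutex
	role     Role
	snapshot string // last-known xDS snapshot, simplified to a string
}

// OnStartedLeading mirrors the "started leading" election callback.
func (c *ControlPlane) OnStartedLeading() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.role = Leader
}

// OnStoppedLeading mirrors the "stopped leading" election callback.
// Instead of exiting the process, it demotes the replica to follower
// and leaves the cached snapshot untouched.
func (c *ControlPlane) OnStoppedLeading() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.role = Follower
}

// Publish replaces the cached snapshot. Assumption for this sketch: only
// the leader may do so, which avoids two replicas acting as leader.
func (c *ControlPlane) Publish(s string) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.role != Leader {
		return ErrNotLeader
	}
	c.snapshot = s
	return nil
}

// Serve returns the last-known snapshot to proxies, in either role.
func (c *ControlPlane) Serve() string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.snapshot
}

func main() {
	cp := &ControlPlane{}
	cp.OnStartedLeading()
	_ = cp.Publish("snapshot-v1")

	// Simulated API server outage: lease renewal fails, leadership is lost.
	cp.OnStoppedLeading()

	err := cp.Publish("snapshot-v2") // rejected while a follower
	fmt.Println("serving:", cp.Serve(), "| publish error:", err)
}
```

In this sketch the replica never holds the leader role without the lease, so the two-leader risk is avoided, yet a newly scaled gateway-proxy could still fetch the cached snapshot during the outage.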

sam-heilbron commented 1 year ago

When we resolve this, let's also close out:

sam-heilbron commented 1 year ago

https://github.com/solo-io/gloo/blob/main/projects/gloo/pkg/setup/setup.go#L46 is the line of code in question

davidjumani commented 3 months ago

This will be fixed in 1.17.0
