solo-io / gloo

The Feature-rich, Kubernetes-native, Next-Generation API Gateway Built on Envoy
https://docs.solo.io/
Apache License 2.0

Kube API unavailability results in a gloo container crash #8107

Closed Ati59 closed 3 months ago

Ati59 commented 1 year ago

Gloo Edge Version

1.13.x (latest stable)

Kubernetes Version

None

Describe the bug

A customer is facing regular kube-API outages (on all clouds: AWS, Azure, and GCP). When an outage happens, the gloo container in the gloo pod crashes because leader election cannot choose a leader. If the API server is unavailable during a scale-out event (an increase in load, for instance), the new gateway-proxy pods won't receive their configuration from gloo because of this election problem.

Steps to reproduce the bug

Expected Behavior

Gloo should be resilient to a kube API outage; at a minimum, it should not crash.

Additional Context


kdorosh commented 1 year ago

This is by design. If we allowed gloo to continue to function as a leader during a kube apiserver outage, we would risk having two leaders in other failure modes. We should remove the panic and allow gloo to continue serving the last-known xDS as a follower (effectively having two followers until the kube apiserver recovers). This idea is similar to the role xds relay could play for Gloo Edge.
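A minimal sketch of the behavior proposed above: on losing leadership, demote to follower and keep serving the last-known snapshot instead of panicking. This is not Gloo's actual code. `ControlPlane`, `Snapshot`, `Publish`, and `Serve` are hypothetical names for illustration, and the callback names only mirror the `OnStartedLeading` / `OnStoppedLeading` callbacks from client-go's `leaderelection` package. The block uses only the standard library and does not talk to a real API server.

```go
package main

import (
	"fmt"
	"sync"
)

// Snapshot is a hypothetical stand-in for the xDS configuration served to proxies.
type Snapshot struct {
	Version string
}

// ControlPlane is a hypothetical model of a gloo replica (not Gloo's real type).
type ControlPlane struct {
	mu       sync.RWMutex
	leader   bool
	snapshot Snapshot
}

// OnStartedLeading marks this replica as leader after winning the election.
func (c *ControlPlane) OnStartedLeading() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.leader = true
}

// OnStoppedLeading runs when leadership is lost, for example because the
// lease could not be renewed while the API server was unreachable.
// Instead of panicking, the replica demotes itself to follower.
func (c *ControlPlane) OnStoppedLeading() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.leader = false
}

// IsLeader reports the current role.
func (c *ControlPlane) IsLeader() bool {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.leader
}

// Publish stores a new snapshot. Only the leader may write, so a demoted
// replica never acts as a second leader. It returns false when ignored.
func (c *ControlPlane) Publish(s Snapshot) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if !c.leader {
		return false
	}
	c.snapshot = s
	return true
}

// Serve returns the last-known snapshot regardless of role, so proxies
// (including newly scaled-out ones) still receive configuration.
func (c *ControlPlane) Serve() Snapshot {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.snapshot
}

func main() {
	cp := &ControlPlane{}
	cp.OnStartedLeading()
	cp.Publish(Snapshot{Version: "v1"})

	// Simulated API server outage: leadership is lost, but nothing crashes.
	cp.OnStoppedLeading()
	accepted := cp.Publish(Snapshot{Version: "v2"})

	fmt.Println("leader:", cp.IsLeader())
	fmt.Println("write accepted while follower:", accepted)
	fmt.Println("still serving:", cp.Serve().Version)
}
```

The design choice here is that reads (serving xDS) never depend on the election state, while writes do, which is what keeps the "two followers" state safe until the apiserver recovers.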

sam-heilbron commented 1 year ago

When we resolve this, let's also close out:

sam-heilbron commented 1 year ago

https://github.com/solo-io/gloo/blob/main/projects/gloo/pkg/setup/setup.go#L46 is the line of code in question

davidjumani commented 3 months ago

This will be fixed in 1.17.0