Closed: Ati59 closed this issue 3 months ago
This is by design: if we allowed gloo to continue to function as a leader during a kube-apiserver outage, we would risk having two leaders in other failure modes. We should remove the panic and allow gloo to continue to serve the last-known xDS as a follower (effectively having two followers until the kube-apiserver recovers). This idea is similar to the role an xDS relay could play for Gloo Edge.
When we resolve this, let's also close out:
https://github.com/solo-io/gloo/blob/main/projects/gloo/pkg/setup/setup.go#L46 is the line of code in question
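For illustration, a minimal sketch of the non-panicking pattern, assuming client-go's leaderelection package; the `isLeader` flag, the retry loop, and all names here are hypothetical, not the actual setup.go wiring:

```go
// Sketch: demote to follower instead of panicking when leadership is lost.
package election

import (
	"context"
	"log"
	"sync/atomic"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// isLeader gates leader-only work (e.g. status writes). Serving xDS does not
// check it, so a follower keeps serving the last-known snapshot.
var isLeader atomic.Bool

func runLeaderElection(ctx context.Context, client kubernetes.Interface, id string) {
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "gloo", Namespace: "gloo-system"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}
	// RunOrDie returns once leadership is lost; loop so the pod campaigns
	// again after the kube-apiserver recovers instead of exiting.
	for ctx.Err() == nil {
		leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
			Lock:            lock,
			LeaseDuration:   15 * time.Second,
			RenewDeadline:   10 * time.Second,
			RetryPeriod:     2 * time.Second,
			ReleaseOnCancel: true,
			Callbacks: leaderelection.LeaderCallbacks{
				OnStartedLeading: func(ctx context.Context) { isLeader.Store(true) },
				OnStoppedLeading: func() {
					// The old behavior would panic here, crashing the pod
					// during an apiserver outage. Demoting keeps xDS up.
					isLeader.Store(false)
					log.Println("lost leadership; serving last-known xDS as a follower")
				},
				OnNewLeader: func(identity string) { log.Printf("leader is %s", identity) },
			},
		})
	}
}
```

With the apiserver down, lease renewal fails and every replica ends up a follower still serving cached config; the loop lets a pod campaign again once the apiserver is reachable.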
This will be fixed in 1.17.0
Gloo Edge Version
1.13.x (latest stable)
Kubernetes Version
None
Describe the bug
A customer is experiencing regular kube-apiserver outages (on all clouds: AWS, Azure, and GCP). When an outage happens, the gloo container in the gloo pod crashes because leader election cannot elect a leader. If the apiserver is unavailable during a scale-out event (for instance, under increased load), a newly started gateway-proxy will not receive its configuration from gloo because of this election problem.
Steps to reproduce the bug
Simulate a kube-apiserver outage by dropping traffic to its port (6443 by default) on the node running gloo:

iptables -A INPUT -p tcp --dport 6443 -j DROP

(Remove the rule afterwards with iptables -D INPUT -p tcp --dport 6443 -j DROP.)
Expected Behavior
Gloo should be resilient to kube-apiserver outages; at a minimum, it should not crash.
Additional Context
The following message is observed during the outage: "One or more envoy instances are not connected to the control plane for the last 1 minute"