solo-io / gloo

The Feature-rich, Kubernetes-native, Next-Generation API Gateway Built on Envoy
https://docs.solo.io/
Apache License 2.0

Gloo pod is failing when all the upstreams are static and configured with consul integration #8425

Open prasanth-openet opened 1 year ago

prasanth-openet commented 1 year ago

Gloo Edge Version

1.13.x (beta)

Kubernetes Version

v1.24.0

Describe the bug

I am trying to install the gloo chart (version v1.13.0) on Kubernetes in a namespace other than 'gloo-system'. However, I can see that the sds container in the gloo pod is not getting ready. In my values.yaml file, I have disabled service discovery and provided Consul integration details. This issue does not occur when I install version v1.12.56 or earlier, so it appears to be broken from gloo versions >=1.13.0 onwards.

I would appreciate any help, thank you.

Steps to reproduce the bug

1) Install the chart in a namespace other than gloo-system.

helm upgrade --install gloo -n my-namespace --create-namespace --wait --debug --values values.yaml gloo-1.13.0.tgz

The following is my values.yaml file:

discovery:
  enabled: false
settings:
  singleNamespace: true
  disableKubernetesDestinations: true
  integrations:
    consul:
      httpAddress: http://consul-consul-server.consul.svc:8500
      dnsAddress: kube-dns.kube-system.svc:53
      serviceDiscovery: {}
global:
  glooMtls:
    enabled: true
  image:
    registry: quay.io/solo-io
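
As a quick sanity check, the rendered Settings resource can be inspected after install to confirm the Consul integration was picked up; a sketch, assuming the chart's default Settings resource name of 'default':

# inspect the consul block of the installed Settings resource (sketch)
kubectl get settings.gloo.solo.io default -n my-namespace -o jsonpath='{.spec.consul}'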

2) kubectl get pods -n gloo-system. You can observe that only 2 out of 3 containers in the gloo pod have started.

the kubectl output is as follows

Every 2.0s: kubectl get pods -n gloo-system                                   Wed Jun 28 14:57:13 2023

NAME                            READY   STATUS    RESTARTS   AGE
gateway-proxy-c5b68c9df-n4mzd   2/2     Running   0          43s
gloo-bd6954685-8cg7z            2/3     Running   0          43s

3) glooctl check gives the following results

 Checking deployments... 1 Errors!
Checking pods... 2 Errors!
Checking upstreams... OK
Checking upstream groups... OK
Checking auth configs... OK
Checking rate limit configs... OK
Checking VirtualHostOptions... OK
Checking RouteOptions... OK
Checking secrets... OK
Checking virtual services... OK
Checking gateways... OK
Checking proxies... Skipping due to an error in checking deployments
Skipping due to an error in checking deployments
Error: 5 errors occurred:
        * Deployment gloo in namespace gloo-system is not available! Message: Deployment does not have minimum availability.
        * Pod gloo-bd6954685-8cg7z in namespace gloo-system is not ready! Message: containers with unready status: [sds]
        * Not all containers in pod gloo-bd6954685-8cg7z in namespace gloo-system are ready! Message: containers with unready status: [sds]
        * proxy check was skipped due to an error in checking deployments
        * xds metrics check was skipped due to an error in checking deployments
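
To see why the sds container is unready, the pod can be inspected directly; a sketch using the pod name from the output above (adjust the namespace to wherever the chart was installed):

# container states and readiness probe failures for the gloo pod
kubectl describe pod gloo-bd6954685-8cg7z -n gloo-system

# logs from the unready sds container
kubectl logs gloo-bd6954685-8cg7z -n gloo-system -c sds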

Note: the issue is reproducible only if you use a namespace other than 'gloo-system'. In this test I am using 'my-namespace'. If we use the namespace gloo-system, the gloo pod runs without any issues.

Expected Behavior

I expect the gloo pod to start and become ready.

Additional Context

No response

ncouse commented 1 year ago

Some additional detail on the issue observed here.

This requires at least the following conditions:

We use the Consul integration as the default discovery mechanism. Discovery is turned off, as we don't want the auto-discovered upstreams; the side effect is that there are no upstreams until we add our own after the Gloo chart installation (a sketch of such an upstream appears at the end of this comment).

From checking the OSS code, we observe that the SDS container is not Ready because the Gloo container has not opened its gRPC port yet. This is why we only see the issue if mTLS is enabled.

This seems to be due to the startup order: the Gloo container only opens the gRPC port after it checks for healthy endpoints. If we set endpointsWarmingTimeout to 0s to disable the feature, we do not have this problem.

Also, we set singleNamespace and we install in a custom namespace (not gloo-system). This seems to be an important part of the problem. We don't want Gloo looking in other namespaces in our case, as the install is restricted to the single installed namespace. We observed that this does not occur if installed in the gloo-system namespace, but it does if installed in any other namespace.

In summary, installing with the settings in the original description, in a namespace other than gloo-system, will exhibit the problem, and the chart installation will fail/timeout.
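
For illustration, a sketch of the kind of static Upstream added after installation (the name and address are placeholders, not taken from the original report):

apiVersion: gloo.solo.io/v1
kind: Upstream
metadata:
  name: example-static-upstream   # placeholder name
  namespace: my-namespace
spec:
  static:
    hosts:
    - addr: example-service.internal   # placeholder address
      port: 8080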

ncouse commented 1 year ago

Currently the only ways to work around this are to either:

- set endpointsWarmingTimeout to 0s, disabling the endpoint warming check (a sketch follows below), or
- install into the gloo-system namespace.

This was not an issue in previous versions of Gloo.
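
A minimal sketch of the first workaround, assuming endpointsWarmingTimeout can be set directly on the installed Settings custom resource (the resource name 'default' and the namespace are chart defaults and may differ in your install):

# disable endpoint warming by setting the timeout to 0s (sketch)
kubectl patch settings.gloo.solo.io default -n gloo-system --type merge \
  -p '{"spec":{"gloo":{"endpointsWarmingTimeout":"0s"}}}'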

ncouse commented 1 year ago

Further analysis, by process of elimination of versions, shows that the bug was introduced in 1.13.0-beta10.

This version introduces some new settings for Consul.

We are also now using settings under consulUpstreamDiscovery, as we have noticed that occasionally Gloo is out of sync with the services in the Consul catalog. Changing these settings solves this.

These settings do not seem to directly affect the reproducibility of the issue in this ticket, but they may be related, given the features added in that beta release.

settings:
  singleNamespace: true
  disableKubernetesDestinations: true
  integrations:
    consul:
      httpAddress: http://consul-consul-server.consul.svc:8500
      dnsAddress: kube-dns.kube-system.svc:53
      serviceDiscovery: {}
    consulUpstreamDiscovery:
      consistencyMode: ConsistentMode
      edsBlockingQueries: true
      queryOptions:
        useCache: false
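
For context, this is roughly how those values are expected to surface on the installed Settings resource (a sketch, assuming the chart maps integrations.consulUpstreamDiscovery onto the resource's consulDiscovery block; verify against your own install):

# inspect the consul upstream discovery block (sketch)
kubectl get settings.gloo.solo.io default -n my-namespace -o jsonpath='{.spec.consulDiscovery}'
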
github-actions[bot] commented 3 months ago

This issue has been marked as stale because of no activity in the last 180 days. It will be closed in the next 180 days unless it is tagged "no stalebot" or other activity occurs.