hashicorp / consul-helm

Helm chart to install Consul and other associated components.
Mozilla Public License 2.0

Consul ingress gateways not starting after chart upgrade #971

Closed rrondeau closed 3 years ago

rrondeau commented 3 years ago

Overview of the Issue

Consul ingress gateways not starting after helm upgrade with values

Reproduction Steps

Steps to reproduce this issue:

  1. When running helm install with the following values.yml:

    global:
      domain: "consul"
      datacenter: "dc1"
      image: "docker.io/consul:1.9.5"
      imageK8S: "docker.io/hashicorp/consul-k8s:0.25.0"
      imageEnvoy: "envoyproxy/envoy-alpine:v1.16.3"

.... ....

    connectInject:
      enabled: true
      healthChecks:
        enabled: false
      k8sAllowNamespaces: [........]
      centralConfig:
        enabled: true
      resources:
        requests:
          memory: "100Mi"
          cpu: "50m"
        limits:
          memory: "150Mi"
          cpu: "50m"

    ingressGateways:
      enabled: true
      gateways:

No logs from the container :/

Environment details


Additional Context

After some heavy debugging, the start command seems to get stuck when Consul creates a pipe to forward the bootstrap config to Envoy via `consul connect envoy ...`
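If the hang really is in that pipe hand-off, the blocking behaviour of FIFOs would explain the symptom: a writer opening a named pipe blocks until a reader opens the other end. A minimal, stand-alone sketch of that mechanism (no Consul or Envoy involved):

```shell
# A process writing to a FIFO blocks until a reader opens the other end.
# If the reader never shows up, the writer hangs forever -- which would
# match the "stuck with no logs" symptom described above.
pipe=$(mktemp -u)
mkfifo "$pipe"
( echo "bootstrap config" > "$pipe" ) &   # writer blocks until a reader opens the pipe
cat "$pipe"                               # reader arrives; writer unblocks and exits
rm "$pipe"
```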

To fix the issue I found a workaround: I replaced the pod command from:

command:
  - /consul-bin/consul
  - connect 
  - envoy
  - -gateway=ingress
  - -proxy-id=$(POD_NAME)
  - -address=$(POD_IP):21000

to:

command:
  - sh
  - -ec
  - |
    /consul-bin/consul connect envoy -gateway=ingress -proxy-id=$(POD_NAME) -address=$(POD_IP):21000 -bootstrap > /tmp/bootstrap.json
    /usr/local/bin/envoy --config-path /tmp/bootstrap.json --disable-hot-restart

If you have any hints, I will be happy to test anything.
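The pattern in the workaround above, generating the config into a regular file and only then starting the consumer, can be sketched in isolation with stand-in commands (`echo` plays `consul connect envoy ... -bootstrap`, `cat` plays `envoy --config-path`; nothing here is Consul-specific):

```shell
# Workaround pattern: write the bootstrap config to a regular file first,
# then start the consumer from that file. Writes to regular files never
# block waiting for a reader, so this sidesteps any pipe-related hang.
cfg=$(mktemp)
echo '{"admin": {}}' > "$cfg"    # stand-in for: consul connect envoy ... -bootstrap
cat "$cfg"                       # stand-in for: envoy --config-path "$cfg"
rm "$cfg"
```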

david-yu commented 3 years ago

Hi @rrondeau, would it be possible to run a `kubectl describe pod` on the two failing ingress gateway pods? Wondering if there are any hints there.

consul-back-ingress-gateway-675659dd45-chmzf                  1/2     CrashLoopBackOff   13         30m
consul-front-ingress-gateway-79995bb7b4-87kmv                 1/2     CrashLoopBackOff   9          15m
rrondeau commented 3 years ago

Sorry for the delay :/

Here is a describe of one of the pods failing without my workaround:


Namespace:    consul
Priority:     0
Node:         gke-sue-gke-cluster0-sue-gke-cluster0-7e4ade2d-f4bz/10.100.100.61
Start Time:   Wed, 02 Jun 2021 10:19:03 +0000
Labels:       app=consul
              app.kubernetes.io/managed-by=spinnaker
              app.kubernetes.io/name=consul
              chart=consul-helm
              component=ingress-gateway
              heritage=Helm
              ingress-gateway-name=consul-back-ingress-gateway
              pod-template-hash=89c4f8559
              release=consul
Annotations:  artifact.spinnaker.io/location: consul
              artifact.spinnaker.io/name: consul-back-ingress-gateway
              artifact.spinnaker.io/type: kubernetes/deployment
              artifact.spinnaker.io/version: 
              cni.projectcalico.org/podIP: 10.110.3.40/32
              consul.hashicorp.com/connect-inject: false
              moniker.spinnaker.io/application: consul
              moniker.spinnaker.io/cluster: deployment consul-back-ingress-gateway
Status:       Running
IP:           10.110.3.40
IPs:
  IP:           10.110.3.40
Controlled By:  ReplicaSet/consul-back-ingress-gateway-89c4f8559
Init Containers:
  copy-consul-bin:
    Container ID:  docker://091a9181705914a170a656aac0c5eedb9b6e65287f2bbe7a3771b3b9d68fb015
    Image:         docker.io/hashicorp/consul:1.9.5
    Image ID:      docker-pullable://hashicorp/consul@sha256:35f1bdb6c516a4fae6e4b056b0d4e9ddd0a3874efc43fc0dc8db49ef2b5d4442
    Port:          <none>
    Host Port:     <none>
    Command:
      cp
      /bin/consul
      /consul-bin/consul
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 02 Jun 2021 10:19:08 +0000
      Finished:     Wed, 02 Jun 2021 10:19:11 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     50m
      memory:  150Mi
    Requests:
      cpu:        50m
      memory:     25Mi
    Environment:  <none>
    Mounts:
      /consul-bin from consul-bin (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from consul-back-ingress-gateway-token-dkwqt (ro)
  service-init:
    Container ID:  docker://f2b3ab36229231575cb41be2f37ed593c4db218bf56758d132017ffb3b233243
    Image:         hashicorp/consul-k8s:0.25.0
    Image ID:      docker-pullable://hashicorp/consul-k8s@sha256:66a1dfd964e9a8fe2477803462fd08cb83744a65f2b8083e1c51c580f6930c7d
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
      -ec
      consul-k8s service-address \
        -k8s-namespace=consul \
        -name=consul-back-ingress-gateway \
        -output-file=/tmp/address.txt
      WAN_ADDR="$(cat /tmp/address.txt)"
      WAN_PORT=8080

      cat > /consul/service/service.hcl << EOF
      service {
        kind = "ingress-gateway"
        name = "back-ingress-gateway"
        id = "${POD_NAME}"
        port = ${WAN_PORT}
        address = "${WAN_ADDR}"
        tagged_addresses {
          lan {
            address = "${POD_IP}"
            port = 21000
          }
          wan {
            address = "${WAN_ADDR}"
            port = ${WAN_PORT}
          }
        }
        proxy {
          config {
            envoy_gateway_no_default_bind = true
            envoy_gateway_bind_addresses {
              all-interfaces {
                address = "0.0.0.0"
              }
            }
          }
        }
        checks = [
          {
            name = "Ingress Gateway Listening"
            interval = "10s"
            tcp = "${POD_IP}:21000"
            deregister_critical_service_after = "6h"
          }
        ]
      }
      EOF

      /consul-bin/consul services register \
        /consul/service/service.hcl

    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 02 Jun 2021 10:19:13 +0000
      Finished:     Wed, 02 Jun 2021 10:19:18 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     50m
      memory:  50Mi
    Requests:
      cpu:     50m
      memory:  50Mi
    Environment:
      HOST_IP:            (v1:status.hostIP)
      POD_IP:             (v1:status.podIP)
      POD_NAME:          consul-back-ingress-gateway-89c4f8559-9cxsq (v1:metadata.name)
      CONSUL_HTTP_ADDR:  http://$(HOST_IP):8500
    Mounts:
      /consul-bin from consul-bin (rw)
      /consul/service from consul-service (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from consul-back-ingress-gateway-token-dkwqt (ro)
Containers:
  ingress-gateway:
    Container ID:  docker://d856c727faa63bd588e91f161207d60f5a524eeba3df20b66bf0a24d6c373289
    Image:         envoyproxy/envoy-alpine:v1.16.3
    Image ID:      docker-pullable://envoyproxy/envoy-alpine@sha256:a11d7329678617c1b29ee28392c76c6ac00ecc55266b0866b7c99f6a7717f9a6
    Ports:         21000/TCP, 8080/TCP, 8443/TCP, 9102/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP, 0/TCP
    Command:
      /consul-bin/consul
      connect
      envoy
      -gateway=ingress
      -proxy-id=$(POD_NAME)
      -address=$(POD_IP):21000
    State:          Running
      Started:      Wed, 02 Jun 2021 10:24:48 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 02 Jun 2021 10:23:57 +0000
      Finished:     Wed, 02 Jun 2021 10:23:58 +0000
    Ready:          False
    Restart Count:  5
    Limits:
      cpu:     100m
      memory:  100Mi
    Requests:
      cpu:      100m
      memory:   100Mi
    Liveness:   tcp-socket :21000 delay=30s timeout=5s period=10s #success=1 #failure=3
    Readiness:  tcp-socket :21000 delay=10s timeout=5s period=10s #success=1 #failure=3
    Environment:
      HOST_IP:            (v1:status.hostIP)
      POD_IP:             (v1:status.podIP)
      POD_NAME:          consul-back-ingress-gateway-89c4f8559-9cxsq (v1:metadata.name)
      CONSUL_HTTP_ADDR:  http://$(HOST_IP):8500
      CONSUL_GRPC_ADDR:  $(HOST_IP):8502
    Mounts:
      /consul-bin from consul-bin (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from consul-back-ingress-gateway-token-dkwqt (ro)
  consul-sidecar:
    Container ID:  docker://617a2892d93fc1acdd963d4e01c3d8e58c145de92c773fc6d2f53051881caf6b
    Image:         hashicorp/consul-k8s:0.25.0
    Image ID:      docker-pullable://hashicorp/consul-k8s@sha256:66a1dfd964e9a8fe2477803462fd08cb83744a65f2b8083e1c51c580f6930c7d
    Port:          <none>
    Host Port:     <none>
    Command:
      consul-k8s
      consul-sidecar
      -service-config=/consul/service/service.hcl
      -consul-binary=/consul-bin/consul
    State:          Running
      Started:      Wed, 02 Jun 2021 10:19:22 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     20m
      memory:  50Mi
    Requests:
      cpu:     20m
      memory:  25Mi
    Environment:
      HOST_IP:            (v1:status.hostIP)
      POD_IP:             (v1:status.podIP)
      CONSUL_HTTP_ADDR:  http://$(HOST_IP):8500
    Mounts:
      /consul-bin from consul-bin (rw)
      /consul/service from consul-service (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from consul-back-ingress-gateway-token-dkwqt (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  consul-bin:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  consul-service:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  consul-back-ingress-gateway-token-dkwqt:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  consul-back-ingress-gateway-token-dkwqt
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  6m27s                  default-scheduler  Successfully assigned consul/consul-back-ingress-gateway-89c4f8559-9cxsq to gke-sue-gke-cluster0-sue-gke-cluster0-7e4ade2d-f4bz
  Normal   Pulled     6m23s                  kubelet            Container image "docker.io/hashicorp/consul:1.9.5" already present on machine
  Normal   Created    6m23s                  kubelet            Created container copy-consul-bin
  Normal   Started    6m22s                  kubelet            Started container copy-consul-bin
  Normal   Pulling    6m19s                  kubelet            Pulling image "hashicorp/consul-k8s:0.25.0"
  Normal   Pulled     6m17s                  kubelet            Successfully pulled image "hashicorp/consul-k8s:0.25.0" in 1.871122892s
  Normal   Created    6m17s                  kubelet            Created container service-init
  Normal   Started    6m17s                  kubelet            Started container service-init
  Normal   Pulling    6m11s                  kubelet            Pulling image "envoyproxy/envoy-alpine:v1.16.3"
  Normal   Pulled     6m10s                  kubelet            Successfully pulled image "envoyproxy/envoy-alpine:v1.16.3" in 1.708508616s
  Normal   Created    6m9s                   kubelet            Created container consul-sidecar
  Normal   Started    6m9s                   kubelet            Started container ingress-gateway
  Normal   Pulled     6m9s                   kubelet            Container image "hashicorp/consul-k8s:0.25.0" already present on machine
  Normal   Created    6m9s                   kubelet            Created container ingress-gateway
  Normal   Started    6m8s                   kubelet            Started container consul-sidecar
  Warning  Unhealthy  5m14s (x3 over 5m34s)  kubelet            Liveness probe failed: dial tcp 10.110.3.40:21000: connect: connection refused
  Normal   Killing    5m14s                  kubelet            Container ingress-gateway failed liveness probe, will be restarted
  Warning  Unhealthy  5m9s (x6 over 5m59s)   kubelet            Readiness probe failed: dial tcp 10.110.3.40:21000: connect: connection refused
  Normal   Pulled     5m3s                   kubelet            Container image "envoyproxy/envoy-alpine:v1.16.3" already present on machine
  Warning  BackOff    70s (x4 over 91s)      kubelet            Back-off restarting failed container
rrondeau commented 3 years ago

My issue is getting strange. I left the broken deployment in place and it was up and running this morning. My Kubernetes node pool is destroyed and recreated every night/morning. I tried a rollout restart: failing again, no logs.

rrondeau commented 3 years ago

Just saw https://github.com/hashicorp/consul/pull/10324. Waiting for the release to test whether it solves my issue.

david-yu commented 3 years ago

Thanks! The 1.9.6 release just landed: https://github.com/hashicorp/consul/releases/tag/v1.9.6. I wonder if this will solve the issue as well. I am a little at a loss as to what the issue might be here.

rrondeau commented 3 years ago

Fixed by https://github.com/hashicorp/consul/pull/10324

Thanks !