hashicorp / consul-helm

Helm chart to install Consul and other associated components.
Mozilla Public License 2.0

consul-connect-injector-webhook can not be restarted #874

Closed enuoCM closed 3 years ago

enuoCM commented 3 years ago

### Overview of the Issue

After the node running the consul-connect-injector-webhook pod went down, consul-connect-injector-webhook could not be restarted and failed with the following error:

### Reproduction Steps

Steps to reproduce this issue:

  1. Run `helm install` with the following `values.yml` (see the install sketch below):

         global:
           domain: consul
           datacenter: dc1
           tls:
             caCert:
               secretName: consul-ca-cert
               secretKey: tls.crt
             caKey:
               secretName: consul-ca-key
               secretKey: tls.key
         server:
           replicas: 3
           bootstrapExpect: 3
         connectInject:
           enabled: true
         controller:
           enabled: true

  2. View the error.
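
A minimal install sketch matching the steps above; the release name, chart source, and values filename are assumptions rather than details from the original report:

```sh
# Hypothetical commands; release name and chart repo are assumptions.
helm repo add hashicorp https://helm.releases.hashicorp.com
helm install consul hashicorp/consul -f values.yml
```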

### Logs
<details>
  <summary>Logs</summary>

Listening on ":8080"...
Updated certificate bundle received. Updating certs...
2021/03/19 08:39:18 http: TLS handshake error from 10.244.2.1:48212: No certificate available.
I0319 08:39:28.406623 1 trace.go:201] Trace[74037608]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.18.6/tools/cache/reflector.go:125 (19-Mar-2021 08:39:00.607) (total time: 10284ms):
Trace[74037608]: ---"Objects listed" 10184ms (08:39:00.307)
Trace[74037608]: [10.284341088s] [10.284341088s] END
I0319 08:39:28.609485 1 trace.go:201] Trace[273790779]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.18.6/tools/cache/reflector.go:125 (19-Mar-2021 08:39:00.691) (total time: 10402ms):
Trace[273790779]: ---"Objects listed" 10402ms (08:39:00.608)
Trace[273790779]: [10.402786534s] [10.402786534s] END



</details>

### Expected behavior

consul-connect-injector-webhook can be restarted

### Environment details

If not already included, please provide the following:
- `consul-k8s` version: "0.24.0"
- `consul-helm` version:  "0.30.0"
kschoche commented 3 years ago

Hi @enuoCM thanks for filing this issue! Could you provide a bit more information here about your setup? What type of k8s provider are you using and what type of nodes? Are these nodes under heavy utilization? Are other pods having issues? It looks like the k8s cluster is unhealthy in general based on the client-go messages.

enuoCM commented 3 years ago

Hi @kschoche I used k8s v1.18 with several virtual machines running Ubuntu 16.04.3 LTS as the nodes. When this issue happened, the nodes other than the dead node were not under heavy utilization and the other consul pods worked fine. I deleted the injector-webhook pod; it was redeployed to another node but still failed. I had to uninstall and reinstall consul completely to recover from this issue. The issue is not happening now, so I'll close it for the time being. If I hit it again, I'll post more information. Thanks for your response.

enuoCM commented 3 years ago

I have to reopen this issue.

new log:

Listening on ":8080"...
Error loading TLS keypair: tls: failed to find any PEM data in certificate input
Updated certificate bundle received. Updating certs...
2021-04-12T13:27:20.771Z [ERROR] healthCheckResource: unable to update pod: err="unable to get agent health checks: serviceID=***, checkID=***/kubernetes-health-check, getting check "***/kubernetes-health-check": Get "https://10.12.32.111:8501/v1/agent/checks?filter=CheckID+%3D%3D+%60***%2Fkubernetes-health-check%60": dial tcp 10.12.32.111:8501: connect: no route to host"
......
2021-04-12T13:28:40.308Z [ERROR] healthCheckResource: unable to update pod: err="unable to get agent health checks: serviceID=***, checkID=***/kubernetes-health-check, getting check "***/kubernetes-health-check": Get "https://10.12.32.108:8501/v1/agent/checks?filter=CheckID+%3D%3D+%60***%2Fkubernetes-health-check%60": dial tcp 10.12.32.108:8501: i/o timeout"
......
2021-04-12T13:28:53.213Z [ERROR] healthCheckController: failed processing item, retrying: key=*** error="unable to get agent health checks: serviceID=**, checkID=***/kubernetes-health-check, getting check "***/kubernetes-health-check": Get "https://10.12.32.109:8501/v1/agent/checks?filter=CheckID+%3D%3D+%60***%2Fkubernetes-health-check%60": write tcp 10.244.6.23:34316->10.12.32.109:8501: write: broken pipe"
terminated received, shutting down
Error listening: http: Server closed
2021-04-12T13:29:04.439Z [INFO]  cleanupResource: received stop signal, shutting down
2021/04/12 13:29:04 [ERROR] helper/cert: error loading next cert: context canceled
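
For reference, the failing agent endpoint in these errors can be probed directly; a minimal check, where the agent IP comes from the log above and the CA file path is an assumption based on the injector pod spec shown later in this thread:

```sh
# Hypothetical connectivity check against the Consul client agent's HTTPS API.
curl --cacert /consul/tls/ca/tls.crt \
  "https://10.12.32.111:8501/v1/agent/checks"
```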

pod status:

NAME                                                          READY   STATUS    RESTARTS   AGE
consul-cb5dl                                                  1/1     Running   1          20d
consul-cbn7z                                                  1/1     Running   0          24d
consul-connect-injector-webhook-deployment-77dc7c59c7-8k9m5   1/1     Running   32         137m
consul-controller-74d74c4f8d-7njx6                            1/1     Running   108        24d
consul-d5j5g                                                  1/1     Running   1          20d
consul-h946v                                                  1/1     Running   0          24d
consul-jp2d8                                                  1/1     Running   0          17d
consul-l7r9b                                                  1/1     Running   4          24d
consul-n9jqp                                                  1/1     Running   0          24d
consul-nbfkm                                                  1/1     Running   1          24d
consul-qhjhv                                                  1/1     Running   7          20d
consul-rvtsk                                                  1/1     Running   0          20d
consul-server-0                                               1/1     Running   1          24d
consul-server-1                                               1/1     Running   0          24d
consul-server-2                                               1/1     Running   0          14d
consul-vdqg4                                                  1/1     Running   2          24d
consul-webhook-cert-manager-7d5f886775-kl684                  1/1     Running   0          24d
enuoCM commented 3 years ago

current pod status:

NAME                                                          READY   STATUS    RESTARTS   AGE
consul-cb5dl                                                  1/1     Running   1          20d
consul-cbn7z                                                  1/1     Running   0          24d
consul-connect-injector-webhook-deployment-77dc7c59c7-8k9m5   1/1     Running   217        15h
consul-controller-74d74c4f8d-7njx6                            1/1     Running   109        24d
consul-d5j5g                                                  1/1     Running   1          20d
consul-h946v                                                  1/1     Running   0          24d
consul-jp2d8                                                  1/1     Running   0          18d
consul-l7r9b                                                  1/1     Running   4          24d
consul-n9jqp                                                  1/1     Running   0          24d
consul-nbfkm                                                  1/1     Running   1          24d
consul-qhjhv                                                  1/1     Running   7          20d
consul-rvtsk                                                  1/1     Running   0          20d
consul-server-0                                               1/1     Running   1          24d
consul-server-1                                               1/1     Running   0          24d
consul-server-2                                               1/1     Running   0          14d
consul-vdqg4                                                  1/1     Running   2          24d
consul-webhook-cert-manager-7d5f886775-kl684                  1/1     Running   0          24d

BTW, new service pods can be auto-injected while consul-connect-injector-webhook is in 1/1 Ready status. That 1/1 Running status may last a minute or even less.

HofmannZ commented 3 years ago

> Hi @kschoche I used k8s v1.18 with several virtual machines running Ubuntu 16.04.3 LTS as the nodes. When this issue happened, the nodes other than the dead node were not under heavy utilization and the other consul pods worked fine. I deleted the injector-webhook pod; it was redeployed to another node but still failed. I had to uninstall and reinstall consul completely to recover from this issue. The issue is not happening now, so I'll close it for the time being. If I hit it again, I'll post more information. Thanks for your response.

We are facing the same issue.

We have been running Consul in our cluster in production for a while, but as soon as the consul-connect-injector-webhook-deployment restarts it fails to come back up. (We first saw this when our node autoscaler had to reschedule it, but it can be replicated with a simple `kubectl rollout restart`, as sketched below.)
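
A minimal reproduction sketch; the deployment name is from this thread, and the namespace placeholder is an assumption:

```sh
# Replace <namespace> with the namespace of your Consul install.
kubectl rollout restart deployment/consul-connect-injector-webhook-deployment -n <namespace>
```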

Reinstalling Consul did solve the issue, for now... but I can guarantee that it will go down again the next time it needs to reschedule.

Any ideas what's causing it?

HofmannZ commented 3 years ago

CC @kschoche

HofmannZ commented 3 years ago

Found out what's happening to us:

> Admission webhooks and custom resource conversion webhooks using invalid serving certificates that do not contain the server name in a subjectAltName extension cannot be contacted by the Kubernetes API server in 1.19 prior to version 1.19.9-gke.400. This will be resolved in version 1.19.9-gke.400, and automatic upgrades from 1.18 to 1.19 will not begin until this issue is resolved. However, affected webhooks should work to correct their serving certificates in order to work correctly with Kubernetes version 1.22 and later.

Source: https://cloud.google.com/kubernetes-engine/docs/release-notes#119_ga

lkysow commented 3 years ago

@HofmannZ that's weird because we do set the SAN in our webhook certs AFAICT. Could you open up another issue about this so we can investigate?

Here I see the SANs are correct using the latest release:

Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            74:24:ca:0d:e2:2b:af:1e:d9:ec:88:1e:91:95:87:02:12:1d:5c:32
    Signature Algorithm: ecdsa-with-SHA256
        Issuer: C=US, ST=CA, L=San Francisco/street=101 Second Street/postalCode=94105, O=HashiCorp Inc., CN=Connect Inject CA
        Validity
            Not Before: Apr 14 16:35:47 2021 GMT
            Not After : Apr 15 16:36:47 2021 GMT
        Subject: CN=Connect Inject Service
        Subject Public Key Info:
            Public Key Algorithm: id-ecPublicKey
                Public-Key: (256 bit)
                pub:
                    04:d2:5f:50:a3:9e:60:3e:93:0b:75:97:c2:26:ae:
                    7a:31:a6:5b:51:1d:a3:8f:fb:b6:25:99:a3:3f:ee:
                    2f:e3:f1:b9:d1:6d:84:f8:f4:1f:e3:ea:e3:ea:1b:
                    6e:e2:de:58:ad:47:4c:3f:22:5e:6b:70:09:c2:15:
                    6c:2b:0c:03:68
                ASN1 OID: prime256v1
                NIST CURVE: P-256
        X509v3 extensions:
            X509v3 Key Usage: critical
                Digital Signature, Key Encipherment
            X509v3 Extended Key Usage:
                TLS Web Server Authentication, TLS Web Client Authentication
            X509v3 Basic Constraints: critical
                CA:FALSE
            X509v3 Authority Key Identifier:
                keyid:66:65:3A:39:32:3A:64:64:3A:30:31:3A:35:33:3A:32:36:3A:32:32:3A:35:32:3A:37:39:3A:34:35:3A:30:34:3A:39:32:3A:66:33:3A:63:38:3A:33:36:3A:32:36:3A:38:34:3A:65:62:3A:31:64:3A:38:38:3A:33:38:3A:37:32:3A:62:31:3A:62:34:3A:62:37:3A:33:31:3A:62:66:3A:33:38:3A:39:34:3A:66:32:3A:30:30:3A:64:62

            X509v3 Subject Alternative Name:
                DNS:consul-consul-connect-injector-svc, DNS:consul-consul-connect-injector-svc.default, DNS:consul-consul-connect-injector-svc.default.svc
    Signature Algorithm: ecdsa-with-SHA256
         30:45:02:20:54:52:c2:d5:04:e3:7e:2b:48:48:38:9a:39:3d:
         bd:ba:b1:9b:b6:cf:17:d5:54:00:74:92:40:fd:51:40:0c:ed:
         02:21:00:d8:63:f4:20:87:eb:d2:e3:44:80:ce:4a:86:04:16:
         2d:31:53:f3:93:53:ce:d5:7c:13:0f:40:80:40:7b:90:c4
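
For reference, a dump like the one above can be produced by connecting to the webhook service and decoding its serving certificate; a sketch, where the service DNS name and namespace are taken from the SANs above and port 443 is an assumption:

```sh
# Dump the serving cert presented by the webhook service.
echo | openssl s_client -connect consul-consul-connect-injector-svc.default.svc:443 2>/dev/null \
  | openssl x509 -noout -text
```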
lkysow commented 3 years ago

@enuoCM can you run a `kubectl describe` on those pods? I'm particularly interested in the events. I wonder if they're being restarted due to failing liveness probes, which may actually be fixed by https://github.com/hashicorp/consul-helm/pull/885
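
For example, something like this (the pod name is from the status output above; the namespace placeholder is an assumption):

```sh
kubectl describe pod consul-connect-injector-webhook-deployment-77dc7c59c7-8k9m5 -n <namespace>
```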

HofmannZ commented 3 years ago

@lkysow I think you're right; I'll open a new issue for it.

HofmannZ commented 3 years ago

@lkysow opened issue #911.

enuoCM commented 3 years ago

> @enuoCM can you run a `kubectl describe` on those pods? I'm particularly interested in the events. I wonder if they're being restarted due to failing liveness probes, which may actually be fixed by #885

@lkysow The day before yesterday I manually edited the injector deployment to remove the livenessProbe and readinessProbe as a test. The modified injector pod still kept restarting, but at a lower frequency. I deleted that pod today before I saw your post. Strangely, the new injector pod without livenessProbe and readinessProbe hasn't restarted so far.

NAME                                                          READY   STATUS    RESTARTS   AGE
consul-connect-injector-webhook-deployment-779d668f65-nnrc9   1/1     Running   0          54m

However, we have another dev env with the original injector pod, which has restarted three times. Here is the status:

NAME                                                          READY   STATUS    RESTARTS   AGE
consul-connect-injector-webhook-deployment-77dc7c59c7-52cjx   1/1     Running   3          4d3h

kubectl describe result:

Name:         consul-connect-injector-webhook-deployment-77dc7c59c7-52cjx
Namespace:    baas-consul
Priority:     0
Node:         k8s-node8/192.168.100.20
Start Time:   Mon, 12 Apr 2021 10:45:44 +0800
Labels:       app=consul
              chart=consul-helm
              component=connect-injector
              pod-template-hash=77dc7c59c7
              release=baas
Annotations:  consul.hashicorp.com/connect-inject: false
Status:       Running
IP:           10.244.14.6
IPs:
  IP:           10.244.14.6
Controlled By:  ReplicaSet/consul-connect-injector-webhook-deployment-77dc7c59c7
Init Containers:
  get-auto-encrypt-client-ca:
    Container ID:  docker://386175315da35ed334ff7de929e49c46c9cc187bfbcfd9958123bef8fe384eff
    Image:         consul-k8s:0.24.0
    Image ID:      docker-pullable://hashicorp/consul-k8s@sha256:aab52eed946801f8b3ed25e0b5853475cb35ac38ab372fadb61ce8af8f372663
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
      -ec
      consul-k8s get-consul-client-ca \
        -output-file=/consul/tls/client/ca/tls.crt \
        -server-addr=consul-server \
        -server-port=8501 \
        -ca-file=/consul/tls/ca/tls.crt

    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 12 Apr 2021 11:06:13 +0800
      Finished:     Mon, 12 Apr 2021 11:06:24 +0800
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     50m
      memory:  50Mi
    Requests:
      cpu:        50m
      memory:     50Mi
    Environment:  <none>
    Mounts:
      /consul/tls/ca from consul-ca-cert (rw)
      /consul/tls/client/ca from consul-auto-encrypt-ca-cert (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from consul-connect-injector-webhook-svc-account-token-lffzk (ro)
Containers:
  sidecar-injector:
    Container ID:  docker://8114493d7aa2e1345e5ee702bb0c28c41d5b3bb5779ae492a35725a4ea6c0e17
    Image:         consul-k8s:0.24.0
    Image ID:      docker-pullable://hashicorp/consul-k8s@sha256:aab52eed946801f8b3ed25e0b5853475cb35ac38ab372fadb61ce8af8f372663
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
      -ec
      CONSUL_FULLNAME="consul"

      consul-k8s inject-connect \
        -default-inject=false \
        -consul-image="hashicorp/consul:1.9.3" \
        -envoy-image="hashicorp/envoy-alpine:1.16.0" \
        -consul-k8s-image="consul-k8s:0.24.0" \
        -listen=:8080 \
        -log-level=info \
        -enable-health-checks-controller=true \
        -health-checks-reconcile-period=1m \
        -cleanup-controller-reconcile-period=5m \
        -allow-k8s-namespace="*" \
        -tls-auto=${CONSUL_FULLNAME}-connect-injector-cfg \
        -tls-auto-hosts=${CONSUL_FULLNAME}-connect-injector-svc,${CONSUL_FULLNAME}-connect-injector-svc.${NAMESPACE},${CONSUL_FULLNAME}-connect-injector-svc.${NAMESPACE}.svc \
        -init-container-memory-limit=150Mi \
        -init-container-memory-request=25Mi \
        -init-container-cpu-limit=50m \
        -init-container-cpu-request=50m \
        -consul-sidecar-memory-limit=50Mi \
        -consul-sidecar-memory-request=25Mi \
        -consul-sidecar-cpu-limit=20m \
        -consul-sidecar-cpu-request=20m \

    State:          Running
      Started:      Mon, 12 Apr 2021 14:55:55 +0800
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 12 Apr 2021 14:55:21 +0800
      Finished:     Mon, 12 Apr 2021 14:55:25 +0800
    Ready:          True
    Restart Count:  3
    Limits:
      cpu:     50m
      memory:  50Mi
    Requests:
      cpu:      50m
      memory:   50Mi
    Liveness:   http-get https://:8080/health/ready delay=1s timeout=5s period=2s #success=1 #failure=2
    Readiness:  http-get https://:8080/health/ready delay=2s timeout=5s period=2s #success=1 #failure=2
    Environment:
      NAMESPACE:         baas-consul (v1:metadata.namespace)
      CONSUL_CACERT:     /consul/tls/ca/tls.crt
      HOST_IP:            (v1:status.hostIP)
      CONSUL_HTTP_ADDR:  https://$(HOST_IP):8501
    Mounts:
      /consul/tls/ca from consul-auto-encrypt-ca-cert (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from consul-connect-injector-webhook-svc-account-token-lffzk (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  consul-ca-cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  consul-ca-cert
    Optional:    false
  consul-auto-encrypt-ca-cert:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  consul-connect-injector-webhook-svc-account-token-lffzk:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  consul-connect-injector-webhook-svc-account-token-lffzk
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:          <none>
enuoCM commented 3 years ago

@lkysow the injector pod (consul-connect-injector-webhook-deployment-779d668f65-nnrc9, without livenessProbe and readinessProbe) was killed by OOM after running 6h5m, and the replacement pod keeps restarting:

NAME                                                          READY   STATUS             RESTARTS   AGE 
consul-connect-injector-webhook-deployment-779d668f65-jbh69   0/1     CrashLoopBackOff   443        2d6h

kubectl describe result:

Name:         consul-connect-injector-webhook-deployment-779d668f65-jbh69
Namespace:    baas-consul
Priority:     0
Node:         lpc-node2/192.168.240.193
Start Time:   Sat, 17 Apr 2021 03:37:55 +0800
Labels:       app=consul
              chart=consul-helm
              component=connect-injector
              pod-template-hash=779d668f65
              release=baas
Annotations:  consul.hashicorp.com/connect-inject: false
Status:       Running
IP:           10.244.2.250
IPs:
  IP:           10.244.2.250
Controlled By:  ReplicaSet/consul-connect-injector-webhook-deployment-779d668f65
Init Containers:
  get-auto-encrypt-client-ca:
    Container ID:  docker://acbe8626c0f5898642546144be58fcee21e07da55b3dd5c9a86e1ccc3895f645
    Image:         hashicorp/consul-k8s:0.24.0
    Image ID:      docker-pullable://hashicorp/consul-k8s@sha256:62311934fae90a0deaa39702725c1b9b0aaa5c18e4a82127d8ac86e28786934e
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
      -ec
      consul-k8s get-consul-client-ca \
        -output-file=/consul/tls/client/ca/tls.crt \
        -server-addr=consul-server \
        -server-port=8501 \
        -ca-file=/consul/tls/ca/tls.crt

    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sat, 17 Apr 2021 03:38:10 +0800
      Finished:     Sat, 17 Apr 2021 03:38:12 +0800
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     50m
      memory:  50Mi
    Requests:
      cpu:        50m
      memory:     50Mi
    Environment:  <none>
    Mounts:
      /consul/tls/ca from consul-ca-cert (rw)
      /consul/tls/client/ca from consul-auto-encrypt-ca-cert (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from consul-connect-injector-webhook-svc-account-token-5cd5c (ro)
Containers:
  sidecar-injector:
    Container ID:  docker://e5a6848e6aff3be6b4ebe658501640fde0f13e2671e8d2f13e8ba994f8168955
    Image:         hashicorp/consul-k8s:0.24.0
    Image ID:      docker-pullable://hashicorp/consul-k8s@sha256:62311934fae90a0deaa39702725c1b9b0aaa5c18e4a82127d8ac86e28786934e
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
      -ec
      CONSUL_FULLNAME="consul"

      consul-k8s inject-connect \
        -default-inject=false \
        -consul-image="hashicorp/consul:1.9.3" \
        -envoy-image="hashicorp/envoy-alpine:1.16.0" \
        -consul-k8s-image="hashicorp/consul-k8s:0.24.0" \
        -listen=:8080 \
        -log-level=debug \
        -enable-health-checks-controller=true \
        -health-checks-reconcile-period=1m \
        -cleanup-controller-reconcile-period=5m \
        -allow-k8s-namespace="*" \
        -tls-auto=${CONSUL_FULLNAME}-connect-injector-cfg \
        -tls-auto-hosts=${CONSUL_FULLNAME}-connect-injector-svc,${CONSUL_FULLNAME}-connect-injector-svc.${NAMESPACE},${CONSUL_FULLNAME}-connect-injector-svc.${NAMESPACE}.svc \
        -init-container-memory-limit=150Mi \
        -init-container-memory-request=25Mi \
        -init-container-cpu-limit=50m \
        -init-container-cpu-request=50m \
        -consul-sidecar-memory-limit=50Mi \
        -consul-sidecar-memory-request=25Mi \
        -consul-sidecar-cpu-limit=20m \
        -consul-sidecar-cpu-request=20m \

    State:          Running
      Started:      Mon, 19 Apr 2021 09:48:32 +0800
    Ready:          True
    Restart Count:  444
    Limits:
      cpu:     50m
      memory:  50Mi
    Requests:
      cpu:     50m
      memory:  50Mi
    Environment:
      NAMESPACE:         baas-consul (v1:metadata.namespace)
      CONSUL_CACERT:     /consul/tls/ca/tls.crt
      HOST_IP:            (v1:status.hostIP)
      CONSUL_HTTP_ADDR:  https://$(HOST_IP):8501
    Mounts:
      /consul/tls/ca from consul-auto-encrypt-ca-cert (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from consul-connect-injector-webhook-svc-account-token-5cd5c (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  consul-ca-cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  consul-ca-cert
    Optional:    false
  consul-auto-encrypt-ca-cert:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  consul-connect-injector-webhook-svc-account-token-5cd5c:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  consul-connect-injector-webhook-svc-account-token-5cd5c
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason   Age                      From                Message
  ----     ------   ----                     ----                -------
  Normal   Pulled   58m (x436 over 2d6h)     kubelet, lpc-node2  Container image "hashicorp/consul-k8s:0.24.0" already present on machine
  Normal   Created  58m (x436 over 2d6h)     kubelet, lpc-node2  Created container sidecar-injector
  Normal   Started  58m (x436 over 2d6h)     kubelet, lpc-node2  Started container sidecar-injector
  Warning  BackOff  4m23s (x4002 over 2d6h)  kubelet, lpc-node2  Back-off restarting failed container

BTW, was the fix from #885 reverted by a later commit?

lkysow commented 3 years ago

> was killed by OOM after running 6h5m

Can you use kubectl logs with the -p (previous) flag so we can see what's causing it to crash loop?
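
For example (the pod name and namespace are from the describe output above):

```sh
# -p shows the logs of the previous (crashed) container instance.
kubectl logs -p consul-connect-injector-webhook-deployment-779d668f65-jbh69 -n baas-consul
```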

If it's OOM then you can increase the memory in the helm values file:

connectInject:
  resources:
    requests:
      memory: "50Mi"
      cpu: "50m"
    limits:
      memory: "50Mi"
      cpu: "50m"

I'm curious if you have a high load in your cluster with lots of services?

> BTW, was the fix from #885 reverted by a later commit?

Yes, sorry about that. We're going to add the probes back, but we had to remove them just for the beta.

enuoCM commented 3 years ago

> Can you use kubectl logs with the -p (previous) flag so we can see what's causing it to crash loop?

Sorry, consul-connect-injector-webhook-deployment-779d668f65-jbh69 (without livenessProbe and readinessProbe) was deleted yesterday. A new pod, consul-connect-injector-webhook-deployment-7c7b8798d5-m89pw (also without livenessProbe and readinessProbe), with the following settings has been running for 32h without restarting.

resources:
  limits:
    cpu: 1000m
    memory: 2Gi
  requests:
    cpu: 500m
    memory: 1Gi

Here is the previous log of consul-connect-injector-webhook-deployment-77dc7c59c7-52cjx (with the default config) in another cluster, which has only a few services. It looks more robust.

Listening on ":8080"...
2021/04/12 06:55:24 http: TLS handshake error from 10.244.14.1:23300: No certificate available.
Error loading TLS keypair: tls: failed to find any PEM data in certificate input
Updated certificate bundle received. Updating certs...
2021/04/12 06:55:25 http: TLS handshake error from 10.244.14.1:23304: No certificate available.
terminated received, shutting down
2021/04/12 06:55:25 [ERROR] helper/cert: error loading next cert: context canceled

> If it's OOM then you can increase the memory in the helm values file:
>
>     connectInject:
>       resources:
>         requests:
>           memory: "50Mi"
>           cpu: "50m"
>         limits:
>           memory: "50Mi"
>           cpu: "50m"

Thanks, I did this yesterday, and it (consul-connect-injector-webhook-deployment-7c7b8798d5-m89pw) has been running without restarting so far!

> I'm curious if you have a high load in your cluster with lots of services?

We have 60+ services currently. The number will grow in the future, maybe up to thousands. Is Consul OK with that number of services?

lkysow commented 3 years ago

> terminated received, shutting down

This is kube killing it. We've typically seen this due to the liveness probes. OOM would be a hard kill; I don't think it would send the process a signal. One way to check the last termination reason is sketched below.
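
A sketch for telling the two apart; the pod name and namespace are from earlier in this thread:

```sh
# Prints the last termination reason of the first container, e.g. "OOMKilled" or "Error".
kubectl get pod consul-connect-injector-webhook-deployment-779d668f65-jbh69 -n baas-consul \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
```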

> We have 60+ services currently. The number will grow in the future, maybe up to thousands. Is Consul OK with that number of services?

Yes, absolutely it's okay; however, the resources look like they need tweaking. We haven't yet done testing on recommended settings for larger numbers of services.

How many pods are you running in those 60+ services?

enuoCM commented 3 years ago

> This is kube killing it. We've typically seen this due to the liveness probes. OOM would be a hard kill; I don't think it would send the process a signal.

Yes, I agree. This is not the OOM log.

> How many pods are you running in those 60+ services?

One pod per service.

lkysow commented 3 years ago

Do you happen to have CPU/Memory graphs of the inject pod now that you've raised its limits?
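
Even without graphs, a point-in-time reading is possible; a sketch assuming metrics-server is installed, with the pod name and namespace from earlier in this thread:

```sh
# Requires metrics-server to be running in the cluster.
kubectl top pod consul-connect-injector-webhook-deployment-7c7b8798d5-m89pw -n baas-consul
```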

enuoCM commented 3 years ago

> Do you happen to have CPU/Memory graphs of the inject pod now that you've raised its limits?

No, I don't have graph statistics. We're considering installing third-party monitoring tools in the future. I'll share them here if I get some metrics.

david-yu commented 3 years ago

Hi @enuoCM, unfortunately we are still not able to reproduce this issue. Could you give our latest version of Consul K8s a look and see if this is still a problem?

david-yu commented 3 years ago

Hi @enuoCM, we will close this as we have not seen any feedback. Please re-open if this is still an issue on our latest version of Consul on Kubernetes!