hashicorp / vault-k8s

First-class support for Vault and Kubernetes.
Mozilla Public License 2.0

Agent injector CrashLoopBackOff when replicas>1 : TLS handshake error - no certificate available #388

Closed masterphenix closed 1 year ago

masterphenix commented 2 years ago

Describe the bug

I have deployed the vault-agent-injector using Helm, with auto-TLS and 2 replicas, and both pods go into CrashLoopBackOff. Logs show the following errors:

[...]
2022-10-06T08:12:19.903Z [INFO]  handler.certwatcher: Webhooks changed. Updating certs...
2022-10-06T08:12:19.903Z [WARN]  handler.certwatcher: Could not load TLS keypair: tls: failed to find any PEM data in certificate input. Trying again...
2022-10-06T08:12:23.176Z [WARN]  handler.certwatcher: Could not load TLS keypair: tls: failed to find any PEM data in certificate input. Trying again...
2022-10-06T08:12:25.029Z [ERROR] handler: http: TLS handshake error from 10.102.8.55:59704: no certificate available
2022-10-06T08:12:25.029Z [ERROR] handler: http: TLS handshake error from 10.102.8.55:59702: no certificate available
2022-10-06T08:12:27.029Z [ERROR] handler: http: TLS handshake error from 10.102.8.55:49730: no certificate available
2022-10-06T08:12:27.029Z [ERROR] handler: http: TLS handshake error from 10.102.8.55:49728: no certificate available
2022-10-06T08:12:27.031Z [ERROR] handler: http: TLS handshake error from 10.102.8.55:49732: no certificate available

A describe on the pods shows both liveness and readiness failing:

  Warning  Unhealthy         46m (x5 over 47m)      kubelet            Liveness probe failed: Get "https://10.102.8.59:8080/health/ready": remote error: tls: internal error
  Warning  Unhealthy         46m (x9 over 47m)      kubelet            Readiness probe failed: Get "https://10.102.8.59:8080/health/ready": remote error: tls: internal error

Also, the vault-injector-certs secret is of type "Opaque" and shows no data, which doesn't seem right.

To Reproduce

Steps to reproduce the behavior:

  1. Deploy vault agent injector using the vault Chart version 0.22.0, and the following values:
    injector:
      enabled: true
      replicas: 2
      leaderElector:
        enabled: true
      metrics:
        enabled: true
      image:
        repository: "hashicorp/vault-k8s"
        tag: "1.0.0"
        pullPolicy: IfNotPresent
      agentImage:
        repository: "hashicorp/vault"
        tag: "1.11.3"
      authPath: "auth/azure"
      certs:
        secretName: null
        caBundle: ""
        certName: tls.crt
        keyName: tls.key
  2. Logs in pods show the errors above

Application deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/instance: vault
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: vault-agent-injector
    component: webhook
    helm.toolkit.fluxcd.io/name: vault
    helm.toolkit.fluxcd.io/namespace: vault
  name: vault-agent-injector
  namespace: vault
spec:
  progressDeadlineSeconds: 600
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: vault
      app.kubernetes.io/name: vault-agent-injector
      component: webhook
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        aadpodidbinding: vault-binding
        app.kubernetes.io/instance: vault
        app.kubernetes.io/name: vault-agent-injector
        component: webhook
        maintainer: team-ops
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/instance: vault
                app.kubernetes.io/name: vault-agent-injector
                component: webhook
            topologyKey: kubernetes.io/hostname
      containers:
      - args:
        - agent-inject
        - 2>&1
        env:
        - name: AGENT_INJECT_LISTEN
          value: :8080
        - name: AGENT_INJECT_LOG_LEVEL
          value: info
        - name: AGENT_INJECT_VAULT_ADDR
          value: https://xxxxxx.hashicorp.cloud:8200/
        - name: AGENT_INJECT_VAULT_AUTH_PATH
          value: auth/azure
        - name: AGENT_INJECT_VAULT_IMAGE
          value: hashicorp/vault:1.11.3
        - name: AGENT_INJECT_TLS_AUTO
          value: vault-agent-injector-cfg
        - name: AGENT_INJECT_TLS_AUTO_HOSTS
          value: vault-agent-injector-svc,vault-agent-injector-svc.vault,vault-agent-injector-svc.vault.svc
        - name: AGENT_INJECT_LOG_FORMAT
          value: standard
        - name: AGENT_INJECT_REVOKE_ON_SHUTDOWN
          value: "false"
        - name: AGENT_INJECT_TELEMETRY_PATH
          value: /metrics
        - name: AGENT_INJECT_USE_LEADER_ELECTOR
          value: "true"
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: AGENT_INJECT_CPU_REQUEST
          value: 250m
        - name: AGENT_INJECT_CPU_LIMIT
          value: 500m
        - name: AGENT_INJECT_MEM_REQUEST
          value: 64Mi
        - name: AGENT_INJECT_MEM_LIMIT
          value: 128Mi
        - name: AGENT_INJECT_DEFAULT_TEMPLATE
          value: map
        - name: AGENT_INJECT_TEMPLATE_CONFIG_EXIT_ON_RETRY_FAILURE
          value: "true"
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        image: hashicorp/vault-k8s:1.0.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 2
          httpGet:
            path: /health/ready
            port: 8080
            scheme: HTTPS
          initialDelaySeconds: 5
          periodSeconds: 2
          successThreshold: 1
          timeoutSeconds: 5
        name: sidecar-injector
        readinessProbe:
          failureThreshold: 2
          httpGet:
            path: /health/ready
            port: 8080
            scheme: HTTPS
          initialDelaySeconds: 5
          periodSeconds: 2
          successThreshold: 1
          timeoutSeconds: 5
        resources:
          limits:
            cpu: 250m
            memory: 256Mi
          requests:
            cpu: 250m
            memory: 256Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 1000
        runAsGroup: 1000
        runAsNonRoot: true
        runAsUser: 100
      serviceAccount: vault-agent-injector
      serviceAccountName: vault-agent-injector
      terminationGracePeriodSeconds: 30

Expected behavior

Pods should start the same way as they do when replicas=1.

Environment

Additional context

This is what the vault-k8s-leader ConfigMap looks like:

apiVersion: v1
kind: ConfigMap
metadata:
  name: vault-k8s-leader
  namespace: vault
  ownerReferences:
  - apiVersion: v1
    kind: Pod
    name: vault-agent-injector-cdc566446-4kb8d
    uid: 11338b98-c923-46f3-8181-5e66cea4c71c

tvoran commented 2 years ago

Hi @masterphenix, sorry to hear you're running into trouble with multiple replicas! Those chart values look good to me, though with them I haven't been able to reproduce your issue yet. I wonder if the liveness and readiness timeouts are just too low for your system? It looks like the chart doesn't have those as configurable options yet, so if you can adjust them manually that might shed some light on what the problem is. You may also want to set injector.logLevel: debug to get more info in the logs.
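
As a minimal sketch of the logging suggestion above, the debug level could be set through the chart values as follows. The `injector.logLevel` key is the one named in this comment; that it ends up as the `AGENT_INJECT_LOG_LEVEL` environment variable seen in the reported Deployment is an assumption based on that manifest, not something confirmed in this thread.

```yaml
# Helm values fragment (sketch): raise the injector log level for troubleshooting.
injector:
  replicas: 2
  # Assumed to map to AGENT_INJECT_LOG_LEVEL in the rendered Deployment.
  logLevel: debug
```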

masterphenix commented 2 years ago

Hello @tvoran, thank you for having a look at this. I tried changing the liveness/readiness probes, and you were right: it started working. As a definitive solution, I have added a startupProbe to give the pods enough time to elect a leader and generate certificates.

tvoran commented 2 years ago

That's great to hear! Would you mind sharing what you changed/added to get it working? That will better inform how to go about adding support for this to the chart.

masterphenix commented 2 years ago

Sure, here is the startupProbe I added:

    startupProbe:
      failureThreshold: 12
      httpGet:
        path: /health/ready
        port: 8080
        scheme: HTTPS
      initialDelaySeconds: 5
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 5

Since I am using Flux, and deploying the injector with the vault helm chart, I used a kustomize postRenderer in the HelmRelease to add the probe:

  postRenderers:
  # Instruct helm-controller to use built-in "kustomize" post renderer.
  - kustomize:
      patchesJson6902:
      - target:
          group: apps
          version: v1
          kind: Deployment
          name: vault-agent-injector
        patch:
        - op: add
          path: /spec/template/spec/containers/0/startupProbe
          value: { "failureThreshold": 12, "initialDelaySeconds": 5, "periodSeconds": 5, "timeoutSeconds": 5, "successThreshold": 1, "httpGet": {"path":"/health/ready", "port": 8080, "scheme": "HTTPS"} }

kiich commented 2 years ago

+1 to say we have just run into this as well, and changing replicas to 1 fixed it for us. It would be great to work out what the root issue is!

kiich commented 2 years ago

The same config (with regard to probes etc.) works fine for us in another environment, so we are wondering whether this is a race condition.

tvoran commented 2 years ago

Yeah I suspect it just takes the injector's leader election a little too long on some systems to establish a leader and generate the certificates for communicating with the k8s API, and so the pod is killed by the liveness probe.
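
To make the timing concrete, here is an annotated sketch of the probe values already quoted in this thread (the liveness probe from the reported Deployment and the startupProbe from the workaround). The timings in the comments are rough estimates derived from those values; kubelet scheduling jitter is ignored.

```yaml
# Container spec fragment (sketch), values copied from earlier in this thread.
livenessProbe:
  httpGet:
    path: /health/ready
    port: 8080
    scheme: HTTPS
  initialDelaySeconds: 5   # first check roughly 5 s after the container starts
  periodSeconds: 2         # then every 2 s
  failureThreshold: 2      # two consecutive failures restart the container,
                           # so the injector has well under 10 s to obtain a certificate
startupProbe:
  httpGet:
    path: /health/ready
    port: 8080
    scheme: HTTPS
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 12     # up to 12 x 5 s of failed checks, about a minute,
                           # before the container is restarted
```

Kubernetes holds back liveness and readiness checks until the startup probe has succeeded, which is why adding it gives leader election and certificate generation more time without loosening the liveness probe itself.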

ajiteb commented 2 years ago

I get the errors below even with 1 replica. FYI, I'm using an external Vault address.

vault-agent-injector-7b6ffd4845-js52q sidecar-injector 2022-11-16T10:18:29.609Z [ERROR] handler: http: TLS handshake error from 172.33.16.217:47484: read tcp 172.33.0.213:8080->172.33.16.217:47484: read: connection reset by peer
vault-agent-injector-7b6ffd4845-js52q sidecar-injector 2022-11-16T10:41:37.807Z [ERROR] handler: http: TLS handshake error from 172.33.16.217:40358: read tcp 172.33.0.213:8080->172.33.16.217:40358: read: connection reset by peer
vault-agent-injector-7b6ffd4845-js52q sidecar-injector 2022-11-16T10:43:29.826Z [ERROR] handler: http: TLS handshake error from 172.33.16.217:42968: EOF
vault-agent-injector-7b6ffd4845-js52q sidecar-injector 2022-11-16T10:43:29.846Z [ERROR] handler: http: TLS handshake error from 172.33.16.217:42972: EOF
vault-agent-injector-7b6ffd4845-js52q sidecar-injector 2022-11-16T10:43:29.860Z [ERROR] handler: http: TLS handshake error from 172.33.16.217:42978: read tcp 172.33.0.213:8080->172.33.16.217:42978: read: connection reset by peer

tvoran commented 2 years ago

Hi @ajiteb that looks like a different issue, so you may want to start by reviewing the connectivity requirements: https://developer.hashicorp.com/vault/docs/platform/k8s/injector/examples#before-using-the-vault-agent-injector

ajiteb commented 2 years ago

@tvoran thanks for your reply. Yes, at least the pod doesn't go into CrashLoopBackOff; it stays in the Running state. Still, I can't work out the reason for these log entries. I'm on EKS 1.23 with Vault version 1.12.1.

siwyroot commented 2 years ago

I have exactly the same problem on a few OpenShift clusters (not all of them), where I can deploy the Vault injector with only 1 replica. The secret for the leader elector is always empty. I'm using the newest chart and newest images, on OpenShift 4.8.35.

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  namespace: flux-system
  name: vault
spec:
  targetNamespace: ${namespace}
  values:
    fullnameOverride: vault
    global:
      openshift: true
    injector:
      authPath: auth/${cluster_id}
      image:
        repository: hashicorp/vault-k8s
        tag: "1.1"
      agentImage:
        repository: hashicorp/vault
        tag: "1.12.1"
      agentDefaults:
        cpuLimit: "200m"
        cpuRequest: "1m"
      externalVaultAddr: ${vault_url}
      failurePolicy: Fail
      namespaceSelector:
        matchLabels:
          vault-injector-enabled: 'true'
      replicas: 2
  interval: 5m
  chart:
    spec:
      chart: vault
      interval: 1m
      reconcileStrategy: ChartVersion
      sourceRef:
        name: hashicorp
        namespace: flux-system
        kind: HelmRepository
      version: '0.22.1'

Logs:

Using internal leader elector logic for webhook certificate management
Listening on ":8080"...
2022-11-28T17:17:42.609Z [INFO]  handler: Starting handler..
2022-11-28T17:17:42.682Z [ERROR] handler: http: TLS handshake error from 10.131.4.1:55786: no certificate available
2022-11-28T17:17:42.709Z [INFO]  handler.certwatcher: Updated certificate bundle received. Updating certs...
2022-11-28T17:17:42.709Z [WARN]  handler.certwatcher: Could not load TLS keypair: tls: failed to find any PEM data in certificate input. Trying again...
2022-11-28T17:17:42.709Z [INFO]  handler.certwatcher: Webhooks changed. Updating certs...
2022-11-28T17:17:42.709Z [WARN]  handler.certwatcher: Could not load TLS keypair: tls: failed to find any PEM data in certificate input. Trying again...
2022-11-28T17:17:42.709Z [INFO]  handler.certwatcher: Webhooks changed. Updating certs...
2022-11-28T17:17:42.709Z [WARN]  handler.certwatcher: Could not load TLS keypair: tls: failed to find any PEM data in certificate input. Trying again...
2022-11-28T17:17:42.709Z [INFO]  handler.certwatcher: Webhooks changed. Updating certs...
2022-11-28T17:17:42.709Z [WARN]  handler.certwatcher: Could not load TLS keypair: tls: failed to find any PEM data in certificate input. Trying again...
2022-11-28T17:17:42.709Z [INFO]  handler.certwatcher: Webhooks changed. Updating certs...
2022-11-28T17:17:42.709Z [WARN]  handler.certwatcher: Could not load TLS keypair: tls: failed to find any PEM data in certificate input. Trying again...
2022-11-28T17:17:42.709Z [INFO]  handler.certwatcher: Webhooks changed. Updating certs...
2022-11-28T17:17:42.709Z [WARN]  handler.certwatcher: Could not load TLS keypair: tls: failed to find any PEM data in certificate input. Trying again...
2022-11-28T17:17:42.709Z [INFO]  handler.certwatcher: Webhooks changed. Updating certs...
2022-11-28T17:17:42.709Z [WARN]  handler.certwatcher: Could not load TLS keypair: tls: failed to find any PEM data in certificate input. Trying again...
2022-11-28T17:17:42.709Z [INFO]  handler.certwatcher: Webhooks changed. Updating certs...
2022-11-28T17:17:42.709Z [WARN]  handler.certwatcher: Could not load TLS keypair: tls: failed to find any PEM data in certificate input. Trying again...
2022-11-28T17:17:42.709Z [INFO]  handler.certwatcher: Webhooks changed. Updating certs...
2022-11-28T17:17:42.709Z [WARN]  handler.certwatcher: Could not load TLS keypair: tls: failed to find any PEM data in certificate input. Trying again...
2022-11-28T17:17:42.709Z [INFO]  handler.certwatcher: Webhooks changed. Updating certs...
2022-11-28T17:17:42.709Z [WARN]  handler.certwatcher: Could not load TLS keypair: tls: failed to find any PEM data in certificate input. Trying again...
2022-11-28T17:17:42.709Z [INFO]  handler.certwatcher: Webhooks changed. Updating certs...
2022-11-28T17:17:42.709Z [WARN]  handler.certwatcher: Could not load TLS keypair: tls: failed to find any PEM data in certificate input. Trying again...
2022-11-28T17:17:42.709Z [INFO]  handler.certwatcher: Webhooks changed. Updating certs...
2022-11-28T17:17:42.709Z [WARN]  handler.certwatcher: Could not load TLS keypair: tls: failed to find any PEM data in certificate input. Trying again...
I1128 17:17:43.643688       1 request.go:682] Waited for 1.042662433s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/apps/v1?timeout=32s
2022-11-28T17:17:48.079Z [ERROR] handler: http: TLS handshake error from 10.131.4.1:55810: no certificate available
2022-11-28T17:17:48.079Z [ERROR] handler: http: TLS handshake error from 10.131.4.1:55812: no certificate available
2022-11-28T17:17:50.078Z [ERROR] handler: http: TLS handshake error from 10.131.4.1:55824: no certificate available
2022-11-28T17:17:50.078Z [ERROR] handler: http: TLS handshake error from 10.131.4.1:55826: no certificate available

tvoran commented 2 years ago

Hi @siwyroot, I wonder if what you're seeing could be related to https://github.com/hashicorp/vault-k8s/issues/378?

siwyroot commented 2 years ago

@tvoran Hi, I think it's the same issue. Thanks!

adjain131995 commented 1 year ago

@tvoran we have faced the exact same issue, and the hotfix was to set the replica count to 1. However, this definitely looks like a bug, and I believe it should be fixed.
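
For reference, the single-replica workaround mentioned here and earlier in the thread amounts to a values fragment like the sketch below. Only the `replicas` key from the original report changes; everything else is assumed to stay as deployed.

```yaml
# Helm values fragment (sketch): temporary workaround, not a fix.
injector:
  enabled: true
  # One replica avoids the slow multi-replica startup described in this issue,
  # at the cost of running the webhook without redundancy.
  replicas: 1
```

This trades away redundancy: with a webhook `failurePolicy` of `Fail`, as in one of the configurations posted above, pod creation in the selected namespaces would likely be rejected whenever the single injector pod is unavailable.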

tvoran commented 1 year ago

Hi folks, I believe this has been addressed in #852, which was released in v0.24.0.