(Issue closed by masterphenix 1 year ago)
Hi @masterphenix, sorry to hear you're running into trouble with multiple replicas! Those chart values look good to me, though with them I haven't been able to reproduce your issue yet. I wonder if the liveness and readiness timeouts are just too low for your system? It looks like the chart doesn't have those as configurable options yet, so if you can adjust them manually that might shed some light on what the problem is. You may also want to set injector.logLevel: debug to get more info in the logs.
Hello @tvoran, thank you for having a look at this. I tried changing the liveness/readiness timeouts, and you were right, it started working. As a definitive solution, I have added a startupProbe to give the pods enough time to elect a leader and generate certificates.
That's great to hear! Would you mind sharing what you changed/added to get it working? That will better inform how to go about adding support for this to the chart.
Sure, here is the startupProbe I added:
startupProbe:
  failureThreshold: 12
  httpGet:
    path: /health/ready
    port: 8080
    scheme: HTTPS
  initialDelaySeconds: 5
  periodSeconds: 5
  successThreshold: 1
  timeoutSeconds: 5
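For context on why those numbers work, here is a rough sketch (an illustrative model, not the kubelet's actual implementation) of how much startup time that probe configuration tolerates before the container is restarted:

```python
# Probe settings from the startupProbe above.
failure_threshold = 12
period_seconds = 5
initial_delay_seconds = 5

# The container gets roughly initialDelaySeconds plus failureThreshold
# consecutive failed probe periods before the kubelet restarts it.
max_startup_window = initial_delay_seconds + failure_threshold * period_seconds
print(max_startup_window)  # → 65
```

So the injector pod has roughly a minute to finish leader election and certificate generation, instead of the much shorter window the default liveness/readiness probes allow.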
Since I am using Flux, and deploying the injector with the vault helm chart, I used a kustomize postRenderer in the HelmRelease to add the probe:
postRenderers:
  # Instruct helm-controller to use built-in "kustomize" post renderer.
  - kustomize:
      patchesJson6902:
        - target:
            group: apps
            version: v1
            kind: Deployment
            name: vault-agent-injector
          patch:
            - op: add
              path: /spec/template/spec/containers/0/startupProbe
              value: { "failureThreshold": 12, "initialDelaySeconds": 5, "periodSeconds": 5, "timeoutSeconds": 5, "successThreshold": 1, "httpGet": {"path": "/health/ready", "port": 8080, "scheme": "HTTPS"} }
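For anyone not using Flux, the same JSON-6902 patch can be kept as a standalone document and applied directly, e.g. with `kubectl patch deployment vault-agent-injector --type=json --patch-file=probe.json` (a hypothetical invocation; adjust the name and namespace to your install). A small sketch that builds and serializes that patch:

```python
import json

# The same patch as above, expressed as a standalone JSON-6902 document.
patch = [
    {
        "op": "add",
        "path": "/spec/template/spec/containers/0/startupProbe",
        "value": {
            "failureThreshold": 12,
            "initialDelaySeconds": 5,
            "periodSeconds": 5,
            "timeoutSeconds": 5,
            "successThreshold": 1,
            "httpGet": {"path": "/health/ready", "port": 8080, "scheme": "HTTPS"},
        },
    }
]

# Write it out for use with `kubectl patch --patch-file=...`.
print(json.dumps(patch, indent=2))
```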
+1 to say we have just run into this as well, and changing replicas to 1 fixed it for us. It would be great to work out what the root issue is!
The same config (with regards to probes etc.) works fine for us in another environment, so we are wondering if it is a race condition?
Yeah I suspect it just takes the injector's leader election a little too long on some systems to establish a leader and generate the certificates for communicating with the k8s API, and so the pod is killed by the liveness probe.
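That suspected race can be sketched as a toy model (an assumption about the timing, not the injector's actual code): the pod survives only if certificate generation finishes before the liveness probe runs out of patience.

```python
# Toy model: the pod is killed when cert generation (which waits on leader
# election) takes longer than the liveness probe tolerates. The probe
# numbers here are illustrative, not the chart's actual defaults.
def pod_survives(cert_ready_at, initial_delay=2, period=2, failure_threshold=2):
    deadline = initial_delay + failure_threshold * period
    return cert_ready_at <= deadline

print(pod_survives(5))   # fast leader election: 5 <= 6, pod survives
print(pod_survives(30))  # slow election: killed before the certs exist
```

On a slow system the same deployment crosses the deadline, which would explain why the identical config works in one environment and crashes in another.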
I get the errors below even with 1 replica. FYI, I'm using an external Vault address.
vault-agent-injector-7b6ffd4845-js52q sidecar-injector 2022-11-16T10:18:29.609Z [ERROR] handler: http: TLS handshake error from 172.33.16.217:47484: read tcp 172.33.0.213:8080->172.33.16.217:47484: read: connection reset by peer
vault-agent-injector-7b6ffd4845-js52q sidecar-injector 2022-11-16T10:41:37.807Z [ERROR] handler: http: TLS handshake error from 172.33.16.217:40358: read tcp 172.33.0.213:8080->172.33.16.217:40358: read: connection reset by peer
vault-agent-injector-7b6ffd4845-js52q sidecar-injector 2022-11-16T10:43:29.826Z [ERROR] handler: http: TLS handshake error from 172.33.16.217:42968: EOF
vault-agent-injector-7b6ffd4845-js52q sidecar-injector 2022-11-16T10:43:29.846Z [ERROR] handler: http: TLS handshake error from 172.33.16.217:42972: EOF
vault-agent-injector-7b6ffd4845-js52q sidecar-injector 2022-11-16T10:43:29.860Z [ERROR] handler: http: TLS handshake error from 172.33.16.217:42978: read tcp 172.33.0.213:8080->172.33.16.217:42978: read: connection reset by peer
Hi @ajiteb that looks like a different issue, so you may want to start by reviewing the connectivity requirements: https://developer.hashicorp.com/vault/docs/platform/k8s/injector/examples#before-using-the-vault-agent-injector
@tvoran thanks for your reply. Yes, at least the pod doesn't go into CrashLoopBackOff; it stays in the Running state. Anyway, I'm not able to understand the reason for these logs. I'm on EKS 1.23 with Vault version 1.12.1.
I have exactly the same problem on a few OpenShift clusters (not all of them), where I can deploy the Vault injector with only 1 replica. The secret for the leader elector is always empty. Newest chart and newest images. OpenShift 4.8.35.
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  namespace: flux-system
  name: vault
spec:
  targetNamespace: ${namespace}
  values:
    fullnameOverride: vault
    global:
      openshift: true
    injector:
      authPath: auth/${cluster_id}
      image:
        repository: hashicorp/vault-k8s
        tag: "1.1"
      agentImage:
        repository: hashicorp/vault
        tag: "1.12.1"
      agentDefaults:
        cpuLimit: "200m"
        cpuRequest: "1m"
      externalVaultAddr: ${vault_url}
      failurePolicy: Fail
      namespaceSelector:
        matchLabels:
          vault-injector-enabled: 'true'
      replicas: 2
  interval: 5m
  chart:
    spec:
      chart: vault
      interval: 1m
      reconcileStrategy: ChartVersion
      sourceRef:
        name: hashicorp
        namespace: flux-system
        kind: HelmRepository
      version: '0.22.1'
Logs:
Using internal leader elector logic for webhook certificate management
Listening on ":8080"...
2022-11-28T17:17:42.609Z [INFO] handler: Starting handler..
2022-11-28T17:17:42.682Z [ERROR] handler: http: TLS handshake error from 10.131.4.1:55786: no certificate available
2022-11-28T17:17:42.709Z [INFO] handler.certwatcher: Updated certificate bundle received. Updating certs...
2022-11-28T17:17:42.709Z [WARN] handler.certwatcher: Could not load TLS keypair: tls: failed to find any PEM data in certificate input. Trying again...
2022-11-28T17:17:42.709Z [INFO] handler.certwatcher: Webhooks changed. Updating certs...
2022-11-28T17:17:42.709Z [WARN] handler.certwatcher: Could not load TLS keypair: tls: failed to find any PEM data in certificate input. Trying again...
(the previous two lines repeat many more times with the same timestamp)
I1128 17:17:43.643688 1 request.go:682] Waited for 1.042662433s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/apps/v1?timeout=32s
2022-11-28T17:17:48.079Z [ERROR] handler: http: TLS handshake error from 10.131.4.1:55810: no certificate available
2022-11-28T17:17:48.079Z [ERROR] handler: http: TLS handshake error from 10.131.4.1:55812: no certificate available
2022-11-28T17:17:50.078Z [ERROR] handler: http: TLS handshake error from 10.131.4.1:55824: no certificate available
2022-11-28T17:17:50.078Z [ERROR] handler: http: TLS handshake error from 10.131.4.1:55826: no certificate available
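The "failed to find any PEM data" warnings in those logs are consistent with the certs secret being empty: the certwatcher keeps trying to load a keypair that has no bytes in it. An illustrative sketch (not the injector's code) reproducing the same class of failure with Python's standard ssl module:

```python
import os
import ssl
import tempfile

# Create an empty .pem file, standing in for the empty certs Secret.
with tempfile.NamedTemporaryFile(suffix=".pem", delete=False) as f:
    empty_pem = f.name  # zero bytes written

ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
try:
    # Loading a keypair from an empty PEM file fails, just as the
    # injector's certwatcher fails until the leader populates the Secret.
    ctx.load_cert_chain(certfile=empty_pem, keyfile=empty_pem)
    loaded = True
except ssl.SSLError as err:
    loaded = False
    print("failed to load keypair:", err)
finally:
    os.unlink(empty_pem)
```

Until the leader writes real certificate data into the Secret, every TLS handshake against the webhook will fail with "no certificate available".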
Hi @siwyroot, I wonder if what you're seeing could be related to https://github.com/hashicorp/vault-k8s/issues/378?
@tvoran Hi, I think it's same issue. Thanks!
@tvoran we have faced the exact same issue, and the hotfix was setting the replica count to 1. However, this definitely indicates a bug, and I believe it should be fixed.
Describe the bug
I have deployed the vault-agent-injector using Helm, with auto-TLS and 2 replicas, and both pods go into CrashLoopBackOff. Logs show the following errors:

A describe on the pods shows both liveness and readiness probes failing:

Also, the vault-injector-certs secret is of type "Opaque" and shows no data, which doesn't seem right.

To Reproduce
Steps to reproduce the behavior:

Application deployment:

Expected behavior
Pods should start the same way as they do when replicas=1

Environment

Additional context
This is what the vault-k8s-leader configMap looks like: