Closed — @L7R7 closed this issue 3 years ago
Hi @L7R7,
Sorry we didn't get back to you at the time. Are you still experiencing this issue? If so, may I suggest that you have the script log the response from the token renewal, and/or enable auditing so you can examine the responses that way? I can't explain the behaviour you're seeing, but I suspect this would shed some light.
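To make the first suggestion concrete, here is a minimal sketch of what logging the full renewal response could look like, using only the Python standard library against Vault's real `/v1/auth/token/renew-self` endpoint. The address, token value, and the `summarize_renewal` helper are illustrative, not part of the reporter's actual script:

```python
import json
import urllib.request

# Illustrative values -- replace with your real address and token.
VAULT_ADDR = "https://sample-host.de"
VAULT_TOKEN = "s.xxxxxxxx"


def summarize_renewal(resp: dict) -> str:
    """Condense a renew-self response body into one log line."""
    auth = resp.get("auth") or {}
    return "renewed=%s lease_duration=%s warnings=%s" % (
        bool(auth.get("client_token")),
        auth.get("lease_duration"),
        resp.get("warnings"),
    )


def renew_token() -> None:
    """POST to /v1/auth/token/renew-self and log the full response body."""
    req = urllib.request.Request(
        f"{VAULT_ADDR}/v1/auth/token/renew-self",
        method="POST",
        headers={"X-Vault-Token": VAULT_TOKEN},
    )
    with urllib.request.urlopen(req, timeout=10) as r:
        body = json.load(r)
    # Log the whole body, not just the status code: a 200 with an odd
    # lease_duration or a populated warnings field is exactly what to
    # look for when a "successful" renewal doesn't seem to stick.
    print(json.dumps(body))
    print(summarize_renewal(body))


# What the condensed line looks like for a (made-up) successful response:
sample = {"auth": {"client_token": "s.abc", "lease_duration": 3600}, "warnings": None}
print(summarize_renewal(sample))  # renewed=True lease_duration=3600 warnings=None
```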
> The only possible problem I can think of is some kind of split-brain scenario where the successful token renewal isn't done by `vault-2` but by another pod, or it's done by `vault-2` but not propagated internally.
That shouldn't be possible with Raft. Only the leader can make changes to stored data, and there can only be a leader within a partition that has quorum (i.e. >50% of the nodes). Furthermore, attempts to store data will fail unless they can be persisted to a quorum of nodes.
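To spell out the quorum arithmetic behind that claim, here is a small sketch in Python (not Vault code, just the majority rule described above):

```python
def quorum(n: int) -> int:
    """Smallest majority of an n-node Raft cluster (> 50% of the nodes)."""
    return n // 2 + 1


def can_elect_leader(partition_size: int, cluster_size: int) -> bool:
    """A network partition can elect a leader only if it holds a quorum."""
    return partition_size >= quorum(cluster_size)


# With 3 nodes, quorum is 2: a single isolated node can neither become
# leader nor commit writes, which is why a committed renewal should be
# visible cluster-wide.
print(quorum(3))               # 2
print(can_elect_leader(1, 3))  # False
print(can_elect_leader(2, 3))  # True
```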
I'll leave this ticket open while I wait to hear back from you; if I don't hear back after a few weeks, I'll assume the issue is resolved.
@ncabatoff Thanks for your answer! I'll close this issue, as I haven't seen the problem for months now, so it seems to be resolved. I don't know what fixed it, but it seems to be fixed.
> and/or enable auditing
This is a good idea; we didn't try that back then. I'll keep it in mind in case we run into issues in the future.
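For anyone landing here later, enabling a file audit device is a one-liner with the standard Vault CLI (the file path below is illustrative, and the command needs a sufficiently privileged token):

```shell
# Every request/response is then logged (with sensitive values hashed),
# so "successful" renewals can be compared against what the cluster saw.
vault audit enable file file_path=/vault/audit/audit.log
```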
**Describe the bug**
Token renewal doesn't work properly, although the token renewal request via HTTP succeeds. We have a script that renews the token regularly. What I can see in the script's logs is something like the following:
This happens after the token is in use for a couple of days. My interpretation of the logs is:
We're running Vault on Kubernetes with 3 instances, HA mode enabled, and the Raft storage backend. In the logs from the Vault pods I can see quite a lot of error messages like this:
This is the first one, coming from the pod that's the cluster leader. There are quite a lot of the same messages directly after that. In between I see the following:
I assume that `vault-0` and `vault-1` think that there's no leader and try to become the leader.

I'm wondering how it's possible that the token renewal succeeds but apparently doesn't. The only possible problem I can think of is some kind of split-brain scenario where the successful token renewal isn't done by `vault-2` but by another pod, or it's done by `vault-2` but not propagated internally. What's also interesting is the fact that `vault-2` stays the cluster leader the whole time.

We can work around this issue by killing the pod that can't renew its token. All our other pods (including the one that is started as a replacement for the killed one) are running fine.
**To Reproduce**
The issue occurs very rarely, and I don't see anything obvious on how to reproduce it. We already observed this behaviour a couple of weeks ago. Other than that we don't have issues with the Vault installation.
**Expected behavior**
If the token renewal reports success, it should actually be successful.
**Environment:**

* Vault Server Version (retrieved with `vault status`): 1.5.3 (we observed the same problem with 1.5.2 as well)

**Helm chart config**
```yaml
global:
  enabled: true
  imagePullSecrets: []
  tlsDisable: false
  openshift: false
injector:
  enabled: false
  externalVaultAddr: ""
  image:
    repository: "hashicorp/vault-k8s"
    tag: "0.4.0"
    pullPolicy: IfNotPresent
  agentImage:
    repository: "vault"
    tag: "1.5.3"
  authPath: "auth/kubernetes"
  logLevel: "info"
  logFormat: "standard"
  revokeOnShutdown: false
  namespaceSelector: {}
  certs:
    secretName: null
    caBundle: ""
    certName: tls.crt
    keyName: tls.key
  resources: {}
  extraEnvironmentVars: {}
  affinity: null
  tolerations: null
  nodeSelector: null
  priorityClassName: ""
server:
  image:
    repository: "vault"
    tag: "1.5.3"
    pullPolicy: IfNotPresent
  updateStrategyType: "RollingUpdate"
  resources:
    requests:
      memory: 256Mi
      cpu: 150m
    limits:
      memory: 256Mi
      cpu: 150m
  ingress:
    enabled: true
    labels: {}
    annotations:
      nginx.ingress.kubernetes.io/backend-protocol: HTTPS
    hosts:
      - host: sample-host.de
        paths: []
    tls:
      - secretName: wildcard-host-de
        hosts:
          - sample-host.de
  route:
    enabled: false
    labels: {}
    annotations: {}
    host: chart-example.local
  authDelegator:
    enabled: false
  extraInitContainers: null
  extraContainers: null
  shareProcessNamespace: false
  extraArgs: ""
  readinessProbe:
    enabled: true
  livenessProbe:
    enabled: false
    path: "/v1/sys/health?standbyok=true"
    initialDelaySeconds: 60
  preStopSleepSeconds: 5
  postStart: []
  extraEnvironmentVars:
    VAULT_CACERT: /vault/userconfig/vault-server-tls/ca.crt
    VAULT_LOG_FORMAT: json
  extraSecretEnvironmentVars:
    - envName: AWS_ACCESS_KEY_ID
      secretName: vault-unseal
      secretKey: AWS_ACCESS_KEY_ID
    - envName: AWS_SECRET_ACCESS_KEY
      secretName: vault-unseal
      secretKey: AWS_SECRET_ACCESS_KEY
  extraVolumes:
    - type: secret
      name: vault-server-tls
  affinity: |
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/name: {{ template "vault.name" . }}
              app.kubernetes.io/instance: "{{ .Release.Name }}"
              component: server
          topologyKey: kubernetes.io/hostname
  tolerations: null
  nodeSelector: null
  priorityClassName: ""
  extraLabels: {}
  annotations: {}
  service:
    enabled: true
    port: 8200
    targetPort: 8200
    annotations: {}
  dataStorage:
    enabled: true
    size: 10Gi
    storageClass: null
    accessMode: ReadWriteOnce
  auditStorage:
    enabled: false
    size: 10Gi
    storageClass: null
    accessMode: ReadWriteOnce
  dev:
    enabled: false
  standalone:
    enabled: "-"
    config: |
      ui = true
      listener "tcp" {
        tls_disable = 1
        address = "[::]:8200"
        cluster_address = "[::]:8201"
      }
      storage "file" {
        path = "/vault/data"
      }
  ha:
    enabled: true
    replicas: 3
    raft:
      enabled: true
      setNodeId: true
      config: |
        ui = true
        listener "tcp" {
          tls_disable = 0
          address = "[::]:8200"
          cluster_address = "[::]:8201"
          tls_cert_file = "/vault/userconfig/vault-server-tls/vault.crt"
          tls_key_file = "/vault/userconfig/vault-server-tls/vault.key"
          tls_client_ca_file = "/vault/userconfig/vault-server-tls/ca.crt"
          telemetry {
            unauthenticated_metrics_access = true
          }
        }
        storage "raft" {
          path = "/vault/data"
          retry_join {
            leader_api_addr = "https://vault-0.vault-internal:8200"
            leader_ca_cert_file = "/vault/userconfig/vault-server-tls/ca.crt"
          }
          retry_join {
            leader_api_addr = "https://vault-1.vault-internal:8200"
            leader_ca_cert_file = "/vault/userconfig/vault-server-tls/ca.crt"
          }
          retry_join {
            leader_api_addr = "https://vault-2.vault-internal:8200"
            leader_ca_cert_file = "/vault/userconfig/vault-server-tls/ca.crt"
          }
        }
        telemetry {
          prometheus_retention_time = "24h"
          disable_hostname = true
          enable_hostname_label = true
        }
        seal "awskms" {
          region = "eu-central-1"
          kms_key_id = "..."
        }
        service_registration "kubernetes" {}
    config: |
      ui = true
      listener "tcp" {
        tls_disable = 1
        address = "[::]:8200"
        cluster_address = "[::]:8201"
      }
      storage "consul" {
        path = "vault"
        address = "HOST_IP:8500"
      }
      service_registration "kubernetes" {}
    disruptionBudget:
      enabled: true
      maxUnavailable: null
  serviceAccount:
    annotations: {}
ui:
  enabled: true
  serviceType: "ClusterIP"
  serviceNodePort: null
  externalPort: 8200
  annotations: {}
```

**Additional context**
We're seeing a spike in the CPU and memory consumption of `vault-2` right around the point in time when the logs regarding the TLS handshake errors start (around 10:38:30, which correlates with the timestamps of the log messages). Interestingly, the memory consumption is significantly higher afterwards: