hashicorp / vault

A tool for secrets management, encryption as a service, and privileged access management
https://www.vaultproject.io/

Token renewal fails although the renewal request succeeds #10277

Closed · L7R7 closed this issue 3 years ago

L7R7 commented 3 years ago

**Describe the bug**
Token renewal doesn't work properly, although the token renewal request via HTTP succeeds. We have a script that renews the token regularly. What I can see in the script's logs is something like the following:

```
....
2020-10-30T10:33:27+01:00   token renewal succeeded
2020-10-30T10:38:44+01:00   token renewal failed
2020-10-30T10:38:44+01:00   curl(22): curl: (22) The requested URL returned error: 504
2020-10-30T10:39:01+01:00   token renewal succeeded
2020-10-30T10:44:03+01:00   token renewal failed
2020-10-30T10:44:03+01:00   curl(22): curl: (22) The requested URL returned error: 403
2020-10-30T10:44:20+01:00   token renewal failed
2020-10-30T10:44:20+01:00   curl(22): curl: (22) The requested URL returned error: 403
....
```

This happens after the token has been in use for a couple of days. My interpretation of the logs is that the renewal reported as successful at 10:39:01 doesn't actually take effect: from 10:44 on, every renewal attempt with that token is rejected with 403.
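
For context, the renewal step in the script is essentially a curl call that only logs whether the HTTP request succeeded. A minimal sketch of such a step (hypothetical, not our exact script; it assumes the standard `auth/token/renew-self` endpoint and `VAULT_ADDR`/`VAULT_TOKEN` in the environment):

```bash
#!/usr/bin/env bash
# Minimal sketch of one periodic renewal step (hypothetical, not the real script).
set -euo pipefail

if curl --fail --silent --show-error --output /dev/null \
    --header "X-Vault-Token: ${VAULT_TOKEN}" \
    --request POST \
    "${VAULT_ADDR}/v1/auth/token/renew-self"; then
  echo "$(date -Is)   token renewal succeeded"
else
  rc=$?   # curl exits with 22 when --fail sees an HTTP error >= 400
  echo "$(date -Is)   token renewal failed"
  echo "$(date -Is)   curl(${rc})"
fi
```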

We're running Vault on Kubernetes with 3 instances, HA mode enabled, and the Raft storage backend. In the logs from the Vault pods I can see quite a lot of error messages like this:

```
2020-10-30T10:38:11+01:00   info    vault-2 http: TLS handshake error from <redacted ip>:56684: EOF
```

This is the first such message, and it comes from the pod that is the cluster leader. Quite a lot of identical messages follow directly after it. In between I see the following:

```
2020-10-30T10:38:28+01:00   warn    vault-0 heartbeat timeout reached, starting election
2020-10-30T10:38:28+01:00   info    vault-0 entering candidate state
2020-10-30T10:38:28+01:00   warn    vault-1 rejecting vote request since we have a leader
2020-10-30T10:38:34+01:00   warn    vault-0 Election timeout reached, restarting election
2020-10-30T10:38:34+01:00   info    vault-0 entering candidate state
2020-10-30T10:38:34+01:00   warn    vault-1 rejecting vote request since we have a leader
2020-10-30T10:38:39+01:00   info    vault-0 entering candidate state
2020-10-30T10:38:39+01:00   warn    vault-0 Election timeout reached, restarting election
2020-10-30T10:38:39+01:00   warn    vault-1 rejecting vote request since we have a leader
2020-10-30T10:38:40+01:00   error   vault-0 failed to make requestVote RPC  read tcp <redacted ip>:41262-><redacted ip>:8201: i/o timeout
2020-10-30T10:38:44+01:00   error   vault-0 failed to make requestVote RPC  tls: DialWithDialer timed out
```

I assume that vault-0 loses contact with the leader, thinks there is no leader anymore, and tries to become the leader itself, while vault-1 rejects the vote requests because it still sees a leader.
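
To see which node each pod considers the leader during such a phase, one could ask every pod directly. A rough sketch, assuming kubectl access, a `vault` namespace, and (for the raft peer listing) a sufficiently privileged `VAULT_TOKEN` inside the pod; names are examples:

```bash
# Each pod's own view of its seal/HA state (pod and namespace names are examples).
for pod in vault-0 vault-1 vault-2; do
  echo "--- ${pod} ---"
  kubectl exec -n vault "${pod}" -- vault status
done

# The raft peer set and current leader as seen by vault-2; this requires a
# token with access to sys/storage/raft/configuration.
kubectl exec -n vault vault-2 -- vault operator raft list-peers
```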

I'm wondering how it's possible that the token renewal reports success but apparently isn't effective. The only explanation I can think of is some kind of split-brain scenario where the successful token renewal isn't handled by vault-2 but by another pod, or it is handled by vault-2 but not propagated internally. What's also interesting is that vault-2 stays the cluster leader the whole time.

We can work around this issue by killing the pod that can't renew its token. All our other pods (including the one that is started as a replacement for the killed one) are running fine.
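
For completeness, the workaround is simply deleting the affected consumer pod so that its replacement logs in to Vault again and obtains a fresh token; roughly like this (pod and namespace names are placeholders):

```bash
# Delete the application pod whose token can no longer be renewed; its
# Deployment/StatefulSet recreates it and the new pod authenticates again.
kubectl delete pod -n my-app-namespace my-app-7d9f8c6b5-abcde
```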

**To Reproduce**
The issue occurs very rarely, and I don't see an obvious way to reproduce it. We already observed this behaviour a couple of weeks ago. Other than that, we don't have issues with the Vault installation.

**Expected behavior**
If the token renewal request reports success, the token should actually be renewed.

**Environment:**

Helm chart Config ```yaml global: enabled: true imagePullSecrets: [] tlsDisable: false openshift: false injector: enabled: false externalVaultAddr: "" image: repository: "hashicorp/vault-k8s" tag: "0.4.0" pullPolicy: IfNotPresent agentImage: repository: "vault" tag: "1.5.3" authPath: "auth/kubernetes" logLevel: "info" logFormat: "standard" revokeOnShutdown: false namespaceSelector: {} certs: secretName: null caBundle: "" certName: tls.crt keyName: tls.key resources: {} extraEnvironmentVars: {} affinity: null tolerations: null nodeSelector: null priorityClassName: "" server: image: repository: "vault" tag: "1.5.3" pullPolicy: IfNotPresent updateStrategyType: "RollingUpdate" resources: requests: memory: 256Mi cpu: 150m limits: memory: 256Mi cpu: 150m ingress: enabled: true labels: {} annotations: nginx.ingress.kubernetes.io/backend-protocol: HTTPS hosts: - host: sample-host.de paths: [] tls: - secretName: wildcard-host-de hosts: - sample-host.de route: enabled: false labels: {} annotations: {} host: chart-example.local authDelegator: enabled: false extraInitContainers: null extraContainers: null shareProcessNamespace: false extraArgs: "" readinessProbe: enabled: true livenessProbe: enabled: false path: "/v1/sys/health?standbyok=true" initialDelaySeconds: 60 preStopSleepSeconds: 5 postStart: [] extraEnvironmentVars: VAULT_CACERT: /vault/userconfig/vault-server-tls/ca.crt VAULT_LOG_FORMAT: json extraSecretEnvironmentVars: - envName: AWS_ACCESS_KEY_ID secretName: vault-unseal secretKey: AWS_ACCESS_KEY_ID - envName: AWS_SECRET_ACCESS_KEY secretName: vault-unseal secretKey: AWS_SECRET_ACCESS_KEY extraVolumes: - type: secret name: vault-server-tls affinity: | podAntiAffinity: requiredDuringSchedulingIgnoredDuringExecution: - labelSelector: matchLabels: app.kubernetes.io/name: {{ template "vault.name" . }} app.kubernetes.io/instance: "{{ .Release.Name }}" component: server topologyKey: kubernetes.io/hostname tolerations: null nodeSelector: null priorityClassName: "" extraLabels: {} annotations: {} service: enabled: true port: 8200 targetPort: 8200 annotations: {} dataStorage: enabled: true size: 10Gi storageClass: null accessMode: ReadWriteOnce auditStorage: enabled: false size: 10Gi storageClass: null accessMode: ReadWriteOnce dev: enabled: false standalone: enabled: "-" config: | ui = true listener "tcp" { tls_disable = 1 address = "[::]:8200" cluster_address = "[::]:8201" } storage "file" { path = "/vault/data" } ha: enabled: true replicas: 3 raft: enabled: true setNodeId: true config: | ui = true listener "tcp" { tls_disable = 0 address = "[::]:8200" cluster_address = "[::]:8201" tls_cert_file = "/vault/userconfig/vault-server-tls/vault.crt" tls_key_file = "/vault/userconfig/vault-server-tls/vault.key" tls_client_ca_file = "/vault/userconfig/vault-server-tls/ca.crt" telemetry { unauthenticated_metrics_access = true } } storage "raft" { path = "/vault/data" retry_join { leader_api_addr = "https://vault-0.vault-internal:8200" leader_ca_cert_file = "/vault/userconfig/vault-server-tls/ca.crt" } retry_join { leader_api_addr = "https://vault-1.vault-internal:8200" leader_ca_cert_file = "/vault/userconfig/vault-server-tls/ca.crt" } retry_join { leader_api_addr = "https://vault-2.vault-internal:8200" leader_ca_cert_file = "/vault/userconfig/vault-server-tls/ca.crt" } } telemetry { prometheus_retention_time = "24h" disable_hostname = true enable_hostname_label = true } seal "awskms" { region = "eu-central-1" kms_key_id = "..." 
} service_registration "kubernetes" {} config: | ui = true listener "tcp" { tls_disable = 1 address = "[::]:8200" cluster_address = "[::]:8201" } storage "consul" { path = "vault" address = "HOST_IP:8500" } service_registration "kubernetes" {} disruptionBudget: enabled: true maxUnavailable: null serviceAccount: annotations: {} ui: enabled: true serviceType: "ClusterIP" serviceNodePort: null externalPort: 8200 annotations: {} ```

**Additional context**
We're seeing a spike in the CPU and memory consumption of vault-2 right around the time the TLS handshake errors start (around 10:38:30, which correlates with the timestamps of the log messages above). Interestingly, the memory consumption is significantly higher afterwards:

[Screenshot: CPU consumption of the Vault pods, with a spike on vault-2 around 10:38]

[Screenshot: memory consumption of the Vault pods, noticeably higher on vault-2 after 10:38]
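
Since the listener telemetry stanza in the config above sets `unauthenticated_metrics_access = true`, the memory growth can also be cross-checked against Vault's own Prometheus metrics. A quick sketch (the host name is a placeholder and exact metric names may differ slightly between versions):

```bash
# Scrape Vault's runtime memory metrics from the Prometheus endpoint; the host
# is a placeholder (alternatively: kubectl port-forward vault-2 8200:8200).
curl --silent "https://sample-host.de/v1/sys/metrics?format=prometheus" \
  | grep -E '^vault_runtime_(alloc_bytes|sys_bytes|num_goroutines)'
```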

ncabatoff commented 3 years ago

Hi @L7R7,

Sorry we didn't get back to you at the time. Are you still experiencing this issue? If so, may I suggest that you have the script log the response from the token renewal, and/or enable auditing so you can examine the responses that way? I can't explain the behaviour you're seeing, but I suspect this would shed some light.
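
For reference, enabling a file audit device and having the script log the response body could look roughly like this (the file path and variable names are placeholders):

```bash
# Enable a file audit device so every request and response (with sensitive
# values HMAC'd) gets recorded; the file path is a placeholder.
vault audit enable file file_path=/vault/audit/vault_audit.log

# In the renewal script, capture and log the response body and status code,
# not just curl's exit code.
response=$(curl --silent --show-error --write-out '\nHTTP status: %{http_code}' \
  --header "X-Vault-Token: ${VAULT_TOKEN}" \
  --request POST "${VAULT_ADDR}/v1/auth/token/renew-self")
echo "${response}"
```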

> The only explanation I can think of is some kind of split-brain scenario where the successful token renewal isn't handled by vault-2 but by another pod, or it is handled by vault-2 but not propagated internally.

That shouldn't be possible with Raft. Only the leader can make changes to stored data, and there can only be a leader within a partition that has quorum (i.e. more than 50% of the nodes; with your 3 nodes, that means at least 2). Furthermore, attempts to store data will fail unless they can be persisted to a quorum of nodes.

I'll leave this ticket open while I wait to hear back from you; if I don't hear back after a few weeks, I'll assume the issue is resolved.

L7R7 commented 3 years ago

@ncabatoff Thanks for your answer! I'll close this issue as I haven't seen the problem for months now, so it seems to be resolved. I don't know what fixed it, but it appears to be fixed.

> and/or enable auditing

This is a good idea; we didn't try that back then. I'll keep it in mind in case we have issues in the future.