hashicorp / vault-k8s

First-class support for Vault and Kubernetes.
Mozilla Public License 2.0

Init container TCP timeout causes retries forever #203

Open sidewinder12s opened 3 years ago

sidewinder12s commented 3 years ago

Describe the bug

It appears that if the Vault init container runs into TCP connection errors (as opposed to an HTTP 500 or 400 error), it will keep retrying forever. The deployment that hit this behavior was only cleaned up when another process removed the entire pod as failed.

To Reproduce

Steps to reproduce the behavior:

  1. Deploy application annotated for vault-agent injection
  2. Have some kind of networking failure on the node that prevents connections to Vault (see the sketch after this list for one way to simulate this).
  3. Watch the init container run forever.
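One way to simulate step 2 without damaging a node is to block the injected pod's egress to Vault with a NetworkPolicy. This is only a sketch under assumed names (namespace, pod labels) and requires a CNI that enforces NetworkPolicy; with most plugins the denied packets are simply dropped, so the agent sees a dial i/o timeout rather than an HTTP error:

```yaml
# Hypothetical policy to reproduce the failure mode: allow only DNS egress
# from the injected pods so the agent's connection to Vault is dropped at
# the TCP layer (i/o timeout) instead of returning an HTTP status code.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-vault-egress     # placeholder name
  namespace: my-namespace      # placeholder namespace
spec:
  podSelector:
    matchLabels:
      app: my-app              # placeholder label on the injected pods
  policyTypes:
    - Egress
  egress:
    # Permit DNS so name resolution still works; everything else is denied.
    - ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```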

Application deployment:

Annotations: vault.hashicorp.com/agent-inject: true
              vault.hashicorp.com/agent-inject-secret-airflow: kv-v2/secret
              vault.hashicorp.com/agent-inject-secret-cloud-swe-jwt: kv-v2/secret
              vault.hashicorp.com/agent-inject-status: injected
              vault.hashicorp.com/agent-inject-template-airflow:

                {{- with secret "kv-v2/secret" -}}
                {{- .Data.data.value -}}
                {{- end -}}
              vault.hashicorp.com/agent-inject-template-cloud-swe-jwt:

                {{- with secret "kv-v2/secret" -}}
                {{- .Data.data.value -}}
                {{- end -}}
              vault.hashicorp.com/agent-pre-populate-only: true
              vault.hashicorp.com/role: my-role
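For reference, the annotations above (taken from kubectl describe) map onto the Deployment's pod template roughly as in the sketch below. The Deployment name, labels, image, and the use of a single secret are placeholders and simplifications, and note that values such as "true" must be quoted strings in the manifest:

```yaml
# Hypothetical Deployment reconstructed from the annotation dump above.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                    # placeholder
spec:
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
      annotations:
        vault.hashicorp.com/agent-inject: "true"
        vault.hashicorp.com/agent-pre-populate-only: "true"
        vault.hashicorp.com/role: "my-role"
        vault.hashicorp.com/agent-inject-secret-airflow: "kv-v2/secret"
        vault.hashicorp.com/agent-inject-template-airflow: |
          {{- with secret "kv-v2/secret" -}}
          {{- .Data.data.value -}}
          {{- end -}}
    spec:
      containers:
        - name: my-app
          image: my-app:latest    # placeholder
```

The agent-inject-status: injected annotation is added by the injector itself, so it is omitted here. The logs below are from the resulting vault-agent-init container.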

2020-12-16T17:19:50.568Z [ERROR] auth.handler: error authenticating: error="Put "https://vault/login": dial tcp: i/o timeout" backoff=1.696763184
2020-12-16T17:19:20.567Z [INFO]  auth.handler: authenticating
2020-12-16T17:19:17.792Z [ERROR] auth.handler: error authenticating: error="Put "https://vault/login": dial tcp: i/o timeout" backoff=2.774728483
2020-12-16T17:18:47.791Z [INFO]  auth.handler: authenticating
2020-12-16T17:18:46.354Z [ERROR] auth.handler: error authenticating: error="Put "https://vault/login": dial tcp: i/o timeout" backoff=1.437024355

Expected behavior

I would have expected the client timeout or retry limits to take effect, or at least for the agent to hit a hard timeout and give up at some point.
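(For anyone hitting this today: newer vault-k8s and Vault Agent releases appear to expose settings that can bound this retry loop. The annotation names below are assumptions based on later versions and should be checked against the injector documentation for the version actually deployed; a minimal sketch:)

```yaml
# Hypothetical annotations; availability and exact names depend on the
# vault-k8s / Vault Agent versions in use, so verify before relying on them.
annotations:
  vault.hashicorp.com/agent-inject: "true"
  vault.hashicorp.com/role: "my-role"
  # Make auto-auth exit on authentication errors so the init container
  # fails (and surfaces the problem) instead of retrying forever.
  vault.hashicorp.com/agent-auto-auth-exit-on-err: "true"
  # Cap the auto-auth retry backoff.
  vault.hashicorp.com/auth-max-backoff: "30s"
```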

Environment

Additional context

I think we've been running into conntrack limits on some of our nodes, which has led to dropped packets, though this failure and its duration almost make it seem like the node itself had something wrong with it.

jasonodonnell commented 3 years ago

Hi @sidewinder12s, which version of Vault Agent are you using?

sidewinder12s commented 3 years ago

Version 1.5.4

sidewinder12s commented 3 years ago

The root cause on our k8s cluster was a high number of DNS requests/conntrack entries across the cluster (which would then overload the node the Vault agent was running on). I've since solved that with reduced kube-proxy usage and node-local DNS caching, but I assume the issue still stands.

esethuraman commented 3 years ago

@sidewinder12s Can you please detail the steps for finding the k8s cluster traffic?

sidewinder12s commented 3 years ago

I had Prometheus node exporter, which has metrics for node conntrack entries. But I've now seen this bug/behavior a couple of other times where the TCP connection fails rather than returning an HTTP error code: once when we broke routing to Vault across AWS accounts, in addition to this node connection failure.
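For anyone wanting to watch for the same condition, here is a minimal Prometheus alerting-rule sketch on the node_exporter conntrack metrics (the threshold and labels are arbitrary choices, not from this thread):

```yaml
# Hypothetical rule: fires when a node's conntrack table is nearly full,
# which is roughly when new connections (such as the agent's dial to Vault)
# start being dropped.
groups:
  - name: conntrack
    rules:
      - alert: NodeConntrackNearLimit
        expr: node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.9
        for: 5m
        labels:
          severity: warning      # placeholder severity
        annotations:
          summary: "Conntrack table on {{ $labels.instance }} is over 90% full"
```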

maksemuz commented 1 year ago

Experiencing the same issue. K8s v1.23.16, Vault v1.13.1. Is there any fix or workaround?

puneetloya commented 1 month ago

I have seen similar behavior when the Vault server was having trouble: the client reported context deadline exceeded, and the container never crashed until I manually deleted the pod.