Open: sidewinder12s opened this issue 3 years ago
Hi @sidewinder12s, which version of Vault Agent are you using?
Version 1.5.4
The root cause on our k8s cluster was a high number of DNS requests/conntrack entries across the cluster, which would then overload the node the Vault Agent was running on. I've since resolved that with reduced kube-proxy usage and node-local DNS caching, but I assume the underlying issue still stands.
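For anyone debugging the same thing, a minimal sketch of checking conntrack utilization directly on a node. It assumes the standard nf_conntrack proc files are present (they only exist when the conntrack module is loaded), so adjust paths for your kernel:

```go
// Minimal sketch: check conntrack table utilization on a Linux node.
// Assumes the standard nf_conntrack proc files; paths may differ by kernel.
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

func readProcInt(path string) (int, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.Atoi(strings.TrimSpace(string(b)))
}

func main() {
	count, err := readProcInt("/proc/sys/net/netfilter/nf_conntrack_count")
	if err != nil {
		fmt.Fprintln(os.Stderr, "read count:", err)
		os.Exit(1)
	}
	max, err := readProcInt("/proc/sys/net/netfilter/nf_conntrack_max")
	if err != nil {
		fmt.Fprintln(os.Stderr, "read max:", err)
		os.Exit(1)
	}
	// When the table approaches its limit, the kernel starts dropping new
	// connections, which clients see as TCP connect failures rather than HTTP errors.
	fmt.Printf("conntrack: %d/%d (%.1f%%)\n", count, max, 100*float64(count)/float64(max))
}
```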
@sidewinder12s Can you please detail the steps for finding the k8s cluster traffic?
I had Prometheus node exporter running, which exposes metrics for node conntrack entries. I've since seen this bug/behavior a couple of other times where the TCP connection fails rather than returning an HTTP error code: once when we broke routing to Vault across AWS accounts, in addition to this node connection failure.
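If it helps, a rough sketch of pulling those conntrack metrics from Prometheus with the Go client. The Prometheus address is a placeholder; the metric names are the standard node_exporter ones:

```go
// Minimal sketch: query node_exporter conntrack metrics from Prometheus
// to see how close each node is to its conntrack limit.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Placeholder address; point at your Prometheus endpoint.
	client, err := api.NewClient(api.Config{Address: "http://prometheus:9090"})
	if err != nil {
		log.Fatal(err)
	}
	promAPI := v1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Percentage of the conntrack table in use, per node.
	query := `100 * node_nf_conntrack_entries / node_nf_conntrack_entries_limit`
	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		log.Fatal(err)
	}
	if len(warnings) > 0 {
		log.Println("warnings:", warnings)
	}
	fmt.Println(result)
}
```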
Experiencing the same issue. K8s v1.23.16, Vault v1.13.1. Is there any fix/workaround?
I have seen similar behavior when the Vault server was having trouble: we were seeing context deadline exceeded from the client, and the container never crashed until I manually deleted the pod.
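For context, a minimal sketch of how that error typically surfaces with the official Vault Go client: a bounded context turns a hung or unroutable connection into "context deadline exceeded" instead of blocking forever. The secret path here is a placeholder:

```go
// Minimal sketch, assuming the official Vault Go client (github.com/hashicorp/vault/api).
package main

import (
	"context"
	"log"
	"time"

	vault "github.com/hashicorp/vault/api"
)

func main() {
	// DefaultConfig picks up VAULT_ADDR, VAULT_CLIENT_TIMEOUT, etc. from the environment.
	client, err := vault.NewClient(vault.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
	defer cancel()

	secret, err := client.Logical().ReadWithContext(ctx, "secret/data/myapp")
	if err != nil {
		// A TCP-level failure (conntrack drops, broken routing) lands here as a
		// context deadline / connection error rather than an HTTP status code.
		log.Fatalf("read failed: %v", err)
	}
	log.Printf("got secret with %d keys", len(secret.Data))
}
```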
Describe the bug It appears that if the Vault init container runs into TCP connection errors (as opposed to an HTTP 500 or 400 error), it will continue retrying forever. The deployment that ran into this behavior was only cleaned up when another process removed the entire pod as failed.
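For illustration only, a hypothetical sketch of the bounded retry behavior this report is asking for. It is not the agent's actual code; the endpoint, limits, and fetch function are made up:

```go
// Hypothetical sketch: count TCP-level connection errors against the same
// retry budget as HTTP errors, instead of retrying forever.
package main

import (
	"errors"
	"fmt"
	"net"
	"net/url"
	"time"
)

const maxAttempts = 10 // illustrative retry budget

// fetchSecret stands in for whatever call the init container makes to Vault.
func fetchSecret() error {
	// Placeholder dial; a conntrack drop or broken route surfaces here as a net error.
	_, err := net.DialTimeout("tcp", "vault.example.internal:8200", 5*time.Second)
	return err
}

func main() {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err := fetchSecret()
		if err == nil {
			fmt.Println("secret rendered, exiting init container")
			return
		}

		var netErr net.Error
		var urlErr *url.Error
		if errors.As(err, &netErr) || errors.As(err, &urlErr) {
			// Connection-level failure: still count it against the retry budget.
			fmt.Printf("attempt %d: connection error: %v\n", attempt, err)
		} else {
			fmt.Printf("attempt %d: request error: %v\n", attempt, err)
		}
		time.Sleep(time.Duration(attempt) * time.Second) // simple linear backoff
	}
	// Give up so the pod fails visibly instead of hanging forever.
	fmt.Println("exceeded retry budget, failing init container")
}
```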
To Reproduce Steps to reproduce the behavior:
Application deployment:
Expected behavior
I would have expected the client timeout or retry limits to have an effect, or for the agent to hit a hard timeout and give up at some point.
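For reference, a minimal sketch of the client-side knobs I expected to apply, using the official Vault Go client. The same settings exist as the VAULT_CLIENT_TIMEOUT and VAULT_MAX_RETRIES environment variables; whether the agent honors them in this failure mode is the open question:

```go
// Minimal sketch: bound per-request time and retry count on the Vault Go client.
package main

import (
	"log"
	"time"

	vault "github.com/hashicorp/vault/api"
)

func main() {
	client, err := vault.NewClient(vault.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Bound each request and the number of retries instead of waiting forever.
	client.SetClientTimeout(30 * time.Second)
	client.SetMaxRetries(5)

	log.Println("client configured with bounded timeout and retries")
}
```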
Environment
Additional context I think we've been running into conntrack limits on some of our nodes, which have led to dropped packets, though this failure and its duration almost suggest the node itself had something wrong with it.