Open lpetrazickisupgrade opened 6 months ago
What seems to be happening is that Prometheus loads updated certs into new connections but not existing connections: https://github.com/prometheus/common/blob/v0.50.0/config/http_config.go#L979
Connections are set to remain open unless they are idle for 5 minutes. As long as the scrape interval is significantly shorter than 5 minutes, they remain open indefinitely: https://github.com/prometheus/common/blob/main/config/http_config.go#L54
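For anyone who wants to see the keep-alive behaviour in isolation, here is a minimal sketch using only the Go standard library. It is not Prometheus code: `connectionsOpened` and `scrape` are made-up names, the test server speaks plain HTTP instead of mTLS, and the 5 minute `IdleConnTimeout` is copied from the description above. It counts how many TCP connections the server sees across repeated scrapes, then again after an explicit `CloseIdleConnections()`.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// scrape issues one GET and drains the body so the connection can return to
// the transport's idle pool.
func scrape(client *http.Client, url string) error {
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	_, err = io.Copy(io.Discard, resp.Body)
	return err
}

// connectionsOpened (hypothetical helper) reports how many connections the
// server accepted after `scrapes` sequential requests, and again after
// flushing idle connections and scraping once more.
func connectionsOpened(scrapes int) (before, after int64, err error) {
	var opened int64
	srv := httptest.NewUnstartedServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "up 1")
	}))
	// Count every newly accepted connection on the server side.
	srv.Config.ConnState = func(_ net.Conn, s http.ConnState) {
		if s == http.StateNew {
			atomic.AddInt64(&opened, 1)
		}
	}
	srv.Start()
	defer srv.Close()

	// Assumption: same idle timeout as described above. Any TLS material would
	// only be consulted when a new connection is dialed, not on reuse.
	tr := &http.Transport{IdleConnTimeout: 5 * time.Minute}
	client := &http.Client{Transport: tr}

	for i := 0; i < scrapes; i++ {
		if err = scrape(client, srv.URL); err != nil {
			return 0, 0, err
		}
	}
	before = atomic.LoadInt64(&opened)

	// Dropping idle connections forces the next scrape to dial (and, with TLS,
	// to handshake) again.
	tr.CloseIdleConnections()
	if err = scrape(client, srv.URL); err != nil {
		return 0, 0, err
	}
	after = atomic.LoadInt64(&opened)
	return before, after, nil
}

func main() {
	before, after, err := connectionsOpened(5)
	if err != nil {
		panic(err)
	}
	fmt.Println("connections after 5 scrapes:", before)
	fmt.Println("connections after flush + 1 scrape:", after)
}
```

In this sketch all five scrapes should share a single connection, and only the explicit flush causes a second one to be dialed. That matches the symptom: a rotated cert on disk never gets a chance to be presented while the old connection stays busy.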
One possible enhancement could be for Prometheus to flush any connection that hits a 403 error
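A rough sketch of what "flush on 403" could look like as an `http.RoundTripper` wrapper. This is only an illustration of the idea, not how Prometheus or `prometheus/common` implements anything: `flushOn403` and `connectionsFor` are hypothetical names, and the demo server uses plain HTTP and always answers 403. One detail worth noting is that `CloseIdleConnections()` does not touch the connection that is still delivering the 403 response, so the wrapper buffers and closes the body first.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// flushOn403 is a hypothetical wrapper that drops pooled connections whenever
// the target answers 403, so the next request has to dial again.
type flushOn403 struct {
	base *http.Transport
}

func (f *flushOn403) RoundTrip(req *http.Request) (*http.Response, error) {
	resp, err := f.base.RoundTrip(req)
	if err != nil || resp.StatusCode != http.StatusForbidden {
		return resp, err
	}
	// Buffer and close the body so the connection goes back to the idle pool;
	// only then can CloseIdleConnections actually close it.
	body, readErr := io.ReadAll(resp.Body)
	resp.Body.Close()
	if readErr != nil {
		return nil, readErr
	}
	resp.Body = io.NopCloser(bytes.NewReader(body))
	f.base.CloseIdleConnections()
	return resp, nil
}

// connectionsFor (hypothetical helper) counts the connections a server that
// always returns 403 accepts over `scrapes` sequential requests.
func connectionsFor(wrap bool, scrapes int) (int64, error) {
	var opened int64
	srv := httptest.NewUnstartedServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		http.Error(w, "forbidden", http.StatusForbidden)
	}))
	srv.Config.ConnState = func(_ net.Conn, s http.ConnState) {
		if s == http.StateNew {
			atomic.AddInt64(&opened, 1)
		}
	}
	srv.Start()
	defer srv.Close()

	base := &http.Transport{}
	var rt http.RoundTripper = base
	if wrap {
		rt = &flushOn403{base: base}
	}
	client := &http.Client{Transport: rt}

	for i := 0; i < scrapes; i++ {
		resp, err := client.Get(srv.URL)
		if err != nil {
			return 0, err
		}
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
	}
	return atomic.LoadInt64(&opened), nil
}

func main() {
	plain, err := connectionsFor(false, 3)
	if err != nil {
		panic(err)
	}
	wrapped, err := connectionsFor(true, 3)
	if err != nil {
		panic(err)
	}
	fmt.Println("plain transport connections:", plain)
	fmt.Println("flush-on-403 connections:", wrapped)
}
```

With the plain transport, every 403 is served over the same reused connection. With the wrapper, each 403 forces a fresh dial, which in a TLS setup would be the point where a rotated client cert gets picked up. The cost is an extra handshake per 403, so a real implementation would probably want some rate limiting.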
We are seeing the same thing: updated k8s tls_config certs are not used for existing connections, and eventually Prometheus ends up using expired certificates on those connections.
+1 to flush connections on 403 and/or on cert reload
Did any of you find a solution/workaround to this, besides increasing scrape durations for tbot targets? I'm running into the same issue and haven't found any robust solution...
I'll also +1 on implementing the 403 flush/reload... would help tremendously.
Edit: After I wrote this, I did find a workaround that I'm currently using...
Instead of connecting to the tunnel endpoint directly, I just put nginx in front of it as the endpoint Prometheus connects to. Now, no matter how quickly tbot rotates certs, there's always a connection. Works for me, for now, until a fix can be implemented in Prometheus itself.
The big downside: my scrape durations almost tripled, from 40ms on the native tbot tunnel to about 120ms through the additional reverse proxy layer.
I'm running Prometheus Operator 0.71.2 with Prometheus 2.49.1 on EKS.
I have metric endpoints protected by a TLS cert and key. Teleport Tbot rotates the cert and key every n hours and writes them to a secret. There's a Probe resource that refers to that secret. Prometheus Operator loads the Probe into a Prometheus instance and rewrites the secret for that instance. Prometheus uses the rewritten secret to access the endpoint.
What I'm seeing is that:
The secrets look up to date on the Prometheus pod filesystem during the issue
Probe definition:
Generated config:
This sounds similar to #345 but is still happening today.