ribbybibby / ssl_exporter

Exports Prometheus metrics for TLS certificates
Apache License 2.0

Context deadline exceeded #53

Closed stamina11 closed 3 years ago

stamina11 commented 4 years ago

I monitor about 350 TLS certificates. Not all of them are reachable for a remote check, which is fine because Prometheus still records that the check failed. However, about 15 of them don't even report a failed connection, so no Prometheus metrics are available for them. In the logs they all look like this. Is there a workaround?

time="2020-09-30T22:21:54Z" level=error msg="error=Get \"https://somesite.com\": context deadline exceeded (Client.Timeout exceeded while awaiting headers) target=somesite.com prober=https timeout=30s" source="ssl_exporter.go:91"

ribbybibby commented 4 years ago

Hi @stamina11,

The duration of the timeout is taken from the X-Prometheus-Scrape-Timeout-Seconds header which is part of the scrape request and configured by the scrape_timeout in the scrape_config. I suppose with both values being the same it may be the case that Prometheus cancels the request before the ssl_tls_connect_success metric is served by the exporter.

To catch this case you could alert on up{job="your-ssl-job"} == 0 in addition to looking at ssl_tls_connect_success.

It's currently not possible to configure a timeout that is lower than the scrape_timeout but that could be a useful feature. I will look at implementing it.

stamina11 commented 4 years ago

Thank you so much for your response. I believe I didn't explain myself correctly. Those failed certs are showing as Down in Prometheus, even if I update the timeouts. Down in Prometheus:

time="2020-09-30T22:21:54Z" level=error msg="error=Get \"https://somesite.com\": context deadline exceeded (Client.Timeout exceeded while awaiting headers) target=somesite.com prober=https timeout=30s" source="ssl_exporter.go:91"

Up in Prometheus, yet still failing to connect. These errors are OK, since these targets shouldn't be able to connect:

time="2020-10-05T21:59:12Z" level=error msg="error=Get \"https://somehost\": dial tcp: lookup somehost on 127.0.0.11:53: no such host target=somehost prober=https timeout=5s" source="ssl_exporter.go:91"

time="2020-10-05T21:59:12Z" level=error msg="error=Get \"https://anotherhost\": x509: certificate signed by unknown authority target=anotherhost prober=https timeout=5s" source="ssl_exporter.go:91"

time="2020-10-05T21:59:13Z" level=error msg="error=Get \"https://anotherhost1\": dial tcp 159.212.78.21:443: connect: connection refused target=anotherhost1 prober=https timeout=5s" source="ssl_exporter.go:91"

But you did point me to a workaround. Even though these metrics are failing, I am able to add a third one (up == 0), which brings in the labels I want in Grafana. Here is my query if anyone is having the same issue:

((ssl_cert_not_after{dnsnames=~".+", instance="$instance"} - time())/24/60/60) or ssl_tls_connect_success{instance="$instance"} == 0 or up{instance="$instance"} == 0

Thanks again. You guys are doing awesome work and this is a huge help to me personally.

ribbybibby commented 3 years ago

Very happy to hear that you found a workaround for your case. Even though it didn't directly address your request, I've implemented a module-level timeout which was released in v2.2.0.

Closing this issue now.