prometheus / blackbox_exporter

Blackbox prober exporter
https://prometheus.io
Apache License 2.0

Cannot find a way to have an alarm on expired certificates #1119

Open JDA88 opened 10 months ago

JDA88 commented 10 months ago

Host operating system: output of uname -a

Windows

blackbox_exporter version: output of blackbox_exporter --version

0.24.0

What is the blackbox.yml module config.

  http_tls_check_only_1:
    prober: http
    http:
      valid_status_codes: [200,401]
      preferred_ip_protocol: ip4
      method: GET
      fail_if_not_ssl: true
      tls_config:
        insecure_skip_verify: true
  http_tls_check_only_2:
    prober: http
    http:
      valid_status_codes: [200,401]
      preferred_ip_protocol: ip4
      method: GET
      fail_if_not_ssl: true
      tls_config:
        insecure_skip_verify: false

What did you do that produced an error?

No matter which insecure_skip_verify option I use, if the target certificate is expired, the SSL-related metrics (probe_ssl_last_chain_expiry_timestamp_seconds, etc.) are not present.

What did you expect to see?

A way to differentiate a site not responding from a site with an expired certificate.

What did you see instead?

I have an alarm for certificates that will expire “soon”, but as soon as the certificate has expired it doesn’t work anymore.

I don’t mind the probe failing on an expired certificate, but when the probe fails with "tls: failed to verify certificate: x509: certificate has expired or is not yet valid", the timestamp metrics should still be present. That would allow alerts with messages like “certificate expired x days ago”.

Another option could also be a new probe_failed_due_to_expired_certificate metric

dswarbrick commented 10 months ago

I am not able to reproduce this. With insecure_skip_verify: true, a probe against https://expired.badssl.com returns:

# HELP probe_dns_lookup_time_seconds Returns the time taken for probe dns lookup in seconds
# TYPE probe_dns_lookup_time_seconds gauge
probe_dns_lookup_time_seconds 5.000845812
# HELP probe_duration_seconds Returns how long the probe took to complete in seconds
# TYPE probe_duration_seconds gauge
probe_duration_seconds 5.6184539430000005
# HELP probe_failed_due_to_regex Indicates if probe failed due to regex
# TYPE probe_failed_due_to_regex gauge
probe_failed_due_to_regex 0
# HELP probe_http_content_length Length of http content response
# TYPE probe_http_content_length gauge
probe_http_content_length 494
# HELP probe_http_duration_seconds Duration of http request by phase, summed over all redirects
# TYPE probe_http_duration_seconds gauge
probe_http_duration_seconds{phase="connect"} 0.13220203
probe_http_duration_seconds{phase="processing"} 0.204624022
probe_http_duration_seconds{phase="resolve"} 5.000845812
probe_http_duration_seconds{phase="tls"} 0.280287724
probe_http_duration_seconds{phase="transfer"} 0.00013658
# HELP probe_http_last_modified_timestamp_seconds Returns the Last-Modified HTTP response header in unixtime
# TYPE probe_http_last_modified_timestamp_seconds gauge
probe_http_last_modified_timestamp_seconds 1.689976823e+09
# HELP probe_http_redirects The number of redirects
# TYPE probe_http_redirects gauge
probe_http_redirects 0
# HELP probe_http_ssl Indicates if SSL was used for the final redirect
# TYPE probe_http_ssl gauge
probe_http_ssl 1
# HELP probe_http_status_code Response HTTP status code
# TYPE probe_http_status_code gauge
probe_http_status_code 200
# HELP probe_http_uncompressed_body_length Length of uncompressed response body
# TYPE probe_http_uncompressed_body_length gauge
probe_http_uncompressed_body_length 494
# HELP probe_http_version Returns the version of HTTP of the probe response
# TYPE probe_http_version gauge
probe_http_version 1.1
# HELP probe_ip_addr_hash Specifies the hash of IP address. It's useful to detect if the IP address changes.
# TYPE probe_ip_addr_hash gauge
probe_ip_addr_hash 1.56181497e+08
# HELP probe_ip_protocol Specifies whether probe ip protocol is IP4 or IP6
# TYPE probe_ip_protocol gauge
probe_ip_protocol 4
# HELP probe_ssl_earliest_cert_expiry Returns last SSL chain expiry in unixtime
# TYPE probe_ssl_earliest_cert_expiry gauge
probe_ssl_earliest_cert_expiry 1.428883199e+09
# HELP probe_ssl_last_chain_expiry_timestamp_seconds Returns last SSL chain expiry in timestamp
# TYPE probe_ssl_last_chain_expiry_timestamp_seconds gauge
probe_ssl_last_chain_expiry_timestamp_seconds -6.21355968e+10
# HELP probe_ssl_last_chain_info Contains SSL leaf certificate information
# TYPE probe_ssl_last_chain_info gauge
probe_ssl_last_chain_info{fingerprint_sha256="ba105ce02bac76888ecee47cd4eb7941653e9ac993b61b2eb3dcc82014d21b4f",issuer="CN=COMODO RSA Domain Validation Secure Server CA,O=COMODO CA Limited,L=Salford,ST=Greater Manchester,C=GB",subject="CN=*.badssl.com,OU=Domain Control Validated+OU=PositiveSSL Wildcard",subjectalternative="*.badssl.com,badssl.com"} 1
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1
# HELP probe_tls_version_info Returns the TLS version used or NaN when unknown
# TYPE probe_tls_version_info gauge
probe_tls_version_info{version="TLS 1.2"} 1

Logs:

ts=2023-09-11T02:28:56.857917859Z caller=main.go:181 module=http_2xx target=https://expired.badssl.com level=info msg="Beginning probe" probe=http timeout_seconds=119.5
ts=2023-09-11T02:28:56.857991462Z caller=http.go:328 module=http_2xx target=https://expired.badssl.com level=info msg="Resolving target address" target=expired.badssl.com ip_protocol=ip4
ts=2023-09-11T02:29:01.858742185Z caller=http.go:328 module=http_2xx target=https://expired.badssl.com level=info msg="Resolved target address" target=expired.badssl.com ip=104.154.89.105
ts=2023-09-11T02:29:01.858916072Z caller=client.go:260 module=http_2xx target=https://expired.badssl.com level=info msg="Making HTTP request" url=https://104.154.89.105 host=expired.badssl.com
ts=2023-09-11T02:29:02.476254806Z caller=handler.go:120 module=http_2xx target=https://expired.badssl.com level=info msg="Received HTTP response" status_code=200
ts=2023-09-11T02:29:02.47637202Z caller=handler.go:120 module=http_2xx target=https://expired.badssl.com level=info msg="Response timings for roundtrip" roundtrip=0 start=2023-09-11T04:29:01.859041909+02:00 dnsDone=2023-09-11T04:29:01.859041909+02:00 connectDone=2023-09-11T04:29:01.991243924+02:00 gotConn=2023-09-11T04:29:02.271599196+02:00 responseStart=2023-09-11T04:29:02.476223259+02:00 tlsStart=2023-09-11T04:29:01.991292589+02:00 tlsDone=2023-09-11T04:29:02.271580327+02:00 end=2023-09-11T04:29:02.476359843+02:00
ts=2023-09-11T02:29:02.476410357Z caller=main.go:181 module=http_2xx target=https://expired.badssl.com level=info msg="Probe succeeded" duration_seconds=5.6184539430000005

JDA88 commented 10 months ago

Case 1: insecure_skip_verify: true on https://expired.badssl.com/

Case 2: insecure_skip_verify: false on https://expired.badssl.com/ :

Case 3: insecure_skip_verify: true on https://google.com/ :

Case 4: insecure_skip_verify: false on https://google.com/ :

The issue with insecure_skip_verify is that as soon as you set it to true, the timestamp is always negative and you can't have an alarm based on the date. Currently we have 2 options:

To have both alarms working you have to query the same target with two modules

IMO timestamp should never be negative if the certificate is present.

dswarbrick commented 10 months ago

insecure_skip_verify will have negligible effect on a target that has a well-maintained and up-to-date certificate chain (such as google.com), so cases 3 & 4 are obvious and to be expected.

Case 2 is also to be expected, since a tls.Config with InsecureSkipVerify false (i.e., the default) will return an error during a TLS handshake, causing blackbox_exporter's probe to fail (and thus the usual metrics will be missing).

However, your case 1 disagrees with the results of my test, which as you can see in my previous comment, had probe_success 1. An alerting rule which checked for probe_success == 1 and probe_ssl_earliest_cert_expiry < time() would achieve your goal of differentiating a site not responding from a site with an expired certificate.

If you append &debug=true to your probe, it will shed more light on why it considers the probe failed.

dswarbrick commented 10 months ago

Incidentally, the probe_ssl_last_chain_expiry_timestamp_seconds metric will be meaningless on probes with insecure_skip_verify: true, since it is derived from the tls.ConnectionState slice of VerifiedChains, which is empty if Config.InsecureSkipVerify is true.

In such cases, the probe_ssl_last_chain_expiry_timestamp_seconds is set to the Unix epoch value of an uninitialised time.Time{} (i.e., time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC)), which is -62135596800.

In contrast, the probe_ssl_earliest_cert_expiry metric is derived from the PeerCertificates slice in the tls.ConnectionState, which is unaffected by InsecureSkipVerify. blackbox_exporter simply ranges over these and returns the earliest "NotAfter" date. Technically, this may not be the target's certificate, as it will be the earliest expiry date of any certificate which is offered by the target during the TLS handshake. For example, if the target sends a chain containing intermediate CAs, and one of those CAs expires before the target certificate itself, then probe_ssl_earliest_cert_expiry will be set accordingly. This is usually what you want, since the expiration of any certificate in the chain would cause a TLS handshake failure.

JDA88 commented 10 months ago

You are right, my bad, Case 1 was returning a 403 because our transparent proxy blocked the website from the server subnet (Access denied due to bad server certificate)

So, to summarize:

| insecure_skip_verify | Certificate status | probe_success | SSL timestamp metrics | Can detect certificate expiring soon | Can detect certificate expired |
|---|---|---|---|---|---|
| true | Valid | true | -62135596800 | No | No |
| true | Expired | true | -62135596800 | No | No |
| false | Valid | true | Expiration date | Query 1 | ? |
| false | Expired | false | Missing | Query 1 | ? |

Query 1: (probe_ssl_last_chain_expiry_timestamp_seconds{} - time()) < (86400 * 15)

To detect a certificate expiring soon insecure_skip_verify: false is required. So in order to detect a certificate already expired I need a query with:

I can't find a way to make this work; my PromQL skills are not good enough.

dswarbrick commented 10 months ago

To detect a certificate expiring soon, it does not matter what insecure_skip_verify is set to. Setting insecure_skip_verify: true is only necessary when probing a target whose certificate has already expired, since the TLS handshake would otherwise fail, causing the entire probe to fail.

JDA88 commented 10 months ago

> To detect a certificate expiring soon, it does not matter what insecure_skip_verify is set to. Setting insecure_skip_verify: true is only necessary when probing a target whose certificate has already expired, since the TLS handshake would otherwise fail, causing the entire probe to fail.

I know, but having insecure_skip_verify: true is useless because once it is set to true there is no way to tell whether a certificate has expired or not, and this is my main issue.

dswarbrick commented 10 months ago

> I know, but having insecure_skip_verify: true is useless because once set to true there is no way to tell the difference between a certificate expired or not and this is my main issue

probe_success is a pretty broad metric, but assuming that the only issue is that the TLS verification fails (which may be for reasons other than certificate expiration), then the probe_ssl_earliest_cert_expiry is usable even when insecure_skip_verify is set to true.

In the past I have simply used probe_ssl_earliest_cert_expiry - time() < 14 * 86400 to alert when a certificate has less than 14 days before expiring.

If you also want to be able to determine the actual expiry date of a target whose certificate has already expired, then you will of course need to probe using a blackbox module with insecure_skip_verify: true. However, since most TLS clients will refuse to connect to a peer with an expired certificate anyway, probe_success 0 is arguably sufficient to call attention to such a scenario.

As I already explained, probe_ssl_last_chain_expiry_timestamp_seconds is useless if insecure_skip_verify is set to true. This is due to the internal implementation details of the tls package in Go.

dswarbrick commented 10 months ago

The crux is that you really need to alert for (and resolve) expiring certificates before they expire. Once they have expired, probe_success will be zero, and if you are probing using a module with insecure_skip_verify: false (i.e., the default and recommended setting), then there isn't really any other useful metric that indicates why the probe failed.

If you know that you have targets whose certificates have already expired, then your only real option is to probe them with insecure_skip_verify: true, and use the probe_ssl_earliest_cert_expiry metric.

JDA88 commented 10 months ago

> The crux is that you really need to alert for (and resolve) expiring certificates before they expire.

100% agree. Unfortunately, in real life (and with internal certificates) it’s not uncommon to have an expired certificate. And with the current implementation, the classic probe_ssl_earliest_cert_expiry - time() < x alert disappears as soon as the certificate expires and a new "probe down" alert appears, which is suboptimal for alert tracking.

I still think we could use a new probe_failed_due_to_expired_certificate metric for those cases