spring-projects / spring-boot

Spring Boot helps you to create Spring-powered, production-grade applications and services with absolute minimum fuss.
https://spring.io/projects/spring-boot
Apache License 2.0

Expose SslBundle information via actuator metrics #42030

Open seschi98 opened 2 months ago

seschi98 commented 2 months ago

I saw that support for SSL bundles was added to the actuator info and health endpoints in https://github.com/spring-projects/spring-boot/pull/41205 and I think it would be really helpful to make that information available in the metrics endpoint as well. I would like to utilize this enhancement to set up an alarm in my monitoring software so that I can renew my certificates before they expire.

I would imagine something like this:

GET http://localhost:8080/actuator/metrics/ssl.bundle.expiry

{
  "name": "ssl.bundle.expiry",
  "baseUnit": "days",
  "measurements": [
    {
      "statistic": "VALUE",
      "value": 351.0
    }
  ],
  "availableTags": [
    {
      "tag": "cert_alias",
      "values": [
        "my-alias"
      ]
    },
    {
      "tag": "bundle_name",
      "values": [
        "mybundle"
      ]
    }
  ]
}

mhalbritter commented 2 months ago

That's an interesting idea. @jonatan-ivanov, what do you think?

jonatan-ivanov commented 2 months ago

I think this is a great idea and we can create a MeterBinder that registers Gauges for every certificate chain. I think we should keep this on the chain level since we can probably uniquely identify it (assumption). The part that concerns me a bit is that the chain can contain multiple certificates, so we need to somehow aggregate the "days left" value and the validity statuses of all of the certs in the chain. We also need to deal with corner cases, like what if one cert in the chain is not valid yet and another has already expired (and so on)?

We also need to be able to somehow "refresh" the registered Gauges since the bundle can change at runtime (because of reload).

So I think this is useful but could be trickier than it seems at first sight. Btw we can also provide counters for the number of chains by status (valid/expired/soon-to-expire/etc.) so that people can track them on a graph and also create alerts (e.g.: soon-to-expire should be 0).

milaneuh commented 1 month ago

Just an assumption, but maybe we could first expose the SslBundle information as metrics by using counters (i.e. the second solution @jonatan-ivanov suggested) in this issue, and then open another issue for the "bigger" enhancement with the MeterBinder registration? That would allow us to provide a functional solution quicker and then enhance it.

I'm not a huge open-source contributor, so please correct me if I am wrong. I want to learn.

jonatan-ivanov commented 1 month ago

The two solutions I was talking about above ("days left" and "cert count by status") are not mutually exclusive, I think we should do both.

The MeterBinder component is "just" a place where you can register Meters. It makes things more structured and does not add much complexity. The complexity of the problem space is in aggregating and re-registering Gauges for the "days left" values. For "counting the certs by status" (which is still a Gauge since the value is non-monotonic), we don't have these problems, but we should still use a MeterBinder to register them.
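To make the "count by status" idea concrete, here is a minimal, registry-independent sketch in plain Java. The Status enum, the class name and countByStatus are hypothetical names chosen for illustration; they are not Spring Boot or Micrometer API. Every status starts at zero so that a gauge backed by this map would report 0 rather than disappear, and the count for a status can go down as well as up after a reload, which is why it is a gauge value rather than a counter.

```java
import java.util.Collection;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

final class ChainCountSketch {

    // Hypothetical status names, loosely based on the ones mentioned in this thread.
    enum Status { VALID, WILL_EXPIRE_SOON, NOT_YET_VALID, EXPIRED }

    // Counts chains per status; statuses with no chains are reported as 0.
    static Map<Status, Integer> countByStatus(Collection<Status> chainStatuses) {
        Map<Status, Integer> counts = new EnumMap<>(Status.class);
        for (Status status : Status.values()) {
            counts.put(status, 0);
        }
        for (Status status : chainStatuses) {
            counts.merge(status, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Three chains: two valid, one expired.
        System.out.println(countByStatus(List.of(Status.VALID, Status.EXPIRED, Status.VALID)));
    }
}
```

An alert on "soon to expire should be 0" would then only need to read the WILL_EXPIRE_SOON entry.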

milaneuh commented 1 month ago

Okay, thanks! I did a little deep dive into the code base, looking at how SslInfo stores the certificate chains by alias, etc.

Correct me if I'm wrong, but couldn't we avoid aggregating all certificates for the Gauge? Maybe only aggregate the certs that are currently valid, and exclude the ones that have already expired or aren't valid yet? Why do we need them (invalid certs) inside the gauge?

Couldn't we just refresh the gauge on reload?

wilkinsona commented 1 month ago

The part that concerns me a bit is that the chain can contain multiple certificates, so we need to somehow aggregate the "days left" value and the validity statuses of all of the certs in the chain. We also need to deal with corner cases, like what if one cert in the chain is not valid yet and another has already expired (and so on)?

This feels a little bit like aggregating the health status, where we use the worst case by default. For example, if one subsystem is down and everything else is up, the overall status will be down. I think a similar assume-the-worst approach makes sense here, as a certificate chain is only as good as the "worst" certificate in that chain. For corner cases where there's an expired certificate and a not-yet-valid certificate, I would consider expired to be worse than not-yet-valid, so the chain should be considered as having expired. My reasoning is that an expired cert cannot be fixed without someone doing something, but a not-yet-valid certificate could, potentially, be fixed just by waiting.
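The assume-the-worst rule could be sketched roughly as follows, in plain Java. The Status enum and the worst method are hypothetical and not taken from the Spring Boot code base. The only ordering stated above is that expired is worse than not-yet-valid; where WILL_EXPIRE_SOON sits relative to the others is an assumption of this sketch.

```java
import java.util.Collection;
import java.util.Comparator;
import java.util.List;

final class ChainStatusSketch {

    // Hypothetical statuses, declared from best to worst: a higher ordinal is worse.
    // Assumption: WILL_EXPIRE_SOON ranks between VALID and NOT_YET_VALID.
    enum Status { VALID, WILL_EXPIRE_SOON, NOT_YET_VALID, EXPIRED }

    // A chain is only as good as its worst certificate.
    static Status worst(Collection<Status> certificateStatuses) {
        return certificateStatuses.stream()
            .max(Comparator.comparingInt(Status::ordinal))
            .orElseThrow(() -> new IllegalArgumentException("empty chain"));
    }

    public static void main(String[] args) {
        // One expired and one not-yet-valid certificate: the chain counts as expired.
        System.out.println(worst(List.of(Status.VALID, Status.NOT_YET_VALID, Status.EXPIRED)));
    }
}
```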

milaneuh commented 1 month ago

For reference, here's the worst-case-by-default aggregation in the health status:

    if (containsOnlyValidCertificates(certificateChain)) {
        validCertificateChains.add(certificateChain);
    }
    else if (containsInvalidCertificate(certificateChain)) {
        invalidCertificateChains.add(certificateChain);
    }

As soon as a chain contains one invalid certificate, the whole chain is added to the "invalidCertificateChains" list.

mhalbritter commented 1 week ago

I currently have something like this:

# HELP ssl_chain_expiry_seconds SSL chain expiry
# TYPE ssl_chain_expiry_seconds gauge
ssl_chain_expiry_seconds{bundle="ssldemo",certificate="7207ee6e",chain="spring-boot-ssl-sample"} -3.15682879E8
# HELP ssl_chains  
# TYPE ssl_chains gauge
ssl_chains{status="expired"} 1.0
ssl_chains{status="not-yet-valid"} 0.0
ssl_chains{status="valid"} 0.0
ssl_chains{status="will-expire-soon"} 0.0

ssl_chain_expiry_seconds is a TimeGauge which shows the time until expiry of the chain (the chain expires when the first certificate in it expires). The certificate tag is the serial number of the certificate which expires first.
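As a rough illustration of "the chain expires when the first certificate in it expires", the following plain-Java sketch computes the seconds until the earliest validity end in a chain. It is not the code from the branch; secondsUntilExpiry and the class name are hypothetical, and the validity-end instants would in practice come from each certificate (for example via X509Certificate.getNotAfter()). The result is negative once the chain has expired, matching the sign of the sample value above.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Collection;
import java.util.List;

final class ChainExpirySketch {

    // Seconds from 'now' until the earliest validity end in the chain.
    // Negative when that certificate has already expired.
    static long secondsUntilExpiry(Collection<Instant> validityEnds, Instant now) {
        Instant earliest = validityEnds.stream()
            .min(Instant::compareTo)
            .orElseThrow(() -> new IllegalArgumentException("empty chain"));
        return Duration.between(now, earliest).getSeconds();
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2024-01-01T00:00:00Z");
        Instant leafEnds = Instant.parse("2024-01-02T00:00:00Z");
        Instant rootEnds = Instant.parse("2030-01-01T00:00:00Z");
        // The leaf expires first, one day (86400 seconds) after 'now'.
        System.out.println(secondsUntilExpiry(List.of(rootEnds, leafEnds), now));
    }
}
```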

ssl_chains is a Gauge which counts the chains by status. The status of a chain is the "worst" status of the contained certificates, from worst to best:

So far this is working quite nicely. However, reload could be a bit trickier. Essentially we need to remove the gauges which track chains which no longer exist and update the existing ones.
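The bookkeeping on reload can be thought of as a set difference between the chains that currently have gauges and the chains present after the reload. The sketch below is plain Java and deliberately independent of any MeterRegistry API; the class, the method names and the use of String identifiers for chains are assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.Set;

final class GaugeReloadSketch {

    // Chains that had a gauge before the reload but no longer exist: remove their gauges.
    static Set<String> toRemove(Set<String> tracked, Set<String> current) {
        Set<String> result = new HashSet<>(tracked);
        result.removeAll(current);
        return result;
    }

    // Chains that exist after the reload but have no gauge yet: register new gauges.
    static Set<String> toAdd(Set<String> tracked, Set<String> current) {
        Set<String> result = new HashSet<>(current);
        result.removeAll(tracked);
        return result;
    }

    public static void main(String[] args) {
        Set<String> tracked = Set.of("chain-a", "chain-b");
        Set<String> current = Set.of("chain-b", "chain-c");
        System.out.println("remove: " + toRemove(tracked, current)); // chain-a
        System.out.println("add: " + toAdd(tracked, current));       // chain-c
    }
}
```

Chains that appear in both sets keep their gauge and only need their value refreshed.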

The remove* methods on the MeterRegistry all have @Incubating on them - not sure if we should use them?

jonatan-ivanov commented 1 week ago

Nice! Can you show me the code? I guess we could remove @Incubating from remove; it has been there for years. It can be misused with high-cardinality tags, but there are parts of Micrometer where we use it for similar purposes (e.g.: instrumentation for Kafka metrics).

/cc @shakuzen

mhalbritter commented 1 week ago

Sure. I'll work a bit on it today, and then link the branch here.

shakuzen commented 1 week ago

It sounds like MultiGauge could take care of removing for you.

mhalbritter commented 1 week ago

@jonatan-ivanov: Code is here: https://github.com/mhalbritter/spring-boot/tree/mh/42030-expose-sslbundle-information-via-actuator-metrics

I've used the SampleTomcatSslApplication to test it in action: there's a background thread which fiddles with the SslBundles after some delay.