open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0
2.94k stars 2.29k forks source link

New component: TLS Check Receiver #35423

Open michael-burt opened 3 days ago

michael-burt commented 3 days ago

The purpose and use-cases of the new component

This purpose of the tlscheckreceiver is to provide expiry information about x.509 certificates. Monitoring the expiration of certificates is not achievable via any existing OpenTelemetry Collector receivers. This receiver would also check the validity of the x.509 certificate chain and emit this information as an attribute.

Example configuration for the component

receivers:
  tlscheck:
    targets:
      - url: https://example.com
      - url: https://foobar.com:8080

Telemetry data types supported

metrics

Is this a vendor-specific component?

Code Owner(s)

@michael-burt

Sponsor (optional)

@atoulme

Additional context

A draft metadata.yaml

type: receiver/tlscheck
scope_name: tlscheck
status:
  class: receiver
  stability:
    development: metrics
  codeowners:
    active: @michael-burt

sem_conv_version: 1.9.0

attributes:
  tlscheck.url:
    enabled: true
    description: Url at which the certificate was accessed.
    type: string
  tlscheck.x509_isvalid:
    enabled: true
    description: Is the certificate chain valid.
    type: bool
  tlscheck.x509_issuer:
    enabled: true
    description: The entity that issued the certificate, typically a Certificate Authority (CA).
    type: string
  tlscheck.x509_subject:
    enabled: true
    description: The entity that the certificate belongs to.
    type: string

metrics:
  tlscheck.time_left:
    enabled: true
    description: Time in seconds until certificate expiry, as specified by `NotAfter` field in the x.509 certificate.
    unit: seconds
    gauge:
      value_type: int
breedx-splk commented 3 days ago

Interesting idea, @michael-burt. A few comments:

- host: https://example.com - the thing on the righthand side isn't a host there, it's a url....so probably better to change the key to url. Given that, I'd also change tlscheck.host to tlscheck.url.

Another challenge is that x509 certs typically have two dates -- notBefore and notAfter (and maybe even more are possible?). I assume that most people care about the notAfter most of the time, but it's something to keep in mind. Consider updating the description to make this very explicit. I wouldn't bother with notBefore until someone claims a real need for it.

I think modeling this as a gauge counter representing the number of seconds until expiry (notAfter) seems more useful, rather than reporting the same constant datapoint value (epoch seconds) for the lifespan of a cert. Assuming a typical use case is to trigger an alert when there are fewer than 30 days left or whatever, that's much easier to do with a "time remaining" metric because you don't have to compute anything. This also makes it similar to the hardware/battery "time_left" metric as prior art (reference).

michael-burt commented 3 days ago

Thanks @breedx-splk , I made those updates. Do you have any suggestions for attribute names, or are these sufficient?

tlscheck.x509_isvalid
tlscheck.x509_issuer
tlscheck.x509_subject
MovieStoreGuy commented 3 days ago

Hey @michael-burt ,

This seems like a cool idea, my only nit is to make tlscheck.x509_isvalid a seperate metric instead of being an attribute on the metric. It would make it easier in metric backends to alert on invalid certs.

atoulme commented 3 days ago

This looks like a worthy component. I can sponsor it.

michael-burt commented 2 days ago

This seems like a cool idea, my only nit is to make tlscheck.x509_isvalid a seperate metric instead of being an attribute on the metric. It would make it easier in metric backends to alert on invalid certs.

Hi @MovieStoreGuy, I considered putting it in a metric but I couldn't really find examples of unitless boolean values being emitted as metrics within contrib except for this one in k8sclusterreceiver.

It makes me think that it goes against convention to have boolean metrics.

MovieStoreGuy commented 2 days ago

I see the point, it is also not something that is a hard blocker for this. I am more thinking of being lazy with the queries instead of matching against the attributes sent.

cwegener commented 2 days ago

I considered putting it in a metric but I couldn't really find examples of unitless boolean values being emitted as metrics within contrib except for this one in k8sclusterreceiver.

It makes me think that it goes against convention to have boolean metrics.

Determining the validity of a X509 certificate would be good use case for the new EntityState Events, with x509 certificate being the Entity and is_valid being the Entity Attribute. (Although I can see immediate problems with the semantic definition of is_valid here already).

Having said that, the attribute name for tlscheck.x509_isvalid should probably be changed, as this term is ambiguous. Both, path validation and validity period are terms that are used in RFC5280 and they are both distinct.

So, maybe a more explicit tlscheck.x509_pathvalid would be better in order to avoid end-user confusion?

michael-burt commented 1 day ago

Thanks for the good feedback @cwegener.

https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/35441/files#diff-fd4a88524697321ef8eb94214702d17988629021b4f234aa751899645eefe5dcR31

If the certificate is expired, tlscheck.time_left will be 0 and the tlscheck.x509_isvalid attribute will be false.

If the date and time specified in the NotBefore field of the x.509 certificate has not yet occurred, tlscheck.time_left will be reported as a positive integer and tlscheck.x509_isvalid attribute will be false.

I will change these docs and add a tlscheck.x509_periodvalid as well to monitor the case when NotBefore is in the future but the chain is valid.

cwegener commented 1 day ago

I will change these docs and add a tlscheck.x509_periodvalid as well to monitor the case when NotBefore is in the future but the chain is valid.

I think that the attribute that describes whether or not the certificate was within its validity period at the time when the check was performed could simply just be one single attribute (e.g. the tlscheck.x509_periodvalid) and this attribute would be true as long as NotBefore < Time.now() < NotAfter is true. And in all other cases tlscheck.x509_periodvalid would be false.

And additionaly there could be another attribute tlscheck.x509_pathvalid which stores results of path validation.

And then tlscheck.x509_isvalid would be the result of tlscheck.x509_periodvalid && tlscheck.x509_pathvalid

Something else to keep in mind for the path validation is the CA trust store (root CA store). Because any sort of path validation also requires to agree on which CA trust store to be used. But an initial implementation could just be limited to the default CA trust store (system store) as decided by crypto/x509 https://pkg.go.dev/crypto/x509#VerifyOptions

Lastly, on the definition of x509_isvalid:

Technically, this would be the combination of these checks:

And some users may want to have config options for any of these checks to be included in the calculation of the final overall 'is valid' attribute value.

cwegener commented 1 day ago

Also, I do wonder if it might be worth building an attribute hierarchy like this, because in the future, the tlsccheckreceiver might not just be limited to report x509 related attributes but also tls related attributes.

+ tlscheck
|\
| + x509
|  \
|   + is_valid
|   |
|   + periodvalid
|   |
|   + pathvalid
 \
  + tls
   \
    + proto_version

The initial implementation can still be limited to just x509 functionality, but with this naming hierarchy in place, it will be smoother to add new attributes into the hierarchy in the future without breaking the existing attributes.

michael-burt commented 1 day ago

Other than now() < NotAfter, any other form of certificate verification is beyond what I need for my use case. Maybe it would be best to leave out x509_periodvalid and x509_pathvalid for the first iteration of this component since there are other considerations when reporting certificate validity.

I agree about the attribute hierarchy, I can change the name of the attributes to x509.subject x509.cn or x509.commonname and x509.issuer.