redhat-cop / cert-utils-operator

Set of functionalities around certificates packaged in a Kubernetes operator
Apache License 2.0
94 stars 35 forks source link

Unable to stop expiry alert #122

Closed darren-oxford closed 2 years ago

darren-oxford commented 2 years ago

Having enabled expiry alert monitoring by adding the cert-utils-operator.redhat-cop.io/generate-cert-expiry-alert: 'true' annotation. I now no longer want monitoring on that tls secret but alerts continue.

I tried setting the annotation to cert-utils-operator.redhat-cop.io/generate-cert-expiry-alert: 'false' and deleting the annotation but neither stop the alerts. Is there a way to turn off the expiry monitoring without deleting the secret completely?

bromaniac commented 2 years ago

I have somewhat the same problem. I have not added the cert-utils-operator.redhat-cop.io/generate-cert-expiry-alert: 'true' annotation on any tls secret yet but get alerts that some certificates are expiring anyway.

I've also tried setting cert-utils-operator.redhat-cop.io/generate-cert-expiry-alert: 'false' on some certificates to stop the alerts but like darren-oxford wrote that doesn't help.

raffaelespazzoli commented 2 years ago

this sounds like a bug. a likely work around is to restart the operator.

raffaelespazzoli commented 2 years ago

also notice that with the last feature we released which sends alerts via Prometheus, you probably don't need the event-based alerts anymore.

darren-oxford commented 2 years ago

I can confirm that restarting the operator doesn't work ironically it then results in a doubling up of the alerts. However @raffaelespazzoli mentioning Prometheus makes me realise that the alerts are Prometheus alerts, these seem to be configured for all certificates and that is why @bromaniac is receiving notifications for certificates that do not have expiry-alert annotation. It would therefore seem that it may just be a documentation issue to explain about the change to Prometheus notifications.

darren-oxford commented 2 years ago

Ok, so looking at the recent changes to the code I can see that the operator now generates the following metrics for all certificates.

certutils_certificate_issue_time certutils_certificate_expiry_time cert:validity_duration:sec cert:time_to_expiration:sec

Prometheus is using these to generate alerts, so it would seem that rather than do this for every certificate we should ideally have an annotation to determine if a certificate should have these metrics generated. Or alternatively generate them for all certs by default but exempt those that have an exclusion annotation.

Why would this be useful?

Sometimes we generate a new tls secret instead of updating the existing one so that we can easily roll back to the old certificate in the event that something does not go as planned when using the new certificate, so the old one will still be around for a couple of weeks.

raffaelespazzoli commented 2 years ago

duplicates #123

ok, so idea behind the new alert feature is to generate alert for all certificates. I thought that this would be useful even for automatically rotated certificates because the automation that is supposed to rotate the certs might be failing. So I am not too inclined to remove the feature altogether. Maybe instead we should think about two options:

  1. suppress the alerts in alert manager. I am not too familiar with that tool, but there should be a way to selectively suppress alerts.
  2. write a more tuned alert condition. Thought 85 and 95 percent where two good threshold for cert rotation alert. perhaps we can increase them. Or perhaps we could add a condition to the alert expression that checks on for the existence of a label in the secret. If the label exists the alert is suppressed.
darren-oxford commented 2 years ago

@raffaelespazzoli I agree that this is a good addition, it is better to use Prometheus for notifications.

The main issue is that the documentation is unclear, so this behaviour is unexpected and currently causing user confusion. People reading the README.md will expect to get notifications only when annotation 'cert-utils-operator.redhat-cop.io/generate-cert-expiry-alert: "true"' is added to a tls secret.

I suggest the README.md be updated to include under "Alerting when a certificate is about to expire", something like.. "Certificate expiry alerts are automatically generated by Prometheus for all certificates when the remaining certificate lifetime is less than 85%, it is no longer necessary to add Alerting when a certificate is about to expire cert-utils-operator.redhat-cop.io/generate-cert-expiry-alert: "true" when using Prometheus.

Secondary to this is the best way to disable the notifications for those situations where they are unwanted. For some people I guess they don't want notifications at all, they may have installed the operator for another feature such as adding cert info to tls secrets. (Incidentally that was the main reason I installed it for quick viewing of cert info).

For others it may be that they want to disable notifications for certain certificates, ideally a way to do this in an automated way, hence the suggestion to stop the metrics generation with an annotation. Where I can see it being a nuisance on our cluster is that when users install certificates for their application and then don't bother renewing it. If they don't care about their application, neither do I as cluster admin, but it is not my place to delete their applications. So for me I would prefer a way to disable the feature cluster wide and enable it with an annotation either per namespace or per secret.