m-lab / prometheus-support

Prometheus configuration for M-Lab running on GKE
Apache License 2.0

Update limit for GardenerFailureRateTooHighOrMissing to prevent false alarms #1023

Closed stephen-soltesz closed 9 months ago

stephen-soltesz commented 9 months ago

For what seems like years, the GardenerFailureRateTooHighOrMissing alert has fired periodically and then cleared (https://github.com/m-lab/dev-tracker/issues/744). This occurs regularly during the historical reprocessing. This change modifies the alert threshold so that the alert fires only when there are more than 3 failures for more than 24h.
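As a rough illustration of the kind of rule this describes, here is a minimal sketch of a Prometheus alerting rule with a "more than 3 failures for more than 24h" condition. It is not the rule from this change: the group name, the metric name `gardener_failed_jobs`, and the label and annotation values are all placeholders, and the "missing" half of the alert is left out.

```yaml
groups:
  - name: gardener-alerts-sketch  # hypothetical group name
    rules:
      - alert: GardenerFailureRateTooHighOrMissing
        # `gardener_failed_jobs` is a placeholder metric name, not taken from
        # the real rules file; substitute the series that counts failed jobs.
        expr: sum(gardener_failed_jobs) > 3
        # The condition must hold continuously for 24h before the alert fires.
        for: 24h
        labels:
          severity: ticket  # assumed label value
        annotations:
          summary: More than 3 Gardener jobs have been failing for over 24h.
```

With a `for:` clause, a brief spike above the threshold only puts the alert into the pending state; it fires only if the expression stays true for the whole 24h window.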

The primary observation is that the total number of failed jobs (not the rate) that triggers this alert is rarely above 3. Any failure is cause for concern: the data has not changed, so a failure is either random or static. But until we can investigate the root cause and distinguish failures caused by some random event from failures on the same file every time, we can prevent the alert from firing for less critical conditions.

I was surprised to see that the failures occur before the historical reset, closer to current dates. The failures always seem to be "daily=false" for the historical pipeline. I could not identify specific dates yet, but I believe I could with more investigation.

See:

