For what seems like years, the `GardenerFailureRateTooHighOrMissing` alert has fired periodically and then cleared (https://github.com/m-lab/dev-tracker/issues/744). This occurs regularly during historical reprocessing. This change modifies the alert threshold so that the alert fires only when there are more than 3 failures for more than 24h.
The primary observation is that the total number of failed jobs (not the rate) that triggers this alert is rarely above 3. Any failure is cause for concern: the data has not changed, so a failure is either random or static. However, we have not yet investigated the root cause or distinguished failures caused by some random event from failures on the same file every time. Until we do, this change prevents the alert from firing for less critical conditions.
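To make the intended condition concrete, here is a minimal Python sketch of "more than 3 failures for more than 24h". It is not the alert rule changed in this PR; the function name `should_fire`, the constants, and the sample data are hypothetical and only illustrate that a short burst of failures no longer fires while a sustained one does.

```python
from datetime import datetime, timedelta

# Values taken from the description above; the names are hypothetical.
FAILURE_THRESHOLD = 3
HOLD_DURATION = timedelta(hours=24)


def should_fire(samples, threshold=FAILURE_THRESHOLD, hold=HOLD_DURATION):
    """Return True if the failed-job count stayed above `threshold` for at least `hold`.

    `samples` is a time-ordered list of (timestamp, failed_job_count) pairs.
    """
    breach_start = None
    for timestamp, failed in samples:
        if failed > threshold:
            # Remember when the current run of above-threshold samples began.
            if breach_start is None:
                breach_start = timestamp
            if timestamp - breach_start >= hold:
                return True
        else:
            # Any sample at or below the threshold resets the run.
            breach_start = None
    return False


# Made-up hourly samples for illustration only.
start = datetime(2000, 1, 1)
brief_burst = [
    (start + timedelta(hours=h), 5 if 10 <= h < 20 else 0) for h in range(48)
]
sustained = [(start + timedelta(hours=h), 5) for h in range(30)]
at_threshold = [(start + timedelta(hours=h), 3) for h in range(30)]
```

With these samples, `should_fire(brief_burst)` and `should_fire(at_threshold)` are both false, while `should_fire(sustained)` is true.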
I was surprised to see that the failures occur before the historical reset, closer to current dates. The failures always seem to be "daily=false" for the historical pipeline. I could not identify specific dates yet, but I believe I could with more investigation.
See:
An example from October contrasting the date processing of Gardener, the total jobs failed from Gardener, the total jobs, the new threshold, and the original alert threshold.