m-lab / prometheus-support

Prometheus configuration for M-Lab running on GKE
Apache License 2.0

Wait longer before firing GardenerHistoricalThroughputIsStalled alert #1018

Closed stephen-soltesz closed 10 months ago

stephen-soltesz commented 10 months ago

The GardenerHistoricalThroughputIsStalled alert fired on Aug 15, 2023 and again on Nov 12, clearing automatically both times. Both dates correspond to times when historical processing reset to the start of the historical data in 2016. Historical processing resets occur monthly (as of today), however, and the alert does not fire on every reset.

After a historical reset, there is no ndt7 data to process for 2016-2020. It takes the pipeline about 1.3 days to fully process the data prior to 2020, at which point ndt7 data appears again. The alert appears to fire when ndt7 processing has resumed (increase(gardener_jobs_total{status="success", daily="false"}[1d]) > 0) but bq_gardener_historical_throughput is still zero. That metric comes from the 3h bigquery exporter, so it may lag by up to 3h, and other queries may delay it further, past the 4h alert hold time.

This change therefore increases the hold time for the alert from 4h to 8h.
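
For illustration only, a rule with the longer hold time might look roughly like the sketch below. This is not the rule from this repository: the group name, the way the two conditions are joined (`and on()`), and the annotation text are assumptions. Only the alert name, the two metric expressions, and the 8h hold time come from the description above.

```yaml
groups:
  # Hypothetical group name, used only for this sketch.
  - name: gardener-historical-sketch
    rules:
      - alert: GardenerHistoricalThroughputIsStalled
        # Assumed combination: historical jobs are succeeding again, yet the
        # BigQuery-derived throughput metric still reports zero.
        expr: |
          increase(gardener_jobs_total{status="success", daily="false"}[1d]) > 0
            and on()
          bq_gardener_historical_throughput == 0
        # Hold time raised from 4h to 8h so that the 3h exporter interval,
        # plus any extra delay from other queries, can elapse before firing.
        for: 8h
        annotations:
          # Hypothetical summary text.
          summary: Historical throughput has been zero for 8h while jobs succeed.
```

With `for: 8h`, the condition must hold continuously for eight hours before the alert fires, so a stale zero from one or two delayed exporter cycles should clear before anyone is notified.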

