ukwa / ukwa-monitor

Dashboard and monitoring system for the UK Web Archive

Add alerts for Airflow #35

Closed anjackson closed 2 years ago

anjackson commented 2 years ago

Airflow on Ingest has been integrated with monitoring, in that we are now recording metrics, e.g.

http://monitor-prometheus.api.wa.bl.uk/graph?g0.expr=airflow_dag_last_status&g0.tab=1&g0.stacked=0&g0.show_exemplars=0&g0.range_input=1h

Here the airflow_dag_last_status metric records the outcome of the most recent run for each workflow (a.k.a. DAG). We have an alert for this, but it doesn't fire because the for: 2hr period is too long:

https://github.com/ukwa/ukwa-monitor/blob/79ccc4ba115248bc2a37f5662ead74ecf93b105f/monitor/prometheus/alert.rules.yml#L133

Could you tweak it down to for: 5m, so we know sooner if jobs are failing?
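
For reference, a rough sketch of what the adjusted rule might look like. This is not copied from alert.rules.yml: the group name, alert name, severity label and the expression are all assumptions for illustration. In particular it assumes airflow_dag_last_status is 1 for a successful run and something else otherwise, which should be checked against the real metric. The only part that comes from this issue is for: 5m. Prometheus keeps an alert in the pending state until its expression has been continuously true for the whole for: period, so with a long period a failure that clears (or a series that disappears) before the period elapses never reaches firing.

```yaml
groups:
  - name: airflow-example                 # hypothetical group name
    rules:
      - alert: airflow_dag_last_run_failed   # hypothetical alert name
        # Assumption: a value of 1 means the last run succeeded.
        expr: airflow_dag_last_status != 1
        # Fire once the condition has held for 5 minutes (was 2 hours).
        for: 5m
        labels:
          severity: severe                # hypothetical severity label
        annotations:
          summary: "Most recent run of an Airflow DAG did not succeed"
          description: "airflow_dag_last_status is {{ $value }}"
```

If it is useful, a file like this can be checked with promtool check rules before deploying.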

GilHoggarth commented 2 years ago

Updated in latest commit. Will tag and release to master.