Closed anjackson closed 3 years ago
Airflow on Ingest has been integrated with monitoring, in the sense that we are recording metrics, e.g.
http://monitor-prometheus.api.wa.bl.uk/graph?g0.expr=airflow_dag_last_status&g0.tab=1&g0.stacked=0&g0.show_exemplars=0&g0.range_input=1h
The airflow_dag_last_status metric records the outcome of the most recent run for each workflow (a.k.a. DAG). We have an alert for this, but it doesn't fire because the for: 2hr period is too long:
https://github.com/ukwa/ukwa-monitor/blob/79ccc4ba115248bc2a37f5662ead74ecf93b105f/monitor/prometheus/alert.rules.yml#L133
Could you tweak it down to for: 5m so we know sooner if jobs are failing?
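For reference, a minimal sketch of what the adjusted rule might look like. This is not copied from alert.rules.yml: the group name, alert name, the comparison in expr, the severity label and the dag_id label are all assumptions for illustration, and only the metric name and the for: 5m value come from this issue. Note that Prometheus writes hour durations as h (e.g. 2h).

```yaml
groups:
  - name: airflow                     # hypothetical group name
    rules:
      - alert: AirflowDagLastRunFailed   # hypothetical alert name
        # Assumption: a value of 0 means the most recent run failed.
        # Check the exporter to confirm how the metric encodes status.
        expr: airflow_dag_last_status == 0
        # The expression must stay true continuously for this long
        # before the alert moves from pending to firing.
        for: 5m
        labels:
          severity: warning           # hypothetical label
        annotations:
          # dag_id is an assumed label name on the metric.
          summary: "Airflow DAG {{ $labels.dag_id }} last run failed"
```

With a long for: period, any successful run inside the window resets the pending state, so a shorter window is more likely to catch a failure before the next run clears it.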
Updated in latest commit. Will tag and release to master.