Alert on the failure of a single job

BaCaRoZzo commented 1 year ago

Hi and thanks for the awesome exporter!

We would like to replace the notification system of Rundeck with alerts generated by Alertmanager. When a job starts to fail for external reasons - e.g. an endpoint with network issues - Rundeck sends a notification for each and every failure. For jobs with a sub-hourly schedule, that floods our Slack channels. In those cases our devs have to temporarily disable the failure notification in Rundeck itself, which is both annoying and impractical.

We were thinking it would be nice to alert on failure via Alertmanager. The idea is that an alert starts to fire when a job fails and resolves as soon as the job starts to succeed again. This would give us several advantages:

1. a single notification on "failure start" and "failure end"
2. the possibility to have richer and nicer notifications thanks to AM receivers
3. the possibility to reuse the automation we have in place to spawn silences on the failing jobs

Point 3 is particularly appreciated by our devs, who are used to handling alerts themselves and silencing them according to their needs.

At the beginning we tried to play around with the rundeck_project_execution_status metric but it is clearly not suited for this. It is available for the time of the related execution and then goes stale. We also tried to aggregate with different operators but to no avail.

Any suggestions on an operator we can use for the purpose? Assuming you also think rundeck_project_execution_status is not viable and there is no metric for this use case, would you consider adding one and/or accept contributions for this? We think a metric that always returns the result of the latest execution for each job would be perfect for this purpose. It could be hidden behind a flag so that the general use case for the exporter is kept intact. Thanks in advance.

phsmith commented 1 year ago

Hi @BaCaRoZzo, thanks!

I completely understand your situation. What I didn't quite understand is the part where you mention that

rundeck_project_execution_status goes stale...

Just to be clear, rundeck_project_execution_status creates 5 time series per job execution, one for each status, for example:

# HELP rundeck_project_execution_status Rundeck Project ProjectName Execution Status
# TYPE rundeck_project_execution_status gauge
rundeck_project_execution_status{execution_id="1422",execution_type="scheduled",instance_address="rundeck:4440",job_group="",job_id="87e1c9a3-112f-4606-af3f-11eb78415b20",job_name="Fail after 10s",project_name="test-1",status="succeeded",user="admin"} 0.0
rundeck_project_execution_status{execution_id="1422",execution_type="scheduled",instance_address="rundeck:4440",job_group="",job_id="87e1c9a3-112f-4606-af3f-11eb78415b20",job_name="Fail after 10s",project_name="test-1",status="running",user="admin"} 0.0
rundeck_project_execution_status{execution_id="1422",execution_type="scheduled",instance_address="rundeck:4440",job_group="",job_id="87e1c9a3-112f-4606-af3f-11eb78415b20",job_name="Fail after 10s",project_name="test-1",status="failed",user="admin"} 1.0
rundeck_project_execution_status{execution_id="1422",execution_type="scheduled",instance_address="rundeck:4440",job_group="",job_id="87e1c9a3-112f-4606-af3f-11eb78415b20",job_name="Fail after 10s",project_name="test-1",status="aborted",user="admin"} 0.0
rundeck_project_execution_status{execution_id="1422",execution_type="scheduled",instance_address="rundeck:4440",job_group="",job_id="87e1c9a3-112f-4606-af3f-11eb78415b20",job_name="Fail after 10s",project_name="test-1",status="unknown",user="admin"} 0.0

The series whose value is 1.0 indicates the current status of the job (succeeded, running, failed, aborted, or unknown).

With that in mind, it's possible to do things like count the number of failed executions per job over the last day:

sum by (job_name) (max_over_time(rundeck_project_execution_status{status="failed"}[1d]))

Your suggestion of having a metric that retrieves the status of the latest execution is exactly what rundeck_project_execution_status does.
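As a minimal sketch of how this metric could drive a "fires on failure, resolves on success" alert, a Prometheus alerting rule along the following lines may be a starting point. It assumes the exporter exposes only the latest execution of each job, as described above. The group name, alert name, severity label, and annotation text are hypothetical. Aggregating away execution_id is a deliberate choice here, so that consecutive failed executions of the same job map to the same alert rather than to a new one each time.

```yaml
groups:
  - name: rundeck-jobs                  # hypothetical group name
    rules:
      - alert: RundeckJobFailing        # hypothetical alert name
        # Drop execution_id (and the other per-execution labels) so the
        # alert identity stays stable across consecutive failures.
        expr: |
          max by (project_name, job_group, job_name) (
            rundeck_project_execution_status{status="failed"}
          ) == 1
        labels:
          severity: warning             # hypothetical label value
        annotations:
          summary: 'Rundeck job {{ $labels.job_name }} in project {{ $labels.project_name }} is failing'
```

Whether the alert briefly resolves while a new execution is still running depends on how the exporter reports in-progress executions, which this sketch does not address.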

My suggestion is that you explore this metric a little more, in conjunction with Alertmanager attributes like group_by, group_wait and group_interval. I'm confident that you can accomplish your goals using it alone.
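To illustrate those attributes, here is a hedged sketch of an Alertmanager route that groups notifications per job and sends a message when the alert resolves. The receiver name, channel, and all timing values are assumptions chosen for illustration, not recommendations from this project.

```yaml
route:
  receiver: slack-rundeck               # hypothetical receiver name
  # One notification group per failing job.
  group_by: ['alertname', 'project_name', 'job_name']
  group_wait: 30s                       # wait before the first notification of a new group
  group_interval: 5m                    # wait before notifying about changes to a group
  repeat_interval: 12h                  # re-notify while the alert keeps firing

receivers:
  - name: slack-rundeck
    slack_configs:
      # The Slack webhook URL must also be configured, either here via
      # api_url or globally via slack_api_url.
      - channel: '#rundeck-alerts'      # hypothetical channel
        send_resolved: true             # also notify on "failure end"
```

With send_resolved enabled, a job that fails repeatedly should produce one notification when the failure starts and one when it ends, plus reminders at repeat_interval.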

Oh, and the project is completely open for contributions :slightly_smiling_face:!!!

BaCaRoZzo commented 1 year ago

Hi Phillipe, thanks for the very quick reply, much appreciated.

The dimensions that rundeck_project_execution_status defines are clear to me; we also have other exporters that use a dimension to indicate the status. While I was reading your reply I had the epiphany that I hadn't checked the up metric. I think the gaps I'm seeing are caused by some missed scrapes, which don't trigger our general target_down alert but cause the metrics to go stale. Snap! Now things start to make sense. I should have checked that first. 😄
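For other users hitting the same gaps, intermittent scrape failures of this kind can be surfaced with a rule like the sketch below. The scrape job name "rundeck", the alert name, and the window lengths are assumptions; adjust them to the local scrape configuration. The idea is that averaging up over a window catches occasional missed scrapes that a plain up == 0 check with a long for clause would not.

```yaml
groups:
  - name: rundeck-exporter-scrapes      # hypothetical group name
    rules:
      - alert: RundeckExporterScrapesMissed   # hypothetical alert name
        # up is 1 for a successful scrape and 0 for a failed one, so an
        # average below 1 means at least one scrape failed in the window.
        expr: avg_over_time(up{job="rundeck"}[15m]) < 1
        for: 5m
        annotations:
          summary: 'Some scrapes of {{ $labels.instance }} failed in the last 15 minutes'
```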

Let me fix that and play more with the metrics once they are no longer stale. I'll get back to you. Please leave the issue open for now, as I may have further questions and the issue thread can be a useful documentation point for other users. Thanks a lot!

phsmith commented 1 year ago

Oh, man... now I understand what you meant about the metric going stale. Yeah, the up metric is surely going to help in this case.

Yeah, I can keep the issue open, no problem.

BaCaRoZzo commented 1 year ago

Hereby closing.

After the necessary fix to the scraping configuration, the exporter works as expected. I had other questions, but I cleared them up by fiddling around with the exporter itself. Thanks a lot for the support @phsmith!

phsmith / rundeck_exporter

Alert on the failure of a single job #62