Closed BaCaRoZzo closed 1 year ago
Hi @BaCaRoZzo, thanks!
I completely understand your situation. What I didn't understand very well is the part where you mention that rundeck_project_execution_status goes stale...
Just to be clear, rundeck_project_execution_status creates 5 series for each execution, one per job status, for example:
# HELP rundeck_project_execution_status Rundeck Project ProjectName Execution Status
# TYPE rundeck_project_execution_status gauge
rundeck_project_execution_status{execution_id="1422",execution_type="scheduled",instance_address="rundeck:4440",job_group="",job_id="87e1c9a3-112f-4606-af3f-11eb78415b20",job_name="Fail after 10s",project_name="test-1",status="succeeded",user="admin"} 0.0
rundeck_project_execution_status{execution_id="1422",execution_type="scheduled",instance_address="rundeck:4440",job_group="",job_id="87e1c9a3-112f-4606-af3f-11eb78415b20",job_name="Fail after 10s",project_name="test-1",status="running",user="admin"} 0.0
rundeck_project_execution_status{execution_id="1422",execution_type="scheduled",instance_address="rundeck:4440",job_group="",job_id="87e1c9a3-112f-4606-af3f-11eb78415b20",job_name="Fail after 10s",project_name="test-1",status="failed",user="admin"} 1.0
rundeck_project_execution_status{execution_id="1422",execution_type="scheduled",instance_address="rundeck:4440",job_group="",job_id="87e1c9a3-112f-4606-af3f-11eb78415b20",job_name="Fail after 10s",project_name="test-1",status="aborted",user="admin"} 0.0
rundeck_project_execution_status{execution_id="1422",execution_type="scheduled",instance_address="rundeck:4440",job_group="",job_id="87e1c9a3-112f-4606-af3f-11eb78415b20",job_name="Fail after 10s",project_name="test-1",status="unknown",user="admin"} 0.0
The series whose value is 1.0 carries the current status of the job (succeeded, running, failed, aborted, or unknown).
With that in mind, it's possible to do things like count the number of failed executions per job during the day:
sum by (job_name) (max_over_time(rundeck_project_execution_status{status="failed"}[1d]))
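As a rough illustration of how this metric could drive an alert, here is a minimal sketch of a Prometheus alerting rule. It assumes the exporter only exposes the latest execution of each job, so the alert would resolve once a later execution succeeds. The group name, alert name, severity label, and annotation text are hypothetical and not part of the exporter.

```yaml
groups:
  - name: rundeck-jobs            # hypothetical group name
    rules:
      - alert: RundeckJobFailed   # hypothetical alert name
        # Fires while the exposed execution of a job has status="failed" set to 1.
        # Assumes only the latest execution per job is exposed by the exporter.
        expr: max by (project_name, job_name) (rundeck_project_execution_status{status="failed"}) == 1
        labels:
          severity: warning       # hypothetical label value
        annotations:
          summary: 'Rundeck job {{ $labels.job_name }} in project {{ $labels.project_name }} is failing'
```

Aggregating with max by (project_name, job_name) drops the execution_id label, so successive executions of the same job map to the same alert instead of opening a new one each time.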
Your suggestion of having a metric that retrieves the status of the latest execution is exactly what rundeck_project_execution_status does.
My suggestion is that you explore this metric a little more in conjunction with Alertmanager attributes like group_by, group_wait and group_interval. I'm confident that you can accomplish your goals using only that.
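For reference, the grouping attributes mentioned above live on an Alertmanager route. The fragment below is only a sketch: the receiver name and the timing values are assumptions to be tuned, and the receiver itself (for example a Slack receiver) is assumed to be defined elsewhere in the same configuration file.

```yaml
route:
  receiver: rundeck-slack         # hypothetical receiver, defined elsewhere in the config
  # Alerts sharing these label values are batched into one notification.
  group_by: ['alertname', 'project_name', 'job_name']
  group_wait: 30s                 # wait before sending the first notification for a new group
  group_interval: 5m              # wait before notifying about new alerts added to an existing group
  repeat_interval: 4h             # wait before re-sending a notification for a group that is still firing
```

With a setup like this, a job that fails every few minutes would produce one notification per group and then periodic reminders, rather than one message per failed execution.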
Oh, and the project is completely open for contributions :slightly_smiling_face:!!!
Hi Phillipe, thanks for the very quick reply, much appreciated.
The dimensions that rundeck_project_execution_status defines are clear to me; we also have other exporters that use a dimension to indicate the status. While I was reading your reply I had the epiphany that I hadn't checked the up metric. I think the gaps I'm seeing are caused by some missed scrapes, which don't trigger our general target_down alert but do cause the metrics to go stale.
Snap! Now things start to make sense. I should have checked that first. 😄
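One possible way to catch intermittent missed scrapes that a plain target-down alert does not pick up is to look at the average of up over a window. This is a sketch only: the scrape job label value rundeck_exporter, the group and alert names, and the 15m window are assumptions.

```yaml
groups:
  - name: rundeck-exporter-scrapes        # hypothetical group name
    rules:
      - alert: RundeckExporterScrapesFlaky  # hypothetical alert name
        # up is 1 for a successful scrape and 0 for a failed one, so an
        # average below 1 means at least one scrape failed within the window.
        expr: avg_over_time(up{job="rundeck_exporter"}[15m]) < 1
        labels:
          severity: warning               # hypothetical label value
        annotations:
          summary: 'Scrapes of {{ $labels.instance }} failed at least once in the last 15 minutes'
```

When a scrape fails, Prometheus marks the series from that target as stale, which matches the gaps described above.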
Let me fix that and play more with the metrics once they are no longer stale. I'll get back to you. Please leave the issue open for now, as I may have further questions and the issue thread can be a useful documentation point for other users. Thanks a lot!
Oh, man... Now I understand what you meant about the metric going stale. Yeah, the up metric is surely going to help in this case.
Yeah, I can keep the issue open without a problem.
Hereby closing.
After the necessary fix to the scraping configuration the exporter works as expected. I had other questions but I cleared them up by fiddling around with the exporter itself. Thanks a lot for the support @phsmith!
Hi and thanks for the awesome exporter!
We would like to replace the notification system of rundeck with alerts generated by the AlertManager. When a job starts to fail for external reasons (e.g. an endpoint with network issues), rundeck sends a notification for each and every failure. For jobs with a sub-hourly schedule, that floods our slack channels. In those cases our devs have to temporarily disable the failure notification in rundeck itself, which is both annoying and impractical.
We were thinking it would be nice to alert on failure via AlertManager. The idea is that an alert starts to fire when a job fails and resolves as soon as the job starts to succeed again. The advantages for us would be multiple:
Point 3 is particularly appreciated by our devs, who are used to handling alerts themselves and silencing them according to their needs.
At the beginning we tried to play around with the rundeck_project_execution_status metric, but it is clearly not suited for this. It is available for the duration of the related execution and then goes stale. We also tried to aggregate with different operators, but to no avail.
Any suggestions on an operator we can use for the purpose? Assuming you also think rundeck_project_execution_status is not viable and there is no metric for this use case, would you consider adding one and/or accepting contributions for this? We think a metric that always returns the result of the latest execution for each job would be perfect for this purpose. It could be hidden behind a flag so that the general use case for the exporter is kept intact. Thanks in advance.