Metrics to identify stuck jobs

ns-gsa commented 11 months ago

What feature do you want to see added?

We would like a metric or metrics that would help us identify a job that is stuck.

We use Jenkins jobs to run Ansible playbooks over a large number of hosts. At some sites a job might take a long time to run because the underlying hosts are slow, and sometimes a job simply gets stuck.

Today we identify this by looking at the job's console output log to see whether there has been any recent progress. If no new log lines have appeared in the last 10 minutes or so, we know that we need to diagnose that job further.
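To make the manual check concrete, here is a minimal sketch of the same idea in Python. It assumes the console log is a file whose modification time advances whenever new lines are written; the function names and the exact 10-minute threshold are illustrative and not part of Jenkins or the plugin.

```python
import os
import time

# "10 minutes or so" from the description above; an assumed, tunable value.
STALE_AFTER_SECONDS = 10 * 60


def seconds_since_last_write(log_path, now=None):
    """Return how many seconds have passed since the log file was last modified."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(log_path)


def looks_stalled(log_path, stale_after=STALE_AFTER_SECONDS, now=None):
    """True if the log has not been written to for at least `stale_after` seconds."""
    return seconds_since_last_write(log_path, now) >= stale_after
```

A stale log only means the job deserves a closer look; a long-running step that prints nothing would trip the same check.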

Since we have a large and ever-increasing number of sites and jobs, it is not always possible to eyeball job logs. We need some form of metric to identify a stuck job so that we can set up alerting integrations to notify engineers.

We can use default_jenkins_builds_duration_milliseconds_summary_count and default_jenkins_builds_duration_milliseconds_summary_sum to find the average runtime of a job and compare it with default_jenkins_builds_running_build_duration_milliseconds to see whether a job has exceeded the average time, but that won't necessarily mean that the job is stuck.
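The arithmetic behind that comparison can be sketched as follows. This is only an illustration: the function names are hypothetical, and the three inputs stand for values already scraped from the metrics named above.

```python
def average_build_duration_ms(summary_sum, summary_count):
    """Average duration from a summary's _sum and _count; None if no builds yet."""
    if summary_count <= 0:
        return None  # no completed builds, so there is no average to compare with
    return summary_sum / summary_count


def exceeds_average(running_ms, summary_sum, summary_count):
    """True if the running build has already taken longer than the average build."""
    avg = average_build_duration_ms(summary_sum, summary_count)
    return avg is not None and running_ms > avg
```

For example, a summary sum of 600000 ms over 3 builds gives an average of 200000 ms, so a build that has been running for 250000 ms exceeds it. As noted above, that alone does not show the build is stuck.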

Upstream changes

No response

Waschndolos commented 11 months ago

@ns-gsa Do you think a mixture of https://javadoc.jenkins.io/hudson/model/Job.html#isLogUpdated() and https://javadoc.jenkins-ci.org/hudson/model/Executor.html#isLikelyStuck() could give you what you need?

ns-gsa commented 11 months ago

@Waschndolos - the method names and short descriptions for those sound promising.

But I am not sure after looking at

Waschndolos commented 11 months ago

@ns-gsa I think I could only provide metrics based on these methods. Maybe also a metric like "this job takes longer than usual" or something similar, though that might not even be feasible for Jenkins instances with a huge number of jobs. Should I provide that?

ns-gsa commented 11 months ago

@Waschndolos - Apologies for my delay in getting back.

Yes, can we have metric(s) that give the average time taken for a job as well as the time the currently running build has taken, so that we can write all sorts of alert expressions, such as:

current build time more than avg time
current build time more than avg time by X%, etc.
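As a rough sketch of what those two expressions compute (hypothetical function names, with the average and current durations assumed to be available in milliseconds):

```python
def overrun_percent(current_ms, avg_ms):
    """How far the current build is past the average, as a percentage of the average."""
    if avg_ms <= 0:
        raise ValueError("average duration must be positive")
    return (current_ms - avg_ms) / avg_ms * 100.0


def should_alert(current_ms, avg_ms, threshold_percent=0.0):
    """True if the current build exceeds the average by more than threshold_percent.

    threshold_percent=0 corresponds to "more than avg time";
    threshold_percent=X corresponds to "more than avg time by X%".
    """
    return overrun_percent(current_ms, avg_ms) > threshold_percent
```

With an average of 100000 ms, a build at 150000 ms is 50% over, so it alerts for any threshold below 50.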

Also, can we have a metric using isLogUpdated() and isLikelyStuck() for identifying stuck jobs? I can provide feedback on how well this works after observing the metric's behavior for a while in our production setup.

Waschndolos commented 10 months ago

@ns-gsa I'll test the PR tomorrow in my company's test Jenkins if I get the time.

Waschndolos commented 10 months ago

Memo to me:

Only provide default_jenkins_builds_job_log_updated for running jobs
Rename default_jenkins_likely_stuck and insert job name

jenkinsci / prometheus-plugin

Metrics to identify stuck jobs #567
