[Closed] djhoese closed this issue 3 years ago.
See https://kubernetes.io/docs/concepts/workloads/controllers/job/#parallel-jobs for information on `spec.completions`.
Hm, I'm not sure how `kube_job_complete` is supposed to work, but maybe `kube_job_spec_completions - (kube_job_status_active + kube_job_status_succeeded)` would be more accurate? It still misses the case where a Job waits for multiple completions (multiple pods) but not all of the pods are active or complete yet.
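As a sketch, the proposed expression could be dropped into a rule like the following. The alert name, `for:` duration, and labels here are illustrative placeholders, not taken from the repository:

```yaml
groups:
  - name: kubernetes-jobs
    rules:
      # Fires when a Job still has completions outstanding that are neither
      # running nor already succeeded. Caveat noted above: a multi-completion
      # Job whose remaining pods have not started yet will still match.
      - alert: KubernetesJobNotCompleted
        expr: kube_job_spec_completions - (kube_job_status_active + kube_job_status_succeeded) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Job {{ $labels.namespace }}/{{ $labels.job_name }} has not completed"
```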
In the kubernetes-mixin, job failure and job completion are decoupled to prevent issues like the one you have. You can see an example in https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/master/alerts/apps_alerts.libsonnet#L237-L264 or the generated versions in https://monitoring.mixins.dev/kubernetes/ (the KubeJobCompletion and KubeJobFailed alerts).
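Paraphrased from the mixin linked above (simplified; the actual rules carry additional label matchers and selectors), the decoupling looks roughly like this:

```yaml
# KubeJobCompletion: a long `for:` gives slow Jobs time to finish before alerting.
- alert: KubeJobCompletion
  expr: kube_job_spec_completions - kube_job_status_succeeded > 0
  for: 12h
  labels:
    severity: warning
# KubeJobFailed: fires only on Jobs that have actually failed.
- alert: KubeJobFailed
  expr: kube_job_failed{condition="true"} > 0
  labels:
    severity: warning
```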
I see. So the main thing there for the KubeJobCompletion is that it has a 12h wait time on it. Thanks. Maintainers, feel free to close, or let me know if you want a pull request with some updates (and tell me which files have to be changed to do it properly for this repository).
> So the main thing there for the KubeJobCompletion is that it has a 12h wait time on it.
Yes, plus it has a different severity.
Hi @djhoese and @paulfantom
I made a small change to the alert template (custom `for:` parameter) and fixed the query ;)
https://github.com/samber/awesome-prometheus-alerts/commit/3a352d08dc5698c55fc05300d54f48933aae3012 + https://github.com/samber/awesome-prometheus-alerts/commit/a6bf7d11681048b059dae3c3a31f9798cd08899e
I could be misunderstanding the purpose of this, but I'm seeing some weird behavior with this rule. The rule is currently defined as:

`kube_job_spec_completions` will always be some number 1 or higher. It is the number of completions configured for the Job, not the number of completed pods for the Job; that should be `kube_job_complete`. While the Job is running, the first part of this expression will always be true (1 - 0 > 0). So if the Job doesn't finish within 5 minutes, this alert will always fire and then get resolved later when the Job does succeed.

I think this should be changed to `kube_job_complete`, right? Otherwise the rule's `for:` should be adjusted to be as long as a Job could take to succeed (not very accurate or flexible, I think).
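A sketch of the suggested change, assuming the intent is "alert on Jobs that have not actually completed" (the alert name and `for:` window are placeholders, not taken from the repository; `kube_job_complete` carries a `condition` label in kube-state-metrics):

```yaml
# Matches Jobs that declare completions but whose Complete condition is not
# yet true, instead of comparing against the configured completion count.
- alert: KubernetesJobNotComplete
  expr: kube_job_spec_completions unless on (namespace, job_name) (kube_job_complete{condition="true"} == 1)
  for: 12h
  labels:
    severity: warning
```

The `unless on (namespace, job_name)` keeps only the Jobs for which no matching "complete" series exists, so the alert tracks actual completion rather than a count that is fixed at Job creation.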