samber / awesome-prometheus-alerts

🚨 Collection of Prometheus alerting rules
https://samber.github.io/awesome-prometheus-alerts/

KubernetesJobCompletion will always fire while job is running #157

Closed · djhoese closed this 3 years ago

djhoese commented 4 years ago

I could be misunderstanding the purpose of this, but I'm seeing some weird behavior with this rule. The rule is currently defined as:

```yaml
  - alert: KubernetesJobCompletion
    expr: kube_job_spec_completions - kube_job_status_succeeded > 0 or kube_job_status_failed > 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Kubernetes job completion (instance {{ $labels.instance }})"
      description: "Kubernetes Job failed to complete\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
```

kube_job_spec_completions will always be some number 1 or higher: it is the number of completions configured for the job, not the number of completed pods for the job (that would be kube_job_complete). While the job is running, the first part of this expression will therefore always be true (e.g. 1 - 0 > 0). So if the job doesn't finish within 5 minutes, this alert will always fire and then resolve later when the job does succeed.
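To make the arithmetic concrete, here is a hypothetical set of sample values for a single-completion Job that is still running (the label values are made up); the first half of the expression stays positive for the whole run:

```promql
# Hypothetical samples while the job is still running:
#   kube_job_spec_completions{job_name="my-job"} = 1   (completions configured in the spec)
#   kube_job_status_succeeded{job_name="my-job"} = 0   (nothing has finished yet)
#   kube_job_status_failed{job_name="my-job"}    = 0
#
# First half of the alert expression:
#   kube_job_spec_completions - kube_job_status_succeeded = 1 - 0 = 1 > 0
# so the condition is true for the entire run, and the alert fires as soon as
# the job has been running longer than `for: 5m`.
kube_job_spec_completions - kube_job_status_succeeded > 0 or kube_job_status_failed > 0
```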

I think this should be changed to kube_job_complete, right? Otherwise the for: duration would have to be increased to however long a job could take to succeed, which doesn't seem very accurate or flexible.

djhoese commented 4 years ago

See https://kubernetes.io/docs/concepts/workloads/controllers/job/#parallel-jobs for spec.completions information.

djhoese commented 4 years ago

Hm, I'm not sure how kube_job_complete is supposed to work, but maybe kube_job_spec_completions - (kube_job_status_active + kube_job_status_succeeded) would be more accurate? It still misses the case where a Job waits for multiple completions (multiple pods) but not all of the pods are active or complete yet.
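As a minimal sketch only (the rule from the top of the thread with the expression proposed above swapped in; the metric names are the standard kube-state-metrics ones):

```yaml
  # Sketch only: the original rule with the adjusted expression from this comment.
  # "Desired completions minus (active + succeeded) > 0" is treated as "not done yet",
  # which, as noted above, still misses pods that are neither active nor complete.
  - alert: KubernetesJobCompletion
    expr: kube_job_spec_completions - (kube_job_status_active + kube_job_status_succeeded) > 0 or kube_job_status_failed > 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Kubernetes job completion (instance {{ $labels.instance }})"
      description: "Kubernetes Job failed to complete\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
```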

paulfantom commented 4 years ago

In the kubernetes-mixin, job failure and job completion are decoupled to prevent issues like the one you have. You can see an example in https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/master/alerts/apps_alerts.libsonnet#L237-L264 or the generated versions in https://monitoring.mixins.dev/kubernetes/ (the KubeJobCompletion and KubeJobFailed alerts).
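Roughly, the decoupling looks like the sketch below (paraphrased, not copied verbatim; the linked libsonnet and generated rules are the authoritative versions):

```yaml
# Paraphrased sketch of the two decoupled kubernetes-mixin alerts; see the links
# above for the exact expressions, selectors, and annotations.
- alert: KubeJobCompletion
  # Job is taking a very long time to reach its desired number of completions.
  expr: kube_job_spec_completions - kube_job_status_succeeded > 0
  for: 12h
  labels:
    severity: warning
- alert: KubeJobFailed
  # Job has actually failed; reported separately from slow completion.
  expr: kube_job_failed{condition="true"} > 0
  for: 15m
  labels:
    severity: warning
```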

djhoese commented 4 years ago

I see. So the main thing there for the KubeJobCompletion is that it has a 12h wait time on it. Thanks. Maintainers, feel free to close, or let me know if you want a pull request with some updates (and tell me which files have to be changed to do it properly for this repository).

paulfantom commented 4 years ago

> So the main thing there for the KubeJobCompletion is that it has a 12h wait time on it.

Yes, plus it has a different severity.

samber commented 3 years ago

Hi @djhoese and @paulfantom

I made a small change to the alert template (custom for: parameter) and fixed the query ;)

https://github.com/samber/awesome-prometheus-alerts/commit/3a352d08dc5698c55fc05300d54f48933aae3012 + https://github.com/samber/awesome-prometheus-alerts/commit/a6bf7d11681048b059dae3c3a31f9798cd08899e
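For readers arriving later: without reproducing those commits verbatim, the general direction discussed in this thread (alert on actual failure via kube_job_status_failed, and make the for: duration configurable) could look roughly like this hypothetical rule; the alert name, severity, and durations here are illustrative only:

```yaml
  # Illustrative only; see the two commits linked above for the actual change.
  - alert: KubernetesJobFailed
    # Fire on a genuinely failed Job instead of on "has not succeeded yet".
    expr: kube_job_status_failed > 0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: "Kubernetes Job failed (instance {{ $labels.instance }})"
      description: "Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
```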