Can confirm - seeing this on some pods on free-int as well
I believe this would be tricky today, as restartPolicy is not exposed through kube-state-metrics (/cc @brancz for verification).
What status exactly are those job pods in, and what failure/exit mode is causing them to be alerted on? Looking at the alerting expression, job pods that exit with Succeeded should not fire any alerts:
sum by(namespace, pod) (kube_pod_status_phase{job="kube-state-metrics",phase!~"Running|Succeeded"}) > 0
Wouldn't it make more sense to modify those job pods to exit with a Succeeded status?
Jobs failing doesn't mean the cluster is broken. There are lots of pods that can run to completion and exit with status code 0 as part of normal operation.
In this case, these are CI jobs running on the cluster that use restartPolicy=Never pods. We'll always have some level of failing builds on a cluster as well (OpenShift builds run as pods).
It's likely that we could constrain this to only the system namespaces (kube-* / openshift-*), in which case the check is probably fine for now.
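For illustration, a rough sketch of what a namespace-scoped variant of the expression could look like (the exact namespace regex is an assumption here, not what the mixin ships):
sum by(namespace, pod) (kube_pod_status_phase{job="kube-state-metrics", namespace=~"kube-.*|openshift-.*", phase!~"Running|Succeeded"}) > 0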
Yes, limiting things to a list of namespaces is already on our radar: https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/56
I think we're going to have to do this before code freeze, as it produces too much noise.
For the time being this was fixed by https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/70; we may be able to improve it a bit more in the future, but for now it's not producing false positives.
We'll need to update the jsonnet dependency to get this.
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle stale
The referenced mixins have been long merged, hence closing out.
Seeing this on api.ci: a bunch of job pods are firing this alert, but that is not valid. Most job pods never go ready.
Pods whose restartPolicy is not Always shouldn't be part of KubePodNotReady.
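Since restartPolicy itself isn't exposed, one possible sketch would be to filter on the owning controller instead, using the kube_pod_owner metric from kube-state-metrics (assuming the deployed version exposes its owner_kind label; this is not necessarily the fix the mixin adopted):
sum by(namespace, pod) (kube_pod_status_phase{job="kube-state-metrics", phase!~"Running|Succeeded"} * on(namespace, pod) group_left(owner_kind) max by(namespace, pod, owner_kind) (kube_pod_owner{owner_kind!="Job"})) > 0
That sidesteps the missing restartPolicy metric by excluding Job-owned pods, which are the ones expected to run to completion.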