openshift / cluster-monitoring-operator

Manage the OpenShift monitoring stack

KubePodNotReady is being fired for pods with restartPolicy=Never #72

Closed · smarterclayton closed this issue 4 years ago

smarterclayton commented 6 years ago

Seeing this on api.ci: a bunch of job pods are firing this alert, but that is not valid. Most job pods never go ready.

Pods whose restartPolicy is not Always shouldn't be covered by KubePodNotReady.

jwforres commented 6 years ago

Can confirm - seeing this on some pods on free-int as well

s-urbaniak commented 6 years ago

I believe this would be tricky today, as restartPolicy is not exposed by kube-state-metrics (/cc @brancz for verification).

What status are those job pods in exactly, and what failure/exit mode is causing them to be alerted on? Looking at the alerting expression, job pods that exit with Succeeded should not fire any alerts:

sum by(namespace, pod) (kube_pod_status_phase{job="kube-state-metrics",phase!~"Running|Succeeded"}) > 0
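
To narrow down which pods are firing and why, a variant of the same query that also groups by phase (a sketch; kube_pod_status_phase exposes the phases Pending, Running, Succeeded, Failed, and Unknown) shows the offending phase directly:

sum by(namespace, pod, phase) (kube_pod_status_phase{job="kube-state-metrics",phase!~"Running|Succeeded"}) > 0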

Wouldn't it make more sense to modify those job pods to exit with a Succeeded status?

smarterclayton commented 6 years ago

Jobs failing doesn't mean the cluster is broken. Lots of pods run to completion and exit with status code 0 as part of normal operation.

In this case, these are CI jobs running on the cluster that use restart=Never pods. We'll always have some level of failing builds on a cluster as well (OpenShift builds run as pods).

smarterclayton commented 6 years ago

It's likely that we could constrain this alert to just the system namespaces (kube-*, openshift-*), in which case the check is probably fine for now.
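
A minimal sketch of that constraint, assuming the same kube-state-metrics job label as above, would just add a namespace matcher to the existing expression:

sum by(namespace, pod) (kube_pod_status_phase{job="kube-state-metrics",namespace=~"(kube|openshift)-.*",phase!~"Running|Succeeded"}) > 0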

brancz commented 6 years ago

Yes, limiting things to a list of namespaces is already on our radar: https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/56

I think we're going to have to do this before code freeze, as it produces too much noise.

brancz commented 6 years ago

For the time being this was fixed by https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/70. We may be able to improve this a bit more in the future, but for now it's not producing false positives.

We'll need to update the jsonnet dependency to get this.
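
For reference, the general shape of the upstream fix is to exclude Job-owned pods by joining the phase metric against kube_pod_owner. A sketch of that approach (not necessarily the exact expression that was merged) looks like:

sum by(namespace, pod) (
  kube_pod_status_phase{job="kube-state-metrics",phase=~"Pending|Unknown"}
  * on(namespace, pod) group_left(owner_kind)
  topk by(namespace, pod) (1, max by(namespace, pod, owner_kind) (kube_pod_owner{owner_kind!="Job"}))
) > 0

Job-owned pods then drop out of the join, while pods managed by other controllers (or standalone pods) keep alerting.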

openshift-bot commented 4 years ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

s-urbaniak commented 4 years ago

The referenced mixin changes have long been merged, hence closing this out.