okd-project / okd

The self-managing, auto-upgrading, Kubernetes distribution for everyone
https://okd.io

OKD went into a failed update status overnight without input, resulting in a master node being NotReady #1949

Open schmts opened 2 weeks ago

schmts commented 2 weeks ago

I've been having trouble with OKD since this morning. We're on 4.15.0-0.okd-2024-03-10-010116, and overnight OKD began to think it's updating. No one on our team had done anything since yesterday afternoon (when everything seemed fine), so I suspect something else went wrong.

In the Cluster Settings tab, it's reporting a failing update with the message: "Multiple errors are preventing progress: Cluster operator machine-config is not available Cluster operators authentication, etcd, kube-apiserver, kube-controller-manager, kube-scheduler, openshift-apiserver are degraded."
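For anyone following along, a quick way to see which operators are unhappy and why (standard oc commands, run with cluster-admin):

# List all cluster operators with their Available/Progressing/Degraded columns
oc get clusteroperators

# Dump the status conditions of one of the degraded operators named in the message
oc describe clusteroperator machine-config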

The "oc get mcp" is reporting the master machineconfigpool updating, but none degraded with 2 ready out of 3 machines. And when I look at the nodes themselves, one of the master nodes is with an "NotReady" status. I've also not managed to open a debug pod, to open a terminal or ssh into the node yet.

The config-policy-controller pod is in a CrashLoopBackOff state with logs reporting the following errors:

In the config-policy-controller container:

2024-06-13T10:40:46.225Z error controller-runtime.source source/source.go:143 if kind is a CRD, it should be installed before calling Start {"kind": "OperatorPolicy.policy.open-cluster-management.io", "error": "no matches for kind \"OperatorPolicy\" in version \"policy.open-cluster-management.io/v1beta1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1

2024-06-13T10:42:46.226Z error setup app/main.go:511 Problem running manager {"error": "failed to wait for operator-policy-controller caches to sync: timed out waiting for cache to be synced"} main.main.func5

And in the kube-rbac-proxy container:

I0613 10:22:07.400608 1 round_trippers.go:443] POST https://172.30.0.1:443/apis/authentication.k8s.io/v1/tokenreviews 201 Created in 9 milliseconds
I0613 10:22:07.404735 1 round_trippers.go:443] POST https://172.30.0.1:443/apis/authorization.k8s.io/v1/subjectaccessreviews 201 Created in 3 milliseconds
2024/06/13 10:22:07 http: proxy error: dial tcp 127.0.0.1:8383: connect: connection refused
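Since the first error suggests the OperatorPolicy CRD is missing, this is one way to check whether it is actually installed (the CRD name below is inferred from the kind and API group in the log, using the usual plural naming convention):

# Check whether the OperatorPolicy CRD is present on the cluster
oc get crd operatorpolicies.policy.open-cluster-management.io

# List which resources and versions the cluster serves for that API group
oc api-resources --api-group=policy.open-cluster-management.io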

I've also noticed that the machine-config-daemon pod on the affected node is unreachable via terminal. I deleted the pod in the hope that it would recover, but it's stuck in a Pending state right now.
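For completeness, the commands I used to look at the daemon pods (openshift-machine-config-operator is the standard namespace for the machine-config daemon; the pod name below is a placeholder):

# One machine-config-daemon pod per node; the NODE column shows which one belongs to the bad node
oc -n openshift-machine-config-operator get pods -o wide

# The Events section at the bottom usually explains why the recreated pod stays Pending
oc -n openshift-machine-config-operator describe pod <machine-config-daemon-pod>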

titou10titou10 commented 2 weeks ago

It seems your node is in bad shape... did you try simply rebooting the node?
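Something along these lines, assuming the API server will still let you drain it (the node name is a placeholder; if oc debug can't reach the node, reboot out-of-band via the BMC or hypervisor instead):

# Stop new workloads landing on the node and evacuate what can be moved
oc adm cordon <master-node-name>
oc adm drain <master-node-name> --ignore-daemonsets --delete-emptydir-data

# Reboot through a debug pod if SSH is unavailable
oc debug node/<master-node-name> -- chroot /host systemctl reboot

# Re-enable scheduling once the node reports Ready again
oc adm uncordon <master-node-name>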

schmts commented 2 weeks ago

Yep. A reboot helped. It might be that we're running out of resources and the node went awry.