schwesig closed 1 month ago
The issue was traced to pending CSRs (certificate signing requests) that prevented all nodes in the cluster from reaching the Ready state. Approving them allowed all nodes to transition to Ready, which restored the cluster.
For posterity: when a cluster is in this state (or a similar down state), you can attempt local recovery by SSHing to a controller as the core user, switching to root with sudo, and running:
$ export KUBECONFIG=/etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/node-kubeconfigs/localhost-recovery.kubeconfig
With that variable set, you can run various oc commands (e.g. oc get nodes, oc get csr -A, oc get co, oc get events -A, etc.).
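Since the fix here was approving pending CSRs, a minimal sketch of that step may be useful. It assumes KUBECONFIG is already exported as above and that every pending CSR is legitimate and safe to approve (review them with oc get csr first). The function names pending_csrs and approve_pending_csrs are made up for illustration; only oc get csr and oc adm certificate approve are real oc commands.

```shell
# Hypothetical helper: print the names of CSRs that have no status yet,
# i.e. requests that are still pending.
pending_csrs() {
  oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}'
}

# Hypothetical helper: approve each pending CSR, one at a time.
# Assumes you have already checked that the requests are expected.
approve_pending_csrs() {
  pending_csrs | while read -r name; do
    if [ -n "$name" ]; then
      oc adm certificate approve "$name"
    fi
  done
}
```

Run approve_pending_csrs, then check oc get csr and oc get nodes again. Nodes typically request a second (serving) certificate after the first (client) one is approved, so a second pass may be needed before they reach Ready.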
Discussion started in Slack: https://massopencloud.slack.com/archives/C027TDE52TZ/p1716992167955509