Test Cluster still down after maintenance

nerc-project / operations

Issues related to the operation of the NERC OpenShift environment

Test Cluster still down after maintenance #593

Closed schwesig closed 1 month ago

schwesig commented 1 month ago

discussion started in Slack https://massopencloud.slack.com/archives/C027TDE52TZ/p1716992167955509

jtriley commented 1 month ago

The issue was traced to pending CSRs (certificate signing requests) that prevented all nodes in the cluster from reaching the Ready state. Approving them allowed all nodes to transition to Ready, which restored the cluster.

jtriley commented 1 month ago

For posterity: when a cluster is in this state (or a similar down state), you can attempt local recovery by SSHing to a controller as the core user, switching to root with sudo, and running:

$ export KUBECONFIG=/etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/node-kubeconfigs/localhost-recovery.kubeconfig

With that variable set, you can run various oc commands (e.g. oc get nodes, oc get csr -A, oc get co, oc get events -A).
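As a rough sketch of the CSR fix described earlier in this thread, the pending requests can be listed and approved in bulk from that same root shell. The function names pending_csr_names and approve_pending_csrs are made up for illustration. The filter assumes CONDITION is the last column of the oc get csr output, and it approves every pending request without inspecting it, which is only reasonable if you trust everything in the queue.

```shell
# Print the names of CSRs whose CONDITION column (the last field) is "Pending".
# Reads `oc get csr --no-headers` output on stdin.
# Assumption: CONDITION is the final column of the table output.
pending_csr_names() {
  awk '$NF == "Pending" { print $1 }'
}

# Approve every pending CSR (hypothetical helper, not part of oc).
# Assumes KUBECONFIG already points at the localhost-recovery kubeconfig.
# xargs -r skips the approve call entirely when nothing is pending.
approve_pending_csrs() {
  oc get csr --no-headers | pending_csr_names |
    xargs -r oc adm certificate approve
}
```

After defining both functions, run approve_pending_csrs and then check oc get csr again. Nodes may submit a second round of requests once the first is approved, so it can take more than one pass before oc get nodes shows everything Ready.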