As port 6443 is up and you're able to SSH into the control-plane nodes, I would use that route to find out more about the cluster's state. You can use the node kubeconfig files in /etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/node-kubeconfigs/ to connect to the API.
For example, when connected to one of the control-plane nodes:
export KUBECONFIG="/etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/node-kubeconfigs/localhost-recovery.kubeconfig"
oc get clusterversion,co,nodes,csr
Based on the output you can determine which part of the cluster needs your attention.
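For example, to quickly surface any degraded cluster operators, a filter like the one below works (a minimal sketch; it assumes the default column layout of oc get co, where a healthy operator reads Available=True, Progressing=False, Degraded=False):
# List cluster operators whose AVAILABLE/PROGRESSING/DEGRADED columns
# deviate from the healthy "True False False" pattern.
oc get co --no-headers | grep -vE 'True +False +False'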
Hi melledouwsma,
Thanks for your prompt reply.
Below is the output after running the commands above:
NAME                                                      AGE   SIGNERNAME                                    REQUESTOR                                                                    REQUESTEDDURATION   CONDITION
certificatesigningrequest.certificates.k8s.io/csr-57g7j   14m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>              Pending
certificatesigningrequest.certificates.k8s.io/csr-9tsrs   14m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>              Pending
certificatesigningrequest.certificates.k8s.io/csr-tstwl   14m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>              Pending
I had previously changed the network adapter settings in an attempt to move to another subnet range, but I had hoped the VM snapshot would have reverted all of this. Is there any way to further debug and fix this?
Thank you
This cluster has Pending internal certificates, which could be due to restoring the cluster from backup snapshots. To resolve this issue, follow these steps to approve the pending certificates:
export KUBECONFIG="/etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/node-kubeconfigs/localhost-recovery.kubeconfig"
oc get csr | grep Pending | awk '{print $1}' | xargs oc adm certificate approve
oc get csr
When new Pending certificates appear, repeat the previous step to approve them. This process can take a couple of minutes. After a few minutes, when all certificates are issued and no new pending certificates appear, check the external API endpoint.
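If new Pending certificates keep appearing, a small loop saves re-running the approve command by hand (a minimal sketch, assuming the recovery kubeconfig is still exported as above; since CSRs can arrive in waves, the loop may need to be run more than once):
# Approve any Pending CSRs every 30 seconds until none remain.
while oc get csr --no-headers | grep -q Pending; do
  oc get csr | grep Pending | awk '{print $1}' | xargs oc adm certificate approve
  sleep 30
done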
Hi Melle, my cluster is revived and it works! Thanks so much for your instantaneous support on my case!
Describe the bug
I have a 3-node OKD FCOS 4.15 airgapped cluster with an on-prem Quay. On cluster restart, port 6443 is green on HAProxy, but machine-config on 22623 stays red and the console doesn't come up. I can't query anything with oc from my bastion server, but I can SSH into each master. May I know how to check the boot-up Ignition errors of each individual master node to see why it's failing? I've queried coreos-ignition-write-issues.service, but it doesn't show significant errors.
On this note, I'd like to check: besides communicating with the Quay repo, are the master.ign and worker.ign files on the Apache httpd server important for cluster startup? Although my master.ign is accessible, I thought there's a CSR expiry of 24 hours, so the ignition files would no longer be needed after cluster initialization. Is my theory correct?
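For context on the first question, I assume the way to inspect this is with journal queries over SSH on each master, along the lines of the sketch below (only standard journalctl flags; I'm not certain these are the right units to look at on FCOS):
# Messages from the previous boot, filtered for Ignition activity.
journalctl -b -1 --no-pager | grep -i ignition
# Kubelet logs for the current boot, which often show why a node
# cannot reach the machine-config server on port 22623.
journalctl -b -u kubelet --no-pager | tail -n 100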
Cluster environment
OKD Cluster Version: 4.15.0-0.okd-2024-03-10-010116
Kernel version: v1.28.2-3598+6e2789bbd58938-dirty
Installation method: Bare-metal UPI (airgapped, self-hosted Quay)