okd-project / okd

The self-managing, auto-upgrading, Kubernetes distribution for everyone
https://okd.io
Apache License 2.0

[OKD FCOS 4.15] OKD master unable to start up, boot logs location to see errors with oc #1995

Closed parseltongued closed 3 months ago

parseltongued commented 3 months ago

Describe the bug
I have a 3-node OKD FCOS 4.15 airgapped cluster with an on-prem Quay registry. On cluster restart, port 6443 shows green on HAProxy, but machine-config port 22623 stays red and the console does not come up. I can't query anything with oc from my bastion server, but I can SSH into each master. How can I check the boot-up ignition errors on each individual master node to see why it's failing? I've queried coreos-ignition-write-issues.service but it doesn't show any significant errors. (screenshot attached)
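
For reference, a minimal sketch of how one might inspect boot-time ignition and kubelet errors directly on each master over SSH (the unit names are the standard FCOS/OKD ones; treat this as a sketch, not an exact recipe):

journalctl -b 0 -u coreos-ignition-write-issues.service    # ignition write issues from the current boot
journalctl -b 0 -u kubelet --no-pager | tail -n 100         # recent kubelet startup errors
journalctl -b 0 -p err --no-pager | tail -n 100             # all error-level messages from the current boot
sudo crictl ps -a                                           # did the static pods (etcd, kube-apiserver) start?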

On this note, I would also like to check: besides communicating with the Quay repo, are the master.ign and worker.ign files served by the Apache httpd server important for cluster startup? Although my master.ign is still accessible, I thought there is a 24-hour CSR expiry and the ignition files would no longer be needed after cluster initialization. Is my theory correct?

Cluster environment
OKD Cluster Version: 4.15.0-0.okd-2024-03-10-010116
Kubernetes version: v1.28.2-3598+6e2789bbd58938-dirty
Installation method: Bare-metal UPI (airgapped, self-hosted Quay)

melledouwsma commented 3 months ago

Since port 6443 is up and you're able to SSH into the control plane nodes, I would use that route to gather information about the cluster. You can use the node kubeconfig files in /etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/node-kubeconfigs/ to connect to the API.

For example, when connected to one of the control-plane nodes:

export KUBECONFIG="/etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/node-kubeconfigs/localhost-recovery.kubeconfig"
oc get clusterversion,co,nodes,csr

Based on the results you can find out what part of the cluster needs your attention.
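
As a hedged illustration of that triage (the operator name below is just an example), once the recovery kubeconfig is exported you might drill down like this:

oc get clusteroperators                       # look for operators that are not Available or are Degraded
oc describe clusteroperator machine-config    # example: inspect one operator's conditions and messages
oc get nodes -o wide                          # confirm node readiness and addresses
oc get csr | grep -i pending                  # any certificate signing requests waiting for approval?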

parseltongued commented 3 months ago

Hi melledouwsma,

Thanks for your prompt reply.

Below is the output after running the commands above:

certificatesigningrequest.certificates.k8s.io/csr-57g7j   14m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>   Pending
certificatesigningrequest.certificates.k8s.io/csr-9tsrs   14m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>   Pending
certificatesigningrequest.certificates.k8s.io/csr-tstwl   14m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>   Pending

I had previously changed the network adapter settings in an attempt to move to another subnet range, but I hoped the VM snapshot would have reverted all of that. Is there any way to further debug and fix this?
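
For what it's worth, one way to double-check on each node that the snapshot really reverted the network settings might be (purely illustrative; interface and profile names will differ per environment):

ip -4 addr show                                    # confirm node addresses are back on the expected subnet
nmcli connection show                              # list NetworkManager connection profiles
sudo ls /etc/NetworkManager/system-connections/    # look for leftover profiles from the earlier subnet change
cat /etc/resolv.conf                               # confirm DNS still points at the expected servers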

Thank you

melledouwsma commented 3 months ago

This cluster has Pending internal certificates, which could be due to restoring the cluster from backup snapshots. To resolve this issue, follow these steps to approve the pending certificates:

export KUBECONFIG="/etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/node-kubeconfigs/localhost-recovery.kubeconfig"
oc get csr | grep Pending | awk '{print $1}' | xargs oc adm certificate approve
oc get csr

When new Pending certificates appear, repeat the previous step to approve them. This process can take a couple of minutes. Once all certificates are issued and no new pending certificates appear, check the external API endpoint.
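
A small sketch of automating that repeat-approval, assuming the same localhost-recovery kubeconfig is still exported (the loop and 30-second interval are just an example):

# Approve Pending CSRs in a loop until none are left at check time.
while oc get csr --no-headers | grep -q Pending; do
  oc get csr --no-headers | grep Pending | awk '{print $1}' | xargs oc adm certificate approve
  sleep 30
done
oc get csr    # everything should end up Approved,Issued

The loop may need a second run, since the serving-certificate CSRs often only show up after the first batch of client CSRs has been approved.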

parseltongued commented 3 months ago

Hi Melle, my cluster revived and it works! Thanks so much for your quick support on my case!