stormshift / support

This repo should serve as a central source for reporting issues with stormshift

OCP4 NotReady after COE Lab re-build #78

Closed DanielFroehlich closed 2 years ago

DanielFroehlich commented 2 years ago

The OCP4 cluster is inoperable after the rebuild: login does not work, among other things. It looks like we have lost quorum because too many nodes are NotReady:

[root@ocp4bastion ~]# oc get nodes
NAME                                           STATUS     ROLES    AGE     VERSION
compute-0.ocp4.stormshift.coe.muc.redhat.com   Ready      worker   2y86d   v1.21.6+4b61f94
compute-1.ocp4.stormshift.coe.muc.redhat.com   NotReady   worker   2y86d   v1.21.6+4b61f94
compute-2.ocp4.stormshift.coe.muc.redhat.com   Ready      worker   550d    v1.21.6+4b61f94
control-0.ocp4.stormshift.coe.muc.redhat.com   Ready      master   2y86d   v1.21.6+4b61f94
control-1.ocp4.stormshift.coe.muc.redhat.com   NotReady   master   2y86d   v1.21.6+4b61f94
control-2.ocp4.stormshift.coe.muc.redhat.com   NotReady   master   2y86d   v1.21.6+4b61f94
gpu.ocp4.stormshift.coe.muc.redhat.com         Ready      worker   157d    v1.21.6+4b61f94
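The quorum suspicion can be checked roughly against the listing above: etcd needs a majority of its members, so with three control plane nodes at least two must be up. A minimal sketch, assuming node readiness is a usable proxy for etcd member health (it is not the same thing) and using a hypothetical helper name `check_quorum`:

```shell
#!/usr/bin/env bash
# Count Ready vs. total control plane nodes in `oc get nodes` output and
# compare against the majority that an etcd cluster of that size needs.
# Assumption: node readiness is only a rough proxy for etcd member health.
check_quorum() {
  awk '
    # Column 2 is STATUS, column 3 is ROLES; the header line does not match.
    $3 ~ /master/ {
      total++
      if ($2 ~ /^Ready/) ready++   # also counts "Ready,SchedulingDisabled"
    }
    END {
      need = int(total / 2) + 1
      printf "control plane: %d/%d Ready, majority needs %d -> %s\n",
             ready, total, need, (ready >= need ? "OK" : "AT RISK")
    }'
}

# Usage on the bastion (requires a working kubeconfig):
#   oc get nodes | check_quorum
```

For the listing above, only control-0 counts as Ready, so the check reports one of three control plane nodes against a required majority of two.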

I suspect a certificate issue caused by the long time offline. I approved the CSRs in Pending state, but that did not help. We need to investigate and probably recover from expired certificates. We have had this issue before; please search here for the resolution.

github-actions[bot] commented 2 years ago

Heads up @cluster/ocp4-admin - the "cluster/ocp4" label was applied to this issue.

DanielFroehlich commented 2 years ago

@Javatar81 would you be able to take a look?

DanielFroehlich commented 2 years ago

Please check if it is the same as https://github.com/stormshift/support/issues/46

DanielFroehlich commented 2 years ago

control-1 (NotReady):

[core@control-1 ~]$  systemctl status kubelet
Apr 13 09:38:33 control-1.ocp4.stormshift.coe.muc.redhat.com hyperkube[126009]: I0413 09:38:33.812969  126009 csi_plugin.go:1031] Failed to contact API server when waiting for CSINode publishing: csinodes.storage.k8s.io "control-1.ocp4.stormshift.coe.muc.redhat.com" is forbidden: User "system:anonymous" cannot get resource "csinodes" in API g>
Apr 13 09:38:33 control-1.ocp4.stormshift.coe.muc.redhat.com hyperkube[126009]: E0413 09:38:33.907551  126009 kubelet.go:2303] "Error getting node" err="node \"control-1.ocp4.stormshift.coe.muc.redhat.com\" not found"

Javatar81 commented 2 years ago

Identified the problem: the kubelets are not working. Followed these docs

systemctl status kubelet reported: kubelet.go:2303] "Error getting node" err="node \"control-1.ocp4.stormshift.coe.muc.redhat.com\" not found"

Recovering as described in this issue: https://github.com/stormshift/support/issues/72

Approved CSRs and enabled scheduling
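For reference, a minimal sketch of what approving CSRs and re-enabling scheduling can look like with standard `oc` commands. The exact steps taken here are the ones in issue #72 and are not reproduced in this thread. The helper names `pending_csrs` and `approve_pending` are hypothetical, and the sketch assumes CONDITION is the last column of `oc get csr` output:

```shell
#!/usr/bin/env bash
# Print the names of CSRs whose CONDITION (assumed to be the last column
# of `oc get csr --no-headers`) is exactly "Pending".
pending_csrs() {
  awk '$NF == "Pending" { print $1 }'
}

# Approve every pending CSR. Needs cluster-admin access via a working kubeconfig.
# xargs --no-run-if-empty (GNU xargs) skips the approve call when nothing is pending.
approve_pending() {
  oc get csr --no-headers | pending_csrs \
    | xargs --no-run-if-empty oc adm certificate approve
}

# Usage (not run automatically):
#   approve_pending      # first round: kubelet client CSRs
#   sleep 30
#   approve_pending      # second round: kubelet serving CSRs usually follow
#   oc adm uncordon control-1.ocp4.stormshift.coe.muc.redhat.com
#   oc get nodes
```

Running the approval twice matters because a node typically submits its serving certificate request only after its client certificate has been issued, so a single pass can leave nodes NotReady.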

Javatar81 commented 2 years ago

All nodes are ready

DanielFroehlich commented 2 years ago

LGTM, THX