stormshift / support

This repo should serve as a central source for reporting issues with stormshift
GNU General Public License v3.0

OCP3 nodes NotReady after long outage #91

Closed DanielFroehlich closed 1 year ago

DanielFroehlich commented 2 years ago

control-0 and control-1 do not become Ready, probably due to certificate issues after the long downtime.

[root@ocp3support ~]# oc get nodes
NAME                                           STATUS     ROLES           AGE      VERSION
compute-0.ocp3.stormshift.coe.muc.redhat.com   Ready      worker          536d     v1.21.8+ee73ea2
compute-1.ocp3.stormshift.coe.muc.redhat.com   Ready      worker          536d     v1.21.8+ee73ea2
compute-2.ocp3.stormshift.coe.muc.redhat.com   Ready      worker          536d     v1.21.8+ee73ea2
control-0.ocp3.stormshift.coe.muc.redhat.com   NotReady   master,worker   2y160d   v1.21.8+ee73ea2
control-1.ocp3.stormshift.coe.muc.redhat.com   NotReady   master,worker   2y160d   v1.21.8+ee73ea2
control-2.ocp3.stormshift.coe.muc.redhat.com   Ready      master,worker   2y160d   v1.21.8+ee73ea2

Probably needs the cert recovery procedure applied. Please investigate.
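
For repeated checks, the affected nodes can be pulled out of the `oc get nodes` listing. A minimal sketch, assuming the default column layout shown above (NAME first, STATUS second); `not_ready_nodes` is a hypothetical helper name, not part of `oc`:

```shell
# Hypothetical helper: read `oc get nodes` output on stdin and print the
# names of nodes whose STATUS column does not start with "Ready".
# "Ready,SchedulingDisabled" counts as Ready here; "NotReady" does not.
not_ready_nodes() {
  awk 'NR > 1 && $2 !~ /^Ready/ { print $1 }'
}

# Usage (assumes a logged-in `oc` session):
#   oc get nodes | not_ready_nodes
```

Fed the listing above, this should print only the control-0 and control-1 entries.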

github-actions[bot] commented 2 years ago

Heads up @cluster/ocp3-admin - the "cluster/ocp3" label was applied to this issue.

DanielFroehlich commented 2 years ago

Restarted ocp3 today after the long infra downtime. I needed to approve some CSRs, then the above-mentioned problem appeared again. @ortwinschneider, would you mind taking a look, or asking @rbo for help? I am still guessing a cert issue is the root cause.

rbo commented 2 years ago

Does the problem still exist?

DanielFroehlich commented 2 years ago

Yes.

rbo commented 2 years ago

All nodes are NotReady.

[root@ocp3support ~]# export KUBECONFIG=/root/ocp4install/auth/kubeconfig
[root@ocp3support ~]# oc get csr  | awk '/Pending/ {print $1}' | xargs oc adm certificate approve
[root@ocp3support ~]# oc get nodes
NAME                                           STATUS     ROLES           AGE      VERSION
compute-0.ocp3.stormshift.coe.muc.redhat.com   Ready      worker          598d     v1.21.8+ee73ea2
compute-1.ocp3.stormshift.coe.muc.redhat.com   Ready      worker          598d     v1.21.8+ee73ea2
compute-2.ocp3.stormshift.coe.muc.redhat.com   Ready      worker          598d     v1.21.8+ee73ea2
control-0.ocp3.stormshift.coe.muc.redhat.com   NotReady   master,worker   2y222d   v1.21.8+ee73ea2
control-1.ocp3.stormshift.coe.muc.redhat.com   NotReady   master,worker   2y222d   v1.21.8+ee73ea2
control-2.ocp3.stormshift.coe.muc.redhat.com   Ready      master,worker   2y222d   v1.21.8+ee73ea2
[root@ocp3support ~]# 

Better but not perfect.
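
The approve step above can be made a little more defensive. A minimal sketch, assuming CONDITION is the last column of `oc get csr` output; `pending_csrs` is a hypothetical helper name, not part of `oc`:

```shell
# Hypothetical helper: read `oc get csr` output on stdin and print the
# names of CSRs whose last column (CONDITION) is exactly "Pending".
# Matching the column avoids hits on "Pending" elsewhere in the line.
pending_csrs() {
  awk 'NR > 1 && $NF == "Pending" { print $1 }'
}

# Usage (assumes a logged-in `oc` and GNU xargs for --no-run-if-empty,
# which skips the approve call when nothing is pending):
#   oc get csr | pending_csrs | xargs --no-run-if-empty oc adm certificate approve
```

After a long outage the approval may need to be repeated, since further CSRs can appear once the first batch has been approved.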

rbo commented 2 years ago

control-0, fixed with https://github.com/stormshift/support/issues/72#issuecomment-1067121213

rbo commented 2 years ago

control-1 is not reachable via SSH.

rbo commented 2 years ago

control-1: (image)

Let's force reboot.

rbo commented 2 years ago

ah now it reboots... without any activity

rbo commented 2 years ago

Stuck at the same point; looks like a reboot loop. Let's power it off and on again in RHEV.

rbo commented 2 years ago

Strange, I don't know. I suggest reinstalling control-1 and following: https://docs.openshift.com/container-platform/4.8/backup_and_restore/control_plane_backup_and_restore/replacing-unhealthy-etcd-member.html

rbo commented 2 years ago

Who would like to do this job? :-)

DanielFroehlich commented 1 year ago

ocp3 has been decommissioned. Rest in peace!