Closed: DanielFroehlich closed this 2 years ago
Heads up @cluster/ocp3-admin - the "cluster/ocp3" label was applied to this issue.
$ ssh root@stormshiftdeploy.coe.muc.redhat.com
ssh: connect to host stormshiftdeploy.coe.muc.redhat.com port 22: No route to host
:-(
oc client version is too old, so I updated the oc client to the latest stable 4.8 version:
[root@ocp3support ~]# oc version
Client Version: openshift-clients-4.2.2-201910250432-4-g4ac90784
Server Version: 4.8.24
Kubernetes Version: v1.21.6+c180a7c
[root@ocp3support ~]# type oc
oc is hashed (/root/bin/oc)
[root@ocp3support ~]# cd bin/
[root@ocp3support bin]# ls
kubectl oc openshift-install
[root@ocp3support bin]# curl -L -O https://mirror.openshift.com/pub/openshift-v4/clients/ocp/stable-4.8/openshift-client-linux.tar.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 47.4M  100 47.4M    0     0  50.9M      0 --:--:-- --:--:-- --:--:-- 50.9M
[root@ocp3support bin]# tar xzvf openshift-client-linux.tar.gz oc kubectl
oc
kubectl
[root@ocp3support bin]# oc version
Client Version: 4.8.33
Server Version: 4.8.24
Kubernetes Version: v1.21.6+c180a7c
Approve all pending certificates: oc get csr | awk '/Pending/ {print $1}' | xargs oc adm certificate approve
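Kubelet certificates are usually renewed in two waves (client certs first, then serving certs), so the approval typically has to be repeated; a small helper loop, not part of the original session:
# Re-run approval until nothing is left Pending (CSRs tend to arrive in waves):
while oc get csr | grep -q Pending; do
    oc get csr | awk '/Pending/ {print $1}' | xargs oc adm certificate approve
    sleep 10
done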
Looks much better:
[root@ocp3support bin]# oc get nodes
NAME STATUS ROLES AGE VERSION
compute-0.ocp3.stormshift.coe.muc.redhat.com Ready worker 430d v1.21.6+c180a7c
compute-1.ocp3.stormshift.coe.muc.redhat.com Ready worker 430d v1.21.6+c180a7c
compute-2.ocp3.stormshift.coe.muc.redhat.com Ready worker 430d v1.21.6+c180a7c
control-0.ocp3.stormshift.coe.muc.redhat.com Ready master,worker 2y54d v1.21.6+c180a7c
control-1.ocp3.stormshift.coe.muc.redhat.com NotReady master,worker 2y54d v1.21.6+c180a7c
control-2.ocp3.stormshift.coe.muc.redhat.com Ready master,worker 2y54d v1.21.6+c180a7c
[root@ocp3support bin]#
Shut down control-1.ocp3.stormshift.coe.muc.redhat.com and enabled the RHV console.
Console looks good:
Node is still NotReady.
I don't have ssh access because stormshiftdeploy is still not available.
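Even without ssh, the RHV console gets you a shell on the node; the usual first checks on a NotReady node are the kubelet state and its recent logs (a sketch, this output was not captured):
# On the NotReady node (control-1), via console or ssh:
systemctl status kubelet
journalctl -u kubelet -b --no-pager | tail -n 50
# Expired certificates typically show up here as x509 "certificate has expired" errors.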
For ssh access:
dfroehli@dfroehli-mac21 ~ % ssh ocp3bastion.stormshift.coe.muc.redhat.com
Last login: Mon Mar 14 17:47:47 2022 from 10.39.194.46
[root@ocp3bastion ~]# ssh core@172.16.10.11
Red Hat Enterprise Linux CoreOS 48.84.202112022303-0
[core@control-1 ~]$
For kubeconfig access:
% ssh ocp3bastion.stormshift.coe.muc.redhat.com
Last login: Mon Mar 14 17:49:50 2022 from 10.39.194.46
[root@ocp3bastion ~]# ssh root@ocp3support.stormshift.coe.muc.redhat.com
Last login: Mon Mar 14 17:13:44 2022 from 172.16.10.1
[root@ocp3support ~]# export KUBECONFIG=/root/ocp4install/auth/kubeconfig
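A quick sanity check that this kubeconfig actually reaches the API (not part of the captured output):
oc whoami
oc get nodes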
The stormshift deploy VM has moved from COE RHV to stormshift RHV, was also affected by the HW issue last week and was thus down. It's now online again after I started the VM:
% ssh root@stormshiftdeploy.coe.muc.redhat.com
Last login: Mon Mar 7 15:46:21 2022 from 10.39.194.156
[root@stormshiftdeploy ~]#
Let's recover kubelet:
# Created temp admin kubeconfig
export KUBECONFIG=/tmp/kubeconfig
kubectl config set-cluster localhost --insecure-skip-tls-verify=true --server=https://localhost:6443
cd /etc/kubernetes/static-pod-resources/kube-apiserver-pod-*/secrets/localhost-recovery-client-token
kubectl config set-credentials localhost --token=$(cat token)
kubectl config set-context localhost --cluster=localhost --user=localhost
kubectl config use-context localhost
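# The context above talks directly to the kube-apiserver on localhost:6443 of a
# healthy control node, authenticated with the localhost-recovery-client token,
# so it works even when the normal admin kubeconfig / API load balancer path is unusable.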
# Create recovery kubeconfig for control-2
recover-kubeconfig.sh > /tmp/recovery-kubeconfig
oc get configmap kube-apiserver-to-kubelet-client-ca -n openshift-kube-apiserver-operator --template='{{ index .data "ca-bundle.crt" }}' > /tmp/rec-etc-kubernetes-ca.crt
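# recovery-kubeconfig is the kubeconfig the kubelet on the broken node will use to
# reach the API, and the extracted ca-bundle.crt will replace its /etc/kubernetes/ca.crt,
# per the procedure in the issue linked below.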
Transfer /tmp/recovery-kubeconfig and /tmp/rec-etc-kubernetes-ca.crt to control-2
systemctl stop kubelet
cp /tmp/recovery-kubeconfig /etc/kubernetes/kubeconfig
cp /tmp/rec-etc-kubernetes-ca.crt /etc/kubernetes/ca.crt
touch /run/machine-config-daemon-force
rm -rf /var/lib/kubelet/pki /var/lib/kubelet/kubeconfig
systemctl start kubelet
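# Back on the support host: with /var/lib/kubelet/pki wiped, the restarted kubelet
# bootstraps fresh certificates, so new Pending CSRs should appear shortly and need
# approving (hedged expectation, matching the approval step below):
oc get csr --sort-by=.metadata.creationTimestamp | tail -n 5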
Basically followed: https://github.com/stormshift/support/issues/46#issuecomment-951242834
Approved CSRs of control-2
[root@ocp3support ~]# oc get nodes
NAME STATUS ROLES AGE VERSION
compute-0.ocp3.stormshift.coe.muc.redhat.com Ready worker 430d v1.21.6+c180a7c
compute-1.ocp3.stormshift.coe.muc.redhat.com Ready worker 430d v1.21.6+c180a7c
compute-2.ocp3.stormshift.coe.muc.redhat.com Ready worker 430d v1.21.6+c180a7c
control-0.ocp3.stormshift.coe.muc.redhat.com Ready master,worker 2y54d v1.21.6+c180a7c
control-1.ocp3.stormshift.coe.muc.redhat.com Ready master,worker 2y54d v1.21.6+c180a7c
control-2.ocp3.stormshift.coe.muc.redhat.com Ready master,worker 2y54d v1.21.6+c180a7c
[root@ocp3support ~]#
LGTM THX
OCP3 was (partially) down for a couple of days due to HW issues. After a clean reboot, nodes compute-1, control-1 and control-2 are not ready:
Seems the certs expired, I can see new CSRs:
But approving the CSRs fails:
That's the end of my troubleshooting skills. I can only speculate that, with two of three control nodes not ready, the cluster has no quorum. I think there are docs on how to recover from this, but I can't find them atm.
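If it really is a quorum problem, one way to confirm is to ask etcd directly from one of the still-running etcd pods; a sketch, assuming the API on a healthy control node still answers and the usual openshift-etcd pod labels (the pod name below is a placeholder):
oc -n openshift-etcd get pods -l app=etcd -o wide
oc -n openshift-etcd rsh <running-etcd-pod>
# then, inside the pod:
etcdctl endpoint health --cluster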