DanielFroehlich opened 1 month ago
We had one IPv6-related change last night: https://redhat.service-now.com/help?id=rh_ticket&table=sc_req_item&sys_id=0be58f5b9773061cd658b82bf253af6c
But with ticket #188 I disabled IPv6 on all nodes at the primary interface.
This morning at ~8:30 CEST I rebooted inf4 via the BMC, because ping was not working. Ping to api.isar.coe.muc.redhat.com was also not possible. It looks like the API VIP was stuck on inf4 and did not fail over to inf5 or inf6.
FYI: keepalived (API VIP & Ingress VIP) should only fail over between control-plane nodes. We disabled keepalived on all worker nodes via a MachineConfig.
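For reference, a MachineConfig that disables keepalived on workers could look roughly like the sketch below. This is an assumption, not the actual manifest from ticket #188: the file path of the keepalived static-pod manifest and the oneshot-unit approach are guesses based on stock on-prem OpenShift layouts.

```yaml
# Hedged sketch only - the real MachineConfig from #188 is not shown in this
# issue. One common pattern: a worker-role MachineConfig whose oneshot systemd
# unit removes the keepalived static-pod manifest (path is an assumption).
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-disable-keepalived
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.2.0
    systemd:
      units:
        - name: disable-keepalived.service
          enabled: true
          contents: |
            [Unit]
            Description=Remove keepalived static pod manifest (workers only)
            [Service]
            Type=oneshot
            ExecStart=/usr/bin/rm -f /etc/kubernetes/manifests/keepalived.yaml
            [Install]
            WantedBy=multi-user.target
```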
After the reboot of inf4, all nodes were ready and looked good.
Lots of NotReady nodes:
% oc get nodes
NAME STATUS ROLES AGE VERSION
ceph10 NotReady storage-node,worker 12d v1.29.6+aba1e8d
ceph11 NotReady storage-node,worker 12d v1.29.6+aba1e8d
ceph12 Ready storage-node,worker 12d v1.29.6+aba1e8d
inf4 Ready control-plane,master 273d v1.29.6+aba1e8d
inf44 Ready storage-node,worker 273d v1.29.6+aba1e8d
inf5 NotReady control-plane,master 273d v1.29.6+aba1e8d
inf6 Ready control-plane,master 273d v1.29.6+aba1e8d
inf7 NotReady storage-node,worker 273d v1.29.6+aba1e8d
inf8 Ready storage-node,worker 273d v1.29.6+aba1e8d
ucs-blade-server-1 Ready worker 13d v1.29.6+aba1e8d
ucs-blade-server-3 NotReady worker 13d v1.29.6+aba1e8d
ucs56 NotReady worker 195d v1.29.6+aba1e8d
ucs57 NotReady worker 194d v1.29.6+aba1e8d
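To pull just the unhealthy node names out of output like the above, a simple awk filter on the STATUS column works. The `oc` command is commented; the filter itself is demonstrated on a captured sample of the listing so it runs standalone:

```shell
# On the cluster: oc get nodes --no-headers | awk '$2 == "NotReady" {print $1}'
# Same filter on a sample of the output above:
sample='ceph10 NotReady storage-node,worker 12d v1.29.6+aba1e8d
ceph12 Ready storage-node,worker 12d v1.29.6+aba1e8d
ucs56 NotReady worker 195d v1.29.6+aba1e8d'
echo "$sample" | awk '$2 == "NotReady" {print $1}'
# prints: ceph10 and ucs56, one per line
```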
Ping from laptop fails:
% ping ucs56.coe.muc.redhat.com
PING ucs56.coe.muc.redhat.com (10.32.96.56): 56 data bytes
Request timeout for icmp_seq 0
Request timeout for icmp_seq 1
Ping from storm3 (also in the COE Lab) also fails:
[root@storm3 ~]# ping ucs56.coe.muc.redhat.com
PING ucs56.coe.muc.redhat.com (10.32.96.56) 56(84) bytes of data.
From storm3.coe.muc.redhat.com (10.32.105.3) icmp_seq=1 Destination Host Unreachable
From storm3.coe.muc.redhat.com (10.32.105.3) icmp_seq=2 Destination Host Unreachable
From storm3.coe.muc.redhat.com (10.32.105.3) icmp_seq=3 Destination Host Unreachable
Robert states on Slack:
We had one IPv6-related network change last night.
But I disabled IPv6 on all ISAR nodes.
https://redhat.service-now.com/help?id=rh_ticket&table=sc_req_item&sys_id=0be58f5b9773061cd658b82bf253af6c
But I can ping the "Ready" nodes, so I assume it's not an IPv6 issue for the moment. inf5 and inf6 (control-plane nodes) are not reacting to ping, so it seems we have lost quorum. Trying to reboot those two nodes... The inf5 console in iDRAC looks good and the system seems to be online, but I can't log in: no network and no user/password. Trying a graceful shutdown and restart... inf5 reacts to pings again after the restart.
Much better now: oc get nodes is responsive and it looks like we have quorum back, but there are still lots of NotReady nodes. --> rebooting inf6, too!
Okay, the control plane is fully back. Now rebooting ucs56 and ucs57, which are both not responding to pings either.
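One way to confirm the control plane really has quorum back is to ask etcd directly. The `oc rsh` invocations below are a sketch (the pod name `etcd-inf4` follows the stock `etcd-<node>` naming, which is an assumption here); the quorum arithmetic explains why losing inf5 and inf6 at once took the cluster down:

```shell
# On the cluster (pod name is an assumption; pick any running etcd pod):
#   oc -n openshift-etcd rsh etcd-inf4 etcdctl endpoint health --cluster
#   oc -n openshift-etcd rsh etcd-inf4 etcdctl member list -w table
# etcd needs a majority of members. With 3 control-plane nodes that is 2,
# so with only inf4 up (inf5 + inf6 down) there was no quorum:
members=3
quorum=$(( members / 2 + 1 ))
echo "members=$members quorum=$quorum"   # prints: members=3 quorum=2
```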
ucs56 does not shut down gracefully; the console shows cephfs and libvirt/qemu messages. Doing a hard restart. --> node Ready again. Still, the console is unresponsive; rebooting inf7 (storage node) now, which also does not respond to pings. --> inf7's iDRAC is unresponsive, too????
Rebooting ucs57, too, to get more compute available.
It's getting better now, but SSO is still down. It looks like there is only one instance, which is scheduled on inf7, which is still NotReady (even its iDRAC is not responding).
oc get pods -o=wide -n rhbk-operator
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
coe-sso-0 1/1 Terminating 0 12d 10.128.16.37 inf7 <none> <none>
pq-for-rhbk-1 1/1 Running 1 12d 10.130.0.46 inf5 <none> <none>
pq-for-rhbk-3 1/1 Running 1 12d 10.128.0.45 inf4 <none> <none>
pq-for-rhbk-4 1/1 Running 1 12d 10.130.0.38 inf5 <none> <none>
rhbk-operator-58bc8c554f-cjvws 1/1 Terminating 0 12d 10.128.16.29 inf7 <none> <none>
rhbk-operator-58bc8c554f-dfccv 1/1 Running 0 130m 10.130.8.30 ucs57 <none> <none>
--> restarting the pods stuck in Terminating:
oc delete pods coe-sso-0 rhbk-operator-58bc8c554f-cjvws
--> that does not help; need to apply more force to get them out of the way:
oc delete --force pod/coe-sso-0
Et voilà, login via SSO is working again!
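The force delete works because the kubelet on inf7 is down and can never confirm termination, so the pod object stays in Terminating forever; `--force` removes it from the API server without waiting. A sketch for finding every pod stuck like this (cluster commands commented; the filter is demonstrated on a sample of the listing above):

```shell
# On the cluster:
#   oc get pods -A -o wide | grep Terminating
#   oc delete pod <name> -n <namespace> --force --grace-period=0
# Filter demonstrated on sample "oc get pods" output from above:
sample='coe-sso-0 1/1 Terminating 0 12d 10.128.16.37 inf7
pq-for-rhbk-1 1/1 Running 1 12d 10.130.0.46 inf5'
echo "$sample" | awk '$3 == "Terminating" {print $1}'
# prints: coe-sso-0
```

Caveat: force-deleting only removes the API object; if the node ever comes back, its kubelet cleans up the leftover containers.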
Most of the services are restored now. I am leaving the cluster in this state (inf7 and ucs-blade-server-3 NotReady) to allow for a post mortem and root cause analysis when @rbo is back.
Checked ceph11's kubelet log:
dial tcp: lookup api-int.isar.coe.muc.redhat.com on 10.32.96.1:53: no such host
DNS?
Added an api-int.isar record on 10.32.98.1, like the existing api.isar record.
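The DNS server software is not stated in this issue; if it is dnsmasq-style, the missing record would mirror the existing api entry and point at the same API VIP. `<API-VIP>` below is a placeholder, not a real address:

```
# Hedged sketch of the fix (dnsmasq syntax assumed; <API-VIP> is a placeholder):
address=/api.isar.coe.muc.redhat.com/<API-VIP>
address=/api-int.isar.coe.muc.redhat.com/<API-VIP>
# Verify afterwards against the resolver the kubelet complained about:
#   dig +short api-int.isar.coe.muc.redhat.com @10.32.96.1
```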
oc get nodes
NAME STATUS ROLES AGE VERSION
ceph10 Ready storage-node,worker 13d v1.29.6+aba1e8d
ceph11 Ready storage-node,worker 13d v1.29.6+aba1e8d
ceph12 Ready storage-node,worker 13d v1.29.6+aba1e8d
inf4 Ready control-plane,master 274d v1.29.6+aba1e8d
inf44 Ready storage-node,worker 274d v1.29.6+aba1e8d
inf5 Ready control-plane,master 274d v1.29.6+aba1e8d
inf6 Ready control-plane,master 274d v1.29.6+aba1e8d
inf7 NotReady storage-node,worker 274d v1.29.6+aba1e8d
inf8 Ready storage-node,worker 274d v1.29.6+aba1e8d
ucs-blade-server-1 Ready worker 14d v1.29.6+aba1e8d
ucs-blade-server-3 NotReady worker 14d v1.29.6+aba1e8d
ucs56 Ready worker 195d v1.29.6+aba1e8d
ucs57 Ready worker 195d v1.29.6+aba1e8d
ceph10, for example, is flapping (Ready / NotReady).
ceph10 was flapping... a reboot fixed the problem.
With #188 I disabled IPv6 in all network configurations but did not reboot! That means the configuration is applied to the main interface but not synced to the OVN-Kubernetes-created br-ex bridge; that only happens after a reboot, because the main interface is part of br-ex.
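This mismatch can be confirmed on a node via the kernel's standard per-interface IPv6 switch under /proc. A sketch; the physical NIC name `ens1f0` is an assumption, and the loop prints a fallback when an interface does not exist on the machine it runs on:

```shell
# Compare the kernel IPv6 switch on the physical NIC vs the br-ex bridge that
# OVN-Kubernetes builds on top of it. NIC name ens1f0 is an assumption.
# A mismatch (NIC=1 disabled, br-ex=0 enabled) means the change awaits a reboot.
for ifc in ens1f0 br-ex; do
  f=/proc/sys/net/ipv6/conf/$ifc/disable_ipv6
  if [ -r "$f" ]; then
    echo "$ifc disable_ipv6=$(cat "$f")"
  else
    echo "$ifc: no such interface here"
  fi
done
```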
Reboot of ceph11/ceph12 is in progress.
ceph11/12 are fixed and not flapping anymore.
Rebooting ucs-blade-server-1/3 as well (via BMC).
oc get nodes
NAME STATUS ROLES AGE VERSION
ceph10 Ready,SchedulingDisabled storage-node,worker 13d v1.29.6+aba1e8d
ceph11 Ready,SchedulingDisabled storage-node,worker 13d v1.29.6+aba1e8d
ceph12 Ready,SchedulingDisabled storage-node,worker 13d v1.29.6+aba1e8d
inf4 Ready control-plane,master 274d v1.29.6+aba1e8d
inf44 Ready storage-node,worker 274d v1.29.6+aba1e8d
inf5 Ready control-plane,master 274d v1.29.6+aba1e8d
inf6 Ready control-plane,master 274d v1.29.6+aba1e8d
inf7 NotReady storage-node,worker 274d v1.29.6+aba1e8d
inf8 Ready storage-node,worker 274d v1.29.6+aba1e8d
ucs-blade-server-1 NotReady worker 14d v1.29.6+aba1e8d
ucs-blade-server-3 NotReady worker 14d v1.29.6+aba1e8d
ucs56 Ready worker 196d v1.29.6+aba1e8d
ucs57 Ready worker 195d v1.29.6+aba1e8d
The Isar cluster is not responding; login fails with auth errors.