DanielFroehlich opened 1 month ago
We had one IPv6-related change last night: https://redhat.service-now.com/help?id=rh_ticket&table=sc_req_item&sys_id=0be58f5b9773061cd658b82bf253af6c
But with ticket #188 I disabled IPv6 on all nodes at the primary interface.
This morning at ~8:30 CEST I rebooted inf4 via the BMC, because ping was not working. Ping to api.isar.coe.muc.redhat.com was also not possible. It looks like the API VIP was stuck on inf4 and did not fail over to inf5 or inf6.
FYI: keepalived (API VIP & Ingress VIP) should only fail over between control-plane nodes. We disabled keepalived on all worker nodes via a MachineConfig.
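For reference, a MachineConfig that disables keepalived on workers could look roughly like the sketch below. This is an assumption, not the actual manifest from ticket #188: the file path of the keepalived static-pod manifest and the oneshot-unit approach are guesses based on stock on-prem OpenShift layouts.

```yaml
# Hedged sketch only - the real MachineConfig from #188 is not shown in this
# issue. One common pattern: a worker-role MachineConfig whose oneshot systemd
# unit removes the keepalived static-pod manifest (path is an assumption).
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-disable-keepalived
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.2.0
    systemd:
      units:
        - name: disable-keepalived.service
          enabled: true
          contents: |
            [Unit]
            Description=Remove keepalived static pod manifest (workers only)
            [Service]
            Type=oneshot
            ExecStart=/usr/bin/rm -f /etc/kubernetes/manifests/keepalived.yaml
            [Install]
            WantedBy=multi-user.target
```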
After the reboot of inf4, all nodes were ready and looked good.
Lots of NotReady nodes:
% oc get nodes
NAME STATUS ROLES AGE VERSION
ceph10 NotReady storage-node,worker 12d v1.29.6+aba1e8d
ceph11 NotReady storage-node,worker 12d v1.29.6+aba1e8d
ceph12 Ready storage-node,worker 12d v1.29.6+aba1e8d
inf4 Ready control-plane,master 273d v1.29.6+aba1e8d
inf44 Ready storage-node,worker 273d v1.29.6+aba1e8d
inf5 NotReady control-plane,master 273d v1.29.6+aba1e8d
inf6 Ready control-plane,master 273d v1.29.6+aba1e8d
inf7 NotReady storage-node,worker 273d v1.29.6+aba1e8d
inf8 Ready storage-node,worker 273d v1.29.6+aba1e8d
ucs-blade-server-1 Ready worker 13d v1.29.6+aba1e8d
ucs-blade-server-3 NotReady worker 13d v1.29.6+aba1e8d
ucs56 NotReady worker 195d v1.29.6+aba1e8d
ucs57 NotReady worker 194d v1.29.6+aba1e8d
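To pull just the unhealthy node names out of output like the above, a simple awk filter on the STATUS column works. The `oc` command is commented; the filter itself is demonstrated on a captured sample of the listing so it runs standalone:

```shell
# On the cluster: oc get nodes --no-headers | awk '$2 == "NotReady" {print $1}'
# Same filter on a sample of the output above:
sample='ceph10 NotReady storage-node,worker 12d v1.29.6+aba1e8d
ceph12 Ready storage-node,worker 12d v1.29.6+aba1e8d
ucs56 NotReady worker 195d v1.29.6+aba1e8d'
echo "$sample" | awk '$2 == "NotReady" {print $1}'
# prints: ceph10 and ucs56, one per line
```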
Ping from laptop fails:
% ping ucs56.coe.muc.redhat.com
PING ucs56.coe.muc.redhat.com (10.32.96.56): 56 data bytes
Request timeout for icmp_seq 0
Request timeout for icmp_seq 1
Ping from storm3 (also in the COE Lab) also fails:
[root@storm3 ~]# ping ucs56.coe.muc.redhat.com
PING ucs56.coe.muc.redhat.com (10.32.96.56) 56(84) bytes of data.
From storm3.coe.muc.redhat.com (10.32.105.3) icmp_seq=1 Destination Host Unreachable
From storm3.coe.muc.redhat.com (10.32.105.3) icmp_seq=2 Destination Host Unreachable
From storm3.coe.muc.redhat.com (10.32.105.3) icmp_seq=3 Destination Host Unreachable
Robert states on Slack:
We had one IPv6-related network change last night.
But I disabled IPv6 on all ISAR nodes.
https://redhat.service-now.com/help?id=rh_ticket&table=sc_req_item&sys_id=0be58f5b9773061cd658b82bf253af6c
But I can ping the "Ready" nodes, so I assume it's not an IPv6 issue for the moment. inf5 and inf6 (control-plane nodes) are not reacting to ping, so it seems we have lost quorum. Trying to reboot those two nodes... The inf5 console in iDRAC looks good and the system seems to be online, but I can't log in: no network and no user/password. Trying a graceful shutdown and restart... inf5 reacts to pings again after the restart.
Much better now: oc get nodes is responsive and it looks like we have quorum back, but there are still lots of NotReady nodes. --> rebooting inf6, too!
Okay, the control plane is fully back. Now rebooting ucs56 and ucs57, which are both not responding to pings either.
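One way to confirm the control plane really has quorum back is to ask etcd directly. The `oc rsh` invocations below are a sketch (the pod name `etcd-inf4` follows the stock `etcd-<node>` naming, which is an assumption here); the quorum arithmetic explains why losing inf5 and inf6 at once took the cluster down:

```shell
# On the cluster (pod name is an assumption; pick any running etcd pod):
#   oc -n openshift-etcd rsh etcd-inf4 etcdctl endpoint health --cluster
#   oc -n openshift-etcd rsh etcd-inf4 etcdctl member list -w table
# etcd needs a majority of members. With 3 control-plane nodes that is 2,
# so with only inf4 up (inf5 + inf6 down) there was no quorum:
members=3
quorum=$(( members / 2 + 1 ))
echo "members=$members quorum=$quorum"   # prints: members=3 quorum=2
```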
ucs56 does not shut down gracefully; the console shows cephfs and libvirt/qemu messages. Doing a hard restart. --> node Ready again. Still, the console is unresponsive; rebooting inf7 (storage node) now, which also does not respond to pings. --> inf7's iDRAC is unresponsive, too????
Rebooting ucs57, too, to get more compute available.
It's getting better now, but SSO is still down. It looks like there is only one instance, which is scheduled on inf7, which is still NotReady (even its iDRAC is not responding).
oc get pods -o=wide -n rhbk-operator
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
coe-sso-0 1/1 Terminating 0 12d 10.128.16.37 inf7 <none> <none>
pq-for-rhbk-1 1/1 Running 1 12d 10.130.0.46 inf5 <none> <none>
pq-for-rhbk-3 1/1 Running 1 12d 10.128.0.45 inf4 <none> <none>
pq-for-rhbk-4 1/1 Running 1 12d 10.130.0.38 inf5 <none> <none>
rhbk-operator-58bc8c554f-cjvws 1/1 Terminating 0 12d 10.128.16.29 inf7 <none> <none>
rhbk-operator-58bc8c554f-dfccv 1/1 Running 0 130m 10.130.8.30 ucs57 <none> <none>
--> restarting the pods stuck in Terminating:
oc delete pods coe-sso-0 rhbk-operator-58bc8c554f-cjvws
--> that does not help; need to apply more force to get them out of the way:
oc delete --force pod/coe-sso-0
Et voilà, login via SSO is working again!
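The force delete works because the kubelet on inf7 is down and can never confirm termination, so the pod object stays in Terminating forever; `--force` removes it from the API server without waiting. A sketch for finding every pod stuck like this (cluster commands commented; the filter is demonstrated on a sample of the listing above):

```shell
# On the cluster:
#   oc get pods -A -o wide | grep Terminating
#   oc delete pod <name> -n <namespace> --force --grace-period=0
# Filter demonstrated on sample "oc get pods" output from above:
sample='coe-sso-0 1/1 Terminating 0 12d 10.128.16.37 inf7
pq-for-rhbk-1 1/1 Running 1 12d 10.130.0.46 inf5'
echo "$sample" | awk '$3 == "Terminating" {print $1}'
# prints: coe-sso-0
```

Caveat: force-deleting only removes the API object; if the node ever comes back, its kubelet cleans up the leftover containers.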
Most of the services are restored now. I am leaving the cluster in this state (inf7 and ucs-blade-server-3 NotReady) to allow for a post mortem and root cause analysis when @rbo is back.
Checked ceph11's kubelet log:
dial tcp: lookup api-int.isar.coe.muc.redhat.com on 10.32.96.1:53: no such host
DNS?
Added an api-int.isar record on 10.32.98.1, like the existing api.isar record.
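The DNS server software is not stated in this issue; if it is dnsmasq-style, the missing record would mirror the existing api entry and point at the same API VIP. `<API-VIP>` below is a placeholder, not a real address:

```
# Hedged sketch of the fix (dnsmasq syntax assumed; <API-VIP> is a placeholder):
address=/api.isar.coe.muc.redhat.com/<API-VIP>
address=/api-int.isar.coe.muc.redhat.com/<API-VIP>
# Verify afterwards against the resolver the kubelet complained about:
#   dig +short api-int.isar.coe.muc.redhat.com @10.32.96.1
```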
oc get nodes
NAME STATUS ROLES AGE VERSION
ceph10 Ready storage-node,worker 13d v1.29.6+aba1e8d
ceph11 Ready storage-node,worker 13d v1.29.6+aba1e8d
ceph12 Ready storage-node,worker 13d v1.29.6+aba1e8d
inf4 Ready control-plane,master 274d v1.29.6+aba1e8d
inf44 Ready storage-node,worker 274d v1.29.6+aba1e8d
inf5 Ready control-plane,master 274d v1.29.6+aba1e8d
inf6 Ready control-plane,master 274d v1.29.6+aba1e8d
inf7 NotReady storage-node,worker 274d v1.29.6+aba1e8d
inf8 Ready storage-node,worker 274d v1.29.6+aba1e8d
ucs-blade-server-1 Ready worker 14d v1.29.6+aba1e8d
ucs-blade-server-3 NotReady worker 14d v1.29.6+aba1e8d
ucs56 Ready worker 195d v1.29.6+aba1e8d
ucs57 Ready worker 195d v1.29.6+aba1e8d
ceph10, for example, is flapping (Ready / NotReady).
ceph10 was flapping... a reboot fixed the problem.
With #188 I disabled IPv6 in all network configurations but did not reboot! That means the configuration is applied to the main interface but not synced to the OVN-Kubernetes-created br-ex bridge; that only happens after a reboot, because the main interface is part of br-ex.
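This mismatch can be confirmed on a node via the kernel's standard per-interface IPv6 switch under /proc. A sketch; the physical NIC name `ens1f0` is an assumption, and the loop prints a fallback when an interface does not exist on the machine it runs on:

```shell
# Compare the kernel IPv6 switch on the physical NIC vs the br-ex bridge that
# OVN-Kubernetes builds on top of it. NIC name ens1f0 is an assumption.
# A mismatch (NIC=1 disabled, br-ex=0 enabled) means the change awaits a reboot.
for ifc in ens1f0 br-ex; do
  f=/proc/sys/net/ipv6/conf/$ifc/disable_ipv6
  if [ -r "$f" ]; then
    echo "$ifc disable_ipv6=$(cat "$f")"
  else
    echo "$ifc: no such interface here"
  fi
done
```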
Reboot of ceph11/ceph12 is in progress.
ceph11/12 are fixed and not flapping anymore.
Rebooting ucs-blade-server-1/3 as well (via BMC).
oc get nodes
NAME STATUS ROLES AGE VERSION
ceph10 Ready,SchedulingDisabled storage-node,worker 13d v1.29.6+aba1e8d
ceph11 Ready,SchedulingDisabled storage-node,worker 13d v1.29.6+aba1e8d
ceph12 Ready,SchedulingDisabled storage-node,worker 13d v1.29.6+aba1e8d
inf4 Ready control-plane,master 274d v1.29.6+aba1e8d
inf44 Ready storage-node,worker 274d v1.29.6+aba1e8d
inf5 Ready control-plane,master 274d v1.29.6+aba1e8d
inf6 Ready control-plane,master 274d v1.29.6+aba1e8d
inf7 NotReady storage-node,worker 274d v1.29.6+aba1e8d
inf8 Ready storage-node,worker 274d v1.29.6+aba1e8d
ucs-blade-server-1 NotReady worker 14d v1.29.6+aba1e8d
ucs-blade-server-3 NotReady worker 14d v1.29.6+aba1e8d
ucs56 Ready worker 196d v1.29.6+aba1e8d
ucs57 Ready worker 195d v1.29.6+aba1e8d
The Isar cluster is not responding; login fails with auth errors.