Closed: kusumachalasani closed this issue 1 week ago.
I suspect the issue is with the network, as I couldn't see logs for the pods running on the wrk-5 node:
[abharath@abharath-thinkpadt14sgen2i ~]$ oc logs -f nvidia-mig-manager-22t4c -n nvidia-gpu-operator
Defaulted container "nvidia-mig-manager" out of: nvidia-mig-manager, toolkit-validation (init)
Error from server: Get "https://192.168.50.98:10250/containerLogs/nvidia-gpu-operator/nvidia-mig-manager-22t4c/nvidia-mig-manager?follow=true": dial tcp 192.168.50.98:10250: connect: no route to host
/CC @tssala23 @dystewart
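For reference, one way to confirm whether wrk-5's kubelet endpoint is reachable at all is to probe port 10250 from a debug shell on a healthy node (a sketch, not what was run here; curl being available on the RHCOS host is an assumption, and the kubelet will reject an unauthenticated request, so only whether the connection succeeds matters):
# from a healthy node, test the TCP path to wrk-5's kubelet; "no route to host" points at the network, an HTTP 401/403 means the path is fine
oc debug node/wrk-4 -- chroot /host curl -k --connect-timeout 5 https://192.168.50.98:10250/healthz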
Node Affected: wrk-5 on the test-2 Kruize cluster.
Issue: coredns pod on wrk-5 is in CrashLoopBackOff, causing DNS failures.
Current Status: coredns-wrk-5 shows 1/2 readiness, with 2155 restarts within 19 hours (see the checks sketched below).
Network problems on wrk-5 prevent pulling container images and connecting to prometheus-k8s for metrics.
Prometheus connectivity error shows a Connection timed out on prometheus-k8s.openshift-monitoring.svc.cluster.local:9091.
Prior Issues: GPU allocation issues were noted on wrk-5 before the DNS problem.
Last Successful Run: The node last worked successfully before a recent restart (earlier today).
Possible Trigger: GPU allocation issues were reported first, followed by a node restart, after which network and DNS issues began affecting image pulls and internal connectivity.
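The status above can be reproduced with a couple of standard commands (a minimal sketch; the coredns-wrk-5 pod lives in the openshift-kni-infra namespace, as the errors further down show):
# list all pods scheduled on wrk-5 with their restart counts
oc get pods -A -o wide --field-selector spec.nodeName=wrk-5
# inspect the failing CoreDNS static pod
oc describe pod coredns-wrk-5 -n openshift-kni-infra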
2024-11-07T11:38:15.000Z [Normal] Pulling image "image-registry.openshift-image-registry.svc:5000/redhat-ods-applications/minimal-gpu:2024.1"
2024-11-07T11:38:02.000Z [Warning] Error: ImagePullBackOff
2024-11-07T11:38:02.000Z [Normal] Back-off pulling image "image-registry.openshift-image-registry.svc:5000/redhat-ods-applications/minimal-gpu:2024.1"
2024-11-07T11:37:48.000Z [Warning] Error: ErrImagePull
2024-11-07T11:37:48.000Z [Warning] Failed to pull image "image-registry.openshift-image-registry.svc:5000/redhat-ods-applications/minimal-gpu:2024.1": rpc error: code = DeadlineExceeded desc = pinging container registry image-registry.openshift-image-registry.svc:5000: Get "https://image-registry.openshift-image-registry.svc:5000/v2/": dial tcp 172.30.21.204:5000: i/o timeout
2024-11-07T11:36:47.000Z [Normal] Started container oauth-proxy
2024-11-07T11:36:47.000Z [Normal] Created container oauth-proxy
2024-11-07T11:36:46.000Z [Normal] Successfully pulled image "registry.redhat.io/openshift4/ose-oauth-proxy@sha256:4bef31eb993feb6f1096b51b4876c65a6fb1f4401fee97fa4f4542b6b7c9bc46" in 16.491s (16.491s including waiting)
2024-11-07T11:36:29.000Z [Normal] Pulling image "registry.redhat.io/openshift4/ose-oauth-proxy@sha256:4bef31eb993feb6f1096b51b4876c65a6fb1f4401fee97fa4f4542b6b7c9bc46"
2024-11-07T11:35:29.000Z [Normal] Add eth0 [10.128.4.86/23] from ovn-kubernetes
2024-11-07T11:34:38.000Z [Normal] AttachVolume.Attach succeeded for volume "pvc-abea051c-53a6-43e8-8792-b8330bc9ea6d"
2024-11-07T11:34:37.806Z [Normal] Successfully assigned rhods-notebooks/jupyter-nb-schwesig-0 to wrk-5
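Since the pull failure above is an i/o timeout to the internal registry's service IP, one way to test that path from the node itself is to hit the registry's /v2/ endpoint directly (a sketch; it assumes a debug shell on wrk-5 can be obtained and reuses the 172.30.21.204 address from the error above):
# from wrk-5's host namespace, try the registry endpoint by service IP
oc debug node/wrk-5 -- chroot /host curl -k --connect-timeout 5 https://172.30.21.204:5000/v2/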
❯ oc debug node/wrk-5
Starting pod/wrk-5-debug ...
To use host binaries, run `chroot /host`
Pod IP: 192.168.50.98
If you don't see a command prompt, try pressing enter.
Removing debug pod ...
Error from server: error dialing backend: dial tcp 192.168.50.98:10250: connect: no route to host
❯ oc logs coredns-wrk-5
Defaulted container "coredns" out of: coredns, coredns-monitor, render-config-coredns (init)
Error from server: Get "https://192.168.50.98:10250/containerLogs/openshift-kni-infra/coredns-wrk-5/coredns": dial tcp 192.168.50.98:10250: connect: no route to host
❯ oc events -w
LAST SEEN TYPE REASON OBJECT MESSAGE
84s (x12799 over 7d19h) Warning ProbeError Pod/coredns-wrk-5 Liveness probe error: Get "http://192.168.50.98:18080/health": dial tcp 192.168.50.98:18080: connect: no route to host
body:
6m42s (x31518 over 7d19h) Warning BackOff Pod/coredns-wrk-5 Back-off restarting failed container coredns in pod coredns-wrk-5_openshift-kni-infra(3377e52796270e7a373902b6aa0d1e78)
5m24s (x2233 over 7d19h) Warning Unhealthy Pod/keepalived-wrk-5 Liveness probe failed:
155m (x14 over 7d19h) Warning Unhealthy Pod/keepalived-wrk-5 Liveness probe failed: command timed out
1s (x31548 over 7d19h) Warning BackOff Pod/coredns-wrk-5 Back-off restarting failed container coredns in pod coredns-wrk-5_openshift-kni-infra(3377e52796270e7a373902b6aa0d1e78)
0s Normal Pulled Pod/wrk-5-debug Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9d6201c776053346ebce8f90c34797a7a7c05898008e17f3ba9673f5f14507b0" already present on machine
0s Normal Created Pod/wrk-5-debug Created container container-00
0s Normal Started Pod/wrk-5-debug Started container container-00
0s Normal Killing Pod/wrk-5-debug Stopping container container-00
0s (x2234 over 7d19h) Warning Unhealthy Pod/keepalived-wrk-5 Liveness probe failed:
0s (x2235 over 7d19h) Warning Unhealthy Pod/keepalived-wrk-5 Liveness probe failed:
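The event stream can be narrowed to just the failing pod instead of watching everything (a sketch):
# events for coredns-wrk-5 only, oldest first
oc get events -n openshift-kni-infra --field-selector involvedObject.name=coredns-wrk-5 --sort-by=.lastTimestamp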
Trying to ping from wrk-4 is not possible either:
sh-5.1# arping 192.168.50.98
ARPING 192.168.50.98 from 192.168.50.149 br-ex
^CSent 4 probes (4 broadcast(s))
Received 0 response(s)
sh-5.1# arping 192.168.50.98
ARPING 192.168.50.98 from 192.168.50.149 br-ex
^CSent 4 probes (4 broadcast(s))
Received 0 response(s)
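The failed arping suggests nothing on the L2 segment is answering for that address; the neighbor cache on wrk-4 shows the same thing (a sketch):
# an empty result or a FAILED/INCOMPLETE entry means ARP resolution for 192.168.50.98 is not working
ip neigh show to 192.168.50.98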
sh-5.1# ssh -v core@192.168.50.98
OpenSSH_8.7p1, OpenSSL 3.0.7 1 Nov 2022
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Reading configuration data /etc/ssh/ssh_config.d/50-redhat.conf
debug1: Reading configuration data /etc/crypto-policies/back-ends/openssh.config
debug1: configuration requests final Match pass
debug1: re-parsing configuration
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Reading configuration data /etc/ssh/ssh_config.d/50-redhat.conf
debug1: Reading configuration data /etc/crypto-policies/back-ends/openssh.config
debug1: Connecting to 192.168.50.98 [192.168.50.98] port 22.
debug1: connect to address 192.168.50.98 port 22: No route to host
ssh: connect to host 192.168.50.98 port 22: No route to host
sh-5.1# ip route
default via 192.168.50.1 dev br-ex proto dhcp src 192.168.50.149 metric 48
default via 10.0.120.1 dev eno2 proto dhcp src 10.0.123.127 metric 102
default via 10.85.0.10 dev vlan2076 proto dhcp src 10.85.2.117 metric 400
10.0.120.0/22 dev eno2 proto kernel scope link src 10.0.123.127 metric 102
10.30.9.0/24 via 10.85.0.1 dev vlan2076 proto dhcp src 10.85.2.117 metric 400
10.85.0.0/22 dev vlan2076 proto kernel scope link src 10.85.2.117 metric 400
10.128.0.0/14 via 10.129.2.1 dev ovn-k8s-mp0
10.129.2.0/23 dev ovn-k8s-mp0 proto kernel scope link src 10.129.2.2
10.247.236.0/25 via 10.0.120.1 dev eno2 proto dhcp src 10.0.123.127 metric 102
10.255.116.0/23 via 10.0.120.1 dev eno2 proto dhcp src 10.0.123.127 metric 102
140.247.236.0/25 via 10.0.120.1 dev eno2 proto dhcp src 10.0.123.127 metric 102
169.254.169.0/29 dev br-ex proto kernel scope link src 169.254.169.2
169.254.169.1 dev br-ex src 192.168.50.149
169.254.169.3 via 10.129.2.1 dev ovn-k8s-mp0
169.254.169.254 via 192.168.50.11 dev br-ex proto dhcp src 192.168.50.149 metric 48
169.254.169.254 via 10.0.121.2 dev eno2 proto dhcp src 10.0.123.127 metric 102
172.30.0.0/16 via 169.254.169.4 dev br-ex src 169.254.169.2 mtu 1400
192.168.50.0/24 dev br-ex proto kernel scope link src 192.168.50.149 metric 48
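The routing table shows 192.168.50.0/24 as directly connected on br-ex, so the "no route to host" errors come from failed ARP resolution rather than a missing route. The kernel's own choice can be checked with (a sketch):
# which interface and source address would be used to reach the target
ip route get 192.168.50.98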
❯ oc debug node/ctl-1
Starting pod/ctl-1-debug ...
To use host binaries, run `chroot /host`
Pod IP: 192.168.50.114
If you don't see a command prompt, try pressing enter.
sh-5.1# chroot /host
sh-5.1# ping 192.168.50.98
PING 192.168.50.98 (192.168.50.98) 56(84) bytes of data.
From 192.168.50.114 icmp_seq=1 Destination Host Unreachable
From 192.168.50.114 icmp_seq=2 Destination Host Unreachable
From 192.168.50.114 icmp_seq=3 Destination Host Unreachable
From 192.168.50.114 icmp_seq=4 Destination Host Unreachable
From 192.168.50.114 icmp_seq=5 Destination Host Unreachable
From 192.168.50.114 icmp_seq=6 Destination Host Unreachable
From 192.168.50.114 icmp_seq=7 Destination Host Unreachable
From 192.168.50.114 icmp_seq=8 Destination Host Unreachable
From 192.168.50.114 icmp_seq=12 Destination Host Unreachable
From 192.168.50.114 icmp_seq=15 Destination Host Unreachable
From 192.168.50.114 icmp_seq=16 Destination Host Unreachable
From 192.168.50.114 icmp_seq=18 Destination Host Unreachable
From 192.168.50.114 icmp_seq=21 Destination Host Unreachable
From 192.168.50.114 icmp_seq=22 Destination Host Unreachable
From 192.168.50.114 icmp_seq=24 Destination Host Unreachable
From 192.168.50.114 icmp_seq=25 Destination Host Unreachable
From 192.168.50.114 icmp_seq=26 Destination Host Unreachable
From 192.168.50.114 icmp_seq=27 Destination Host Unreachable
From 192.168.50.114 icmp_seq=28 Destination Host Unreachable
From 192.168.50.114 icmp_seq=29 Destination Host Unreachable
From 192.168.50.114 icmp_seq=30 Destination Host Unreachable
From 192.168.50.114 icmp_seq=31 Destination Host Unreachable
^C
--- 192.168.50.98 ping statistics ---
32 packets transmitted, 0 received, +22 errors, 100% packet loss, time 31749ms
pipe 4
wrk-5 reports 192.168.50.98, but its real address is 192.168.50.93:
oc describe node/wrk-5 | grep .93
k8s.ovn.org/host-cidrs: ["10.0.120.39/22","10.85.3.145/22","192.168.50.93/24"]
{"default":{"mode":"local","interface-id":"br-ex_wrk-5","mac-address":"08:8f:c3:a6:03:8e","ip-addresses":["192.168.50.93/24"],"ip-address"...
k8s.ovn.org/node-primary-ifaddr: {"ipv4":"192.168.50.93/24"}
System UUID: 0d3eb5fe-aba0-11ee-baa4-0a8fc3a60393
vs
❯ oc get node -o wide wrk-5
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
wrk-5 Ready worker 8d v1.29.5+29c95f3 192.168.50.98 <none> Red Hat Enterprise Linux CoreOS 416.94.202406172220-0 5.14.0-427.22.1.el9_4.x86_64 cri-o://1.29.5-5.rhaos4.16.git7032128.el9
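A direct way to compare the two views is to print the kubelet-registered InternalIP next to the OVN annotation (a sketch):
# InternalIP as registered by the kubelet (shows 192.168.50.98 here)
oc get node wrk-5 -o jsonpath='{.status.addresses[?(@.type=="InternalIP")].address}{"\n"}'
# primary interface address as seen by OVN (shows 192.168.50.93)
oc describe node wrk-5 | grep node-primary-ifaddr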
Just tagging for easier finding. /CC @jtriley @larsks @tzumainn
I'm actually not sure where the 192.168.50.98 came from in the first place. The Neutron port associated with MOC-R8PAC23U39 has 192.168.50.93 as its IP address; it was created on October 29th and hasn't been updated since, so I don't think it's been modified. And I don't see any port in the inventory that has the 192.168.50.98 IP. So I'm guessing that configuration came from outside of ESI?
In any case - I'm not that familiar with OpenShift configuration, but is it possible to just update the worker IP? Failing that, I could update the IP address of the port in ESI (and maybe reboot the machine). Let me know!
@schwesig I rebooted wrk-5 this morning, which caused it to re-register with the cluster. This resulted in a small number of pending certificate signing requests:
$ k get csr
NAME AGE SIGNERNAME REQUESTOR REQUESTEDDURATION CONDITION
csr-9zbkw 45m kubernetes.io/kube-apiserver-client system:multus:ctl-0 24h Approved,Issued
csr-btxhn 72m kubernetes.io/kube-apiserver-client system:multus:wrk-3 24h Approved,Issued
csr-dhxvc 102m kubernetes.io/kube-apiserver-client system:ovn-node:wrk-1 24h Approved,Issued
csr-l6l2g 40m kubernetes.io/kube-apiserver-client system:ovn-node:ctl-2 24h Approved,Issued
csr-q6jxs 8s kubernetes.io/kubelet-serving system:node:wrk-5 <none> Pending
csr-rp8wg 15m kubernetes.io/kubelet-serving system:node:wrk-5 <none> Pending
csr-tplw4 16m kubernetes.io/kube-apiserver-client system:multus:wrk-0 24h Approved,Issued
csr-v592h 30m kubernetes.io/kubelet-serving system:node:wrk-5 <none> Pending
After approving these requests (oc adm certificate approve ...), the node seems healthy and I am able to successfully schedule pods on the node.
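For reference, pending requests like these can be approved in one pass with the usual OpenShift one-liner for CSRs that have no status yet (a sketch, equivalent to approving each one by name):
oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs --no-run-if-empty oc adm certificate approve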
There appears to be some issue with PV access on wrk-5. While the pods start and volumes mount successfully, actually writing to those volumes seems to block indefinitely.
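A bounded write makes the hang visible without blocking a shell forever (a sketch; /mnt/data is a placeholder for whatever path the PVC is mounted at inside the pod):
# inside a pod with the PVC mounted: either completes quickly or gets killed after 30s
timeout 30 dd if=/dev/zero of=/mnt/data/write-test bs=1M count=10 oflag=direct && echo "write ok" || echo "write blocked or failed"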
Should we replace wrk-5 with a different server?
@hpdempsey I don't think we have a hardware problem, but if someone else wants to give that a shot they should feel free to have at it. Simply removing the node from the cluster and re-adding it should accomplish the same thing. Since this is a test environment, it seems like a good opportunity to figure out what's going on so that we understand better next time.
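For the record, taking a worker out of the cluster usually looks roughly like this (a sketch, not the exact commands used here; after the machine rejoins, its pending kubelet-serving CSRs have to be approved again, as above):
oc adm cordon wrk-5
oc adm drain wrk-5 --ignore-daemonsets --delete-emptydir-data --force
oc delete node wrk-5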
From my perspective it looks more like a networking issue.
The node is logging this every few seconds:
Nov 08 23:08:56 wrk-5 kernel: rbd: rbd0: encountered watch error: -107
Nov 08 23:10:59 wrk-5 kernel: rbd: rbd1: encountered watch error: -107
Nov 08 23:11:02 wrk-5 kernel: rbd: rbd0: encountered watch error: -107
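Those -107 errors are ENOTCONN, i.e. the kernel rbd client has lost its connection to the Ceph cluster. A quick way to see which mapped images are affected and what the kernel is logging (a sketch, assuming a debug shell on wrk-5 is possible at this point):
# rbd block devices currently mapped on the node
oc debug node/wrk-5 -- chroot /host lsblk | grep rbd
# recent kernel messages from the rbd client
oc debug node/wrk-5 -- chroot /host journalctl -k --since "1 hour ago" | grep rbd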
@larsks Reading through the thread, it seems like the easiest solution would be to remove the node from the cluster and re-add it. @schwesig, am I clear to do that now?
@tssala23, yes please, proceed.
@schwesig I have removed the node and added it back; you can check whether the problem is still happening.
@tssala23 thanks, will check
I was able to start a new notebook. Kruize is informed and will run their tests.
I can close this issue. The CoreDNS problem is solved and the node now uses the correct IP (192.168.50.93 instead of 192.168.50.98). The original error referenced the stale .98 address:
Error from server: Get "https://192.168.50.98:10250/containerLogs/openshift-kni-infra/coredns-wrk-5/coredns": dial tcp 192.168.50.98:10250: connect: no route to host
Description
An ImagePullBackOff error is observed on test-2-nerc.
Different applications were tried, but the same error is observed. The applications are trying to run on the wrk-5 node.