Closed: kusumachalasani closed this issue 1 week ago.
I suspect the issue is with the network, as I couldn't see logs for the pods running on the wrk-5 node:
[abharath@abharath-thinkpadt14sgen2i ~]$ oc logs -f nvidia-mig-manager-22t4c -n nvidia-gpu-operator
Defaulted container "nvidia-mig-manager" out of: nvidia-mig-manager, toolkit-validation (init)
Error from server: Get "https://192.168.50.98:10250/containerLogs/nvidia-gpu-operator/nvidia-mig-manager-22t4c/nvidia-mig-manager?follow=true": dial tcp 192.168.50.98:10250: connect: no route to host
/CC @tssala23 @dystewart
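For reference, one way to confirm whether wrk-5's kubelet endpoint is reachable at all is to probe port 10250 from a debug shell on a healthy node (a sketch, not what was run here; curl being available on the RHCOS host is an assumption, and the kubelet will reject an unauthenticated request, so only whether the connection succeeds matters):
# from a healthy node, test the TCP path to wrk-5's kubelet; "no route to host" points at the network, an HTTP 401/403 means the path is fine
oc debug node/wrk-4 -- chroot /host curl -k --connect-timeout 5 https://192.168.50.98:10250/healthz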
Node Affected: wrk-5 on the test-2 Kruize cluster.
Issue: coredns pod on wrk-5 is in CrashLoopBackOff, causing DNS failures.
Current Status: coredns-wrk-5 shows 1/2 readiness, with 2155 restarts within 19 hours (see the checks sketched below).
Network problems on wrk-5 prevent pulling container images and connecting to prometheus-k8s for metrics.
Prometheus connectivity error shows a Connection timed out on prometheus-k8s.openshift-monitoring.svc.cluster.local:9091.
Prior Issues: GPU allocation issues were noted on wrk-5 before the DNS problem.
Last Successful Run: The node last worked successfully before a recent restart (earlier today).
Possible Trigger: GPU allocation issues were reported first, followed by a node restart, after which network and DNS issues began affecting image pulls and internal connectivity.
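The status above can be reproduced with a couple of standard commands (a minimal sketch; the coredns-wrk-5 pod lives in the openshift-kni-infra namespace, as the errors further down show):
# list all pods scheduled on wrk-5 with their restart counts
oc get pods -A -o wide --field-selector spec.nodeName=wrk-5
# inspect the failing CoreDNS static pod
oc describe pod coredns-wrk-5 -n openshift-kni-infra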
2024-11-07T11:38:15.000Z [Normal] Pulling image "image-registry.openshift-image-registry.svc:5000/redhat-ods-applications/minimal-gpu:2024.1"
2024-11-07T11:38:02.000Z [Warning] Error: ImagePullBackOff
2024-11-07T11:38:02.000Z [Normal] Back-off pulling image "image-registry.openshift-image-registry.svc:5000/redhat-ods-applications/minimal-gpu:2024.1"
2024-11-07T11:37:48.000Z [Warning] Error: ErrImagePull
2024-11-07T11:37:48.000Z [Warning] Failed to pull image "image-registry.openshift-image-registry.svc:5000/redhat-ods-applications/minimal-gpu:2024.1": rpc error: code = DeadlineExceeded desc = pinging container registry image-registry.openshift-image-registry.svc:5000: Get "https://image-registry.openshift-image-registry.svc:5000/v2/": dial tcp 172.30.21.204:5000: i/o timeout
2024-11-07T11:36:47.000Z [Normal] Started container oauth-proxy
2024-11-07T11:36:47.000Z [Normal] Created container oauth-proxy
2024-11-07T11:36:46.000Z [Normal] Successfully pulled image "registry.redhat.io/openshift4/ose-oauth-proxy@sha256:4bef31eb993feb6f1096b51b4876c65a6fb1f4401fee97fa4f4542b6b7c9bc46" in 16.491s (16.491s including waiting)
2024-11-07T11:36:29.000Z [Normal] Pulling image "registry.redhat.io/openshift4/ose-oauth-proxy@sha256:4bef31eb993feb6f1096b51b4876c65a6fb1f4401fee97fa4f4542b6b7c9bc46"
2024-11-07T11:35:29.000Z [Normal] Add eth0 [10.128.4.86/23] from ovn-kubernetes
2024-11-07T11:34:38.000Z [Normal] AttachVolume.Attach succeeded for volume "pvc-abea051c-53a6-43e8-8792-b8330bc9ea6d"
2024-11-07T11:34:37.806Z [Normal] Successfully assigned rhods-notebooks/jupyter-nb-schwesig-0 to wrk-5
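Since the pull failure above is an i/o timeout to the internal registry's service IP, one way to test that path from the node itself is to hit the registry's /v2/ endpoint directly (a sketch; it assumes a debug shell on wrk-5 can be obtained and reuses the 172.30.21.204 address from the error above):
# from wrk-5's host namespace, try the registry endpoint by service IP
oc debug node/wrk-5 -- chroot /host curl -k --connect-timeout 5 https://172.30.21.204:5000/v2/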
❯ oc debug node/wrk-5
Starting pod/wrk-5-debug ...
To use host binaries, run `chroot /host`
Pod IP: 192.168.50.98
If you don't see a command prompt, try pressing enter.
Removing debug pod ...
Error from server: error dialing backend: dial tcp 192.168.50.98:10250: connect: no route to host
❯ oc logs coredns-wrk-5
Defaulted container "coredns" out of: coredns, coredns-monitor, render-config-coredns (init)
Error from server: Get "https://192.168.50.98:10250/containerLogs/openshift-kni-infra/coredns-wrk-5/coredns": dial tcp 192.168.50.98:10250: connect: no route to host
❯ oc events -w
LAST SEEN TYPE REASON OBJECT MESSAGE
84s (x12799 over 7d19h) Warning ProbeError Pod/coredns-wrk-5 Liveness probe error: Get "http://192.168.50.98:18080/health": dial tcp 192.168.50.98:18080: connect: no route to host
body:
6m42s (x31518 over 7d19h) Warning BackOff Pod/coredns-wrk-5 Back-off restarting failed container coredns in pod coredns-wrk-5_openshift-kni-infra(3377e52796270e7a373902b6aa0d1e78)
5m24s (x2233 over 7d19h) Warning Unhealthy Pod/keepalived-wrk-5 Liveness probe failed:
155m (x14 over 7d19h) Warning Unhealthy Pod/keepalived-wrk-5 Liveness probe failed: command timed out
1s (x31548 over 7d19h) Warning BackOff Pod/coredns-wrk-5 Back-off restarting failed container coredns in pod coredns-wrk-5_openshift-kni-infra(3377e52796270e7a373902b6aa0d1e78)
0s Normal Pulled Pod/wrk-5-debug Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9d6201c776053346ebce8f90c34797a7a7c05898008e17f3ba9673f5f14507b0" already present on machine
0s Normal Created Pod/wrk-5-debug Created container container-00
0s Normal Started Pod/wrk-5-debug Started container container-00
0s Normal Killing Pod/wrk-5-debug Stopping container container-00
0s (x2234 over 7d19h) Warning Unhealthy Pod/keepalived-wrk-5 Liveness probe failed:
0s (x2235 over 7d19h) Warning Unhealthy Pod/keepalived-wrk-5 Liveness probe failed:
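The event stream can be narrowed to just the failing pod instead of watching everything (a sketch):
# events for coredns-wrk-5 only, oldest first
oc get events -n openshift-kni-infra --field-selector involvedObject.name=coredns-wrk-5 --sort-by=.lastTimestamp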
Trying to ping from wrk-4 is not possible either:
sh-5.1# arping 192.168.50.98
ARPING 192.168.50.98 from 192.168.50.149 br-ex
^CSent 4 probes (4 broadcast(s))
Received 0 response(s)
sh-5.1# arping 192.168.50.98
ARPING 192.168.50.98 from 192.168.50.149 br-ex
^CSent 4 probes (4 broadcast(s))
Received 0 response(s)
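The failed arping suggests nothing on the L2 segment is answering for that address; the neighbor cache on wrk-4 shows the same thing (a sketch):
# an empty result or a FAILED/INCOMPLETE entry means ARP resolution for 192.168.50.98 is not working
ip neigh show to 192.168.50.98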
sh-5.1# ssh -v core@192.168.50.98
OpenSSH_8.7p1, OpenSSL 3.0.7 1 Nov 2022
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Reading configuration data /etc/ssh/ssh_config.d/50-redhat.conf
debug1: Reading configuration data /etc/crypto-policies/back-ends/openssh.config
debug1: configuration requests final Match pass
debug1: re-parsing configuration
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Reading configuration data /etc/ssh/ssh_config.d/50-redhat.conf
debug1: Reading configuration data /etc/crypto-policies/back-ends/openssh.config
debug1: Connecting to 192.168.50.98 [192.168.50.98] port 22.
debug1: connect to address 192.168.50.98 port 22: No route to host
ssh: connect to host 192.168.50.98 port 22: No route to host
sh-5.1# ip route
default via 192.168.50.1 dev br-ex proto dhcp src 192.168.50.149 metric 48
default via 10.0.120.1 dev eno2 proto dhcp src 10.0.123.127 metric 102
default via 10.85.0.10 dev vlan2076 proto dhcp src 10.85.2.117 metric 400
10.0.120.0/22 dev eno2 proto kernel scope link src 10.0.123.127 metric 102
10.30.9.0/24 via 10.85.0.1 dev vlan2076 proto dhcp src 10.85.2.117 metric 400
10.85.0.0/22 dev vlan2076 proto kernel scope link src 10.85.2.117 metric 400
10.128.0.0/14 via 10.129.2.1 dev ovn-k8s-mp0
10.129.2.0/23 dev ovn-k8s-mp0 proto kernel scope link src 10.129.2.2
10.247.236.0/25 via 10.0.120.1 dev eno2 proto dhcp src 10.0.123.127 metric 102
10.255.116.0/23 via 10.0.120.1 dev eno2 proto dhcp src 10.0.123.127 metric 102
140.247.236.0/25 via 10.0.120.1 dev eno2 proto dhcp src 10.0.123.127 metric 102
169.254.169.0/29 dev br-ex proto kernel scope link src 169.254.169.2
169.254.169.1 dev br-ex src 192.168.50.149
169.254.169.3 via 10.129.2.1 dev ovn-k8s-mp0
169.254.169.254 via 192.168.50.11 dev br-ex proto dhcp src 192.168.50.149 metric 48
169.254.169.254 via 10.0.121.2 dev eno2 proto dhcp src 10.0.123.127 metric 102
172.30.0.0/16 via 169.254.169.4 dev br-ex src 169.254.169.2 mtu 1400
192.168.50.0/24 dev br-ex proto kernel scope link src 192.168.50.149 metric 48
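The routing table shows 192.168.50.0/24 as directly connected on br-ex, so the "no route to host" errors come from failed ARP resolution rather than a missing route. The kernel's own choice can be checked with (a sketch):
# which interface and source address would be used to reach the target
ip route get 192.168.50.98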
❯ oc debug node/ctl-1
Starting pod/ctl-1-debug ...
To use host binaries, run `chroot /host`
Pod IP: 192.168.50.114
If you don't see a command prompt, try pressing enter.
sh-5.1# chroot /host
sh-5.1# ping 192.168.50.98
PING 192.168.50.98 (192.168.50.98) 56(84) bytes of data.
From 192.168.50.114 icmp_seq=1 Destination Host Unreachable
From 192.168.50.114 icmp_seq=2 Destination Host Unreachable
From 192.168.50.114 icmp_seq=3 Destination Host Unreachable
From 192.168.50.114 icmp_seq=4 Destination Host Unreachable
From 192.168.50.114 icmp_seq=5 Destination Host Unreachable
From 192.168.50.114 icmp_seq=6 Destination Host Unreachable
From 192.168.50.114 icmp_seq=7 Destination Host Unreachable
From 192.168.50.114 icmp_seq=8 Destination Host Unreachable
From 192.168.50.114 icmp_seq=12 Destination Host Unreachable
From 192.168.50.114 icmp_seq=15 Destination Host Unreachable
From 192.168.50.114 icmp_seq=16 Destination Host Unreachable
From 192.168.50.114 icmp_seq=18 Destination Host Unreachable
From 192.168.50.114 icmp_seq=21 Destination Host Unreachable
From 192.168.50.114 icmp_seq=22 Destination Host Unreachable
From 192.168.50.114 icmp_seq=24 Destination Host Unreachable
From 192.168.50.114 icmp_seq=25 Destination Host Unreachable
From 192.168.50.114 icmp_seq=26 Destination Host Unreachable
From 192.168.50.114 icmp_seq=27 Destination Host Unreachable
From 192.168.50.114 icmp_seq=28 Destination Host Unreachable
From 192.168.50.114 icmp_seq=29 Destination Host Unreachable
From 192.168.50.114 icmp_seq=30 Destination Host Unreachable
From 192.168.50.114 icmp_seq=31 Destination Host Unreachable
^C
--- 192.168.50.98 ping statistics ---
32 packets transmitted, 0 received, +22 errors, 100% packet loss, time 31749ms
pipe 4
wrk-5 reports 192.168.50.98, but its real address is 192.168.50.93:
oc describe node/wrk-5 | grep .93
k8s.ovn.org/host-cidrs: ["10.0.120.39/22","10.85.3.145/22","192.168.50.93/24"]
{"default":{"mode":"local","interface-id":"br-ex_wrk-5","mac-address":"08:8f:c3:a6:03:8e","ip-addresses":["192.168.50.93/24"],"ip-address"...
k8s.ovn.org/node-primary-ifaddr: {"ipv4":"192.168.50.93/24"}
System UUID: 0d3eb5fe-aba0-11ee-baa4-0a8fc3a60393
vs
❯ oc get node -o wide wrk-5
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
wrk-5 Ready worker 8d v1.29.5+29c95f3 192.168.50.98 <none> Red Hat Enterprise Linux CoreOS 416.94.202406172220-0 5.14.0-427.22.1.el9_4.x86_64 cri-o://1.29.5-5.rhaos4.16.git7032128.el9
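A direct way to compare the two views is to print the kubelet-registered InternalIP next to the OVN annotation (a sketch):
# InternalIP as registered by the kubelet (shows 192.168.50.98 here)
oc get node wrk-5 -o jsonpath='{.status.addresses[?(@.type=="InternalIP")].address}{"\n"}'
# primary interface address as seen by OVN (shows 192.168.50.93)
oc describe node wrk-5 | grep node-primary-ifaddr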
Just tagging for easier finding. /CC @jtriley @larsks @tzumainn
I'm actually not sure where the 192.168.50.98 came from in the first place. The Neutron port associated with MOC-R8PAC23U39 has 192.168.50.93 as its IP address; it was created on October 29th and hasn't been updated since, so I don't think it's been modified. And I don't see any port in the inventory that has the 192.168.50.98 IP. So I'm guessing that configuration came from outside of ESI?
In any case - I'm not that familiar with OpenShift configuration, but is it possible to just update the worker IP? Failing that, I could update the IP address of the port in ESI (and maybe reboot the machine). Let me know!
@schwesig I rebooted wrk-5 this morning, which caused it to re-register with the cluster. This resulted in a small number of pending certificate signing requests:
$ k get csr
NAME AGE SIGNERNAME REQUESTOR REQUESTEDDURATION CONDITION
csr-9zbkw 45m kubernetes.io/kube-apiserver-client system:multus:ctl-0 24h Approved,Issued
csr-btxhn 72m kubernetes.io/kube-apiserver-client system:multus:wrk-3 24h Approved,Issued
csr-dhxvc 102m kubernetes.io/kube-apiserver-client system:ovn-node:wrk-1 24h Approved,Issued
csr-l6l2g 40m kubernetes.io/kube-apiserver-client system:ovn-node:ctl-2 24h Approved,Issued
csr-q6jxs 8s kubernetes.io/kubelet-serving system:node:wrk-5 <none> Pending
csr-rp8wg 15m kubernetes.io/kubelet-serving system:node:wrk-5 <none> Pending
csr-tplw4 16m kubernetes.io/kube-apiserver-client system:multus:wrk-0 24h Approved,Issued
csr-v592h 30m kubernetes.io/kubelet-serving system:node:wrk-5 <none> Pending
After approving these requests (oc adm certificate approve ...), the node seems healthy and I am able to successfully schedule pods on the node.
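For reference, pending requests like these can be approved in one pass with the usual OpenShift one-liner for CSRs that have no status yet (a sketch, equivalent to approving each one by name):
oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs --no-run-if-empty oc adm certificate approve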
There appears to be some issue with PV access on wrk-5. While the pods start and volumes mount successfully, actually writing to those volumes seems to block indefinitely.
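A bounded write makes the hang visible without blocking a shell forever (a sketch; /mnt/data is a placeholder for whatever path the PVC is mounted at inside the pod):
# inside a pod with the PVC mounted: either completes quickly or gets killed after 30s
timeout 30 dd if=/dev/zero of=/mnt/data/write-test bs=1M count=10 oflag=direct && echo "write ok" || echo "write blocked or failed"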
Should we replace wrk-5 with a different server?
@hpdempsey I don't think we have a hardware problem, but if someone else wants to give that a shot they should feel free to have at it. Simply removing the node from the cluster and re-adding it should accomplish the same thing. Since this is a test environment, it seems like a good opportunity to figure out what's going on so that we understand better next time.
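For the record, taking a worker out of the cluster usually looks roughly like this (a sketch, not the exact commands used here; after the machine rejoins, its pending kubelet-serving CSRs have to be approved again, as above):
oc adm cordon wrk-5
oc adm drain wrk-5 --ignore-daemonsets --delete-emptydir-data --force
oc delete node wrk-5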
From my perspective it looks more like a networking issue.
The node is logging this every few seconds:
Nov 08 23:08:56 wrk-5 kernel: rbd: rbd0: encountered watch error: -107
Nov 08 23:10:59 wrk-5 kernel: rbd: rbd1: encountered watch error: -107
Nov 08 23:11:02 wrk-5 kernel: rbd: rbd0: encountered watch error: -107
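Those -107 errors are ENOTCONN, i.e. the kernel rbd client has lost its connection to the Ceph cluster. A quick way to see which mapped images are affected and what the kernel is logging (a sketch, assuming a debug shell on wrk-5 is possible at this point):
# rbd block devices currently mapped on the node
oc debug node/wrk-5 -- chroot /host lsblk | grep rbd
# recent kernel messages from the rbd client
oc debug node/wrk-5 -- chroot /host journalctl -k --since "1 hour ago" | grep rbd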
@larsks Reading through the thread, it seems like the easiest solution would be to remove the node from the cluster and re-add it. @schwesig, am I clear to do that now?
@tssala23, yes please, proceed.
@schwesig I have removed the node and added it back; you can check whether the problem is still happening.
@tssala23 thanks, will check
I was able to start a new notebook. Kruize is informed and will run their tests.
I can close this issue. The CoreDNS problem is solved and the node now uses the correct IP (192.168.50.93 instead of 192.168.50.98). The original error referenced the stale .98 address:
Error from server: Get "https://192.168.50.98:10250/containerLogs/openshift-kni-infra/coredns-wrk-5/coredns": dial tcp 192.168.50.98:10250: connect: no route to host
Description
An ImagePullBackOff error is observed on test-2-nerc.
Different applications were tried, but the same error is observed. The applications are trying to run on the wrk-5 node.