nerc-project / operations

Issues related to the operation of the NERC OpenShift environment

bug: kruize: ImagePullBackOff error on test-2-nerc #804

Closed kusumachalasani closed 1 week ago

kusumachalasani commented 2 weeks ago

Description

ImagePullBackOff error is observed on test-2-nerc.

Tried different applications, but the same error is observed each time. The application is trying to run on the wrk-5 node.

[screenshot]

bharathappali commented 2 weeks ago

I suspect the issue is with the network, as I can't retrieve logs from the pods running on the wrk-5 node:

[abharath@abharath-thinkpadt14sgen2i ~]$ oc logs -f nvidia-mig-manager-22t4c -n nvidia-gpu-operator
Defaulted container "nvidia-mig-manager" out of: nvidia-mig-manager, toolkit-validation (init)
Error from server: Get "https://192.168.50.98:10250/containerLogs/nvidia-gpu-operator/nvidia-mig-manager-22t4c/nvidia-mig-manager?follow=true": dial tcp 192.168.50.98:10250: connect: no route to host
schwesig commented 2 weeks ago

/CC @tssala23 @dystewart

schwesig commented 2 weeks ago
2024-11-07T11:38:15.000Z [Normal] Pulling image "image-registry.openshift-image-registry.svc:5000/redhat-ods-applications/minimal-gpu:2024.1"
2024-11-07T11:38:02.000Z [Warning] Error: ImagePullBackOff
2024-11-07T11:38:02.000Z [Normal] Back-off pulling image "image-registry.openshift-image-registry.svc:5000/redhat-ods-applications/minimal-gpu:2024.1"
2024-11-07T11:37:48.000Z [Warning] Error: ErrImagePull
2024-11-07T11:37:48.000Z [Warning] Failed to pull image "image-registry.openshift-image-registry.svc:5000/redhat-ods-applications/minimal-gpu:2024.1": rpc error: code = DeadlineExceeded desc = pinging container registry image-registry.openshift-image-registry.svc:5000: Get "https://image-registry.openshift-image-registry.svc:5000/v2/": dial tcp 172.30.21.204:5000: i/o timeout
2024-11-07T11:36:47.000Z [Normal] Started container oauth-proxy
2024-11-07T11:36:47.000Z [Normal] Created container oauth-proxy
2024-11-07T11:36:46.000Z [Normal] Successfully pulled image "registry.redhat.io/openshift4/ose-oauth-proxy@sha256:4bef31eb993feb6f1096b51b4876c65a6fb1f4401fee97fa4f4542b6b7c9bc46" in 16.491s (16.491s including waiting)
2024-11-07T11:36:29.000Z [Normal] Pulling image "registry.redhat.io/openshift4/ose-oauth-proxy@sha256:4bef31eb993feb6f1096b51b4876c65a6fb1f4401fee97fa4f4542b6b7c9bc46"
2024-11-07T11:35:29.000Z [Normal] Add eth0 [10.128.4.86/23] from ovn-kubernetes
2024-11-07T11:34:38.000Z [Normal] AttachVolume.Attach succeeded for volume "pvc-abea051c-53a6-43e8-8792-b8330bc9ea6d"
2024-11-07T11:34:37.806Z [Normal] Successfully assigned rhods-notebooks/jupyter-nb-schwesig-0 to wrk-5
schwesig commented 2 weeks ago
❯ oc debug node/wrk-5
Starting pod/wrk-5-debug ...
To use host binaries, run `chroot /host`
Pod IP: 192.168.50.98
If you don't see a command prompt, try pressing enter.

Removing debug pod ...
Error from server: error dialing backend: dial tcp 192.168.50.98:10250: connect: no route to host
schwesig commented 2 weeks ago
❯ oc logs coredns-wrk-5
Defaulted container "coredns" out of: coredns, coredns-monitor, render-config-coredns (init)
Error from server: Get "https://192.168.50.98:10250/containerLogs/openshift-kni-infra/coredns-wrk-5/coredns": dial tcp 192.168.50.98:10250: connect: no route to host
schwesig commented 2 weeks ago
❯ oc events -w
LAST SEEN                 TYPE      REASON       OBJECT              MESSAGE
84s (x12799 over 7d19h)   Warning   ProbeError   Pod/coredns-wrk-5   Liveness probe error: Get "http://192.168.50.98:18080/health": dial tcp 192.168.50.98:18080: connect: no route to host
6m42s (x31518 over 7d19h)   Warning   BackOff      Pod/coredns-wrk-5   Back-off restarting failed container coredns in pod coredns-wrk-5_openshift-kni-infra(3377e52796270e7a373902b6aa0d1e78)
5m24s (x2233 over 7d19h)    Warning   Unhealthy    Pod/keepalived-wrk-5   Liveness probe failed:
155m (x14 over 7d19h)       Warning   Unhealthy    Pod/keepalived-wrk-5   Liveness probe failed: command timed out
1s (x31548 over 7d19h)      Warning   BackOff      Pod/coredns-wrk-5      Back-off restarting failed container coredns in pod coredns-wrk-5_openshift-kni-infra(3377e52796270e7a373902b6aa0d1e78)
0s                          Normal    Pulled       Pod/wrk-5-debug        Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9d6201c776053346ebce8f90c34797a7a7c05898008e17f3ba9673f5f14507b0" already present on machine
0s                          Normal    Created      Pod/wrk-5-debug        Created container container-00
0s                          Normal    Started      Pod/wrk-5-debug        Started container container-00
0s                          Normal    Killing      Pod/wrk-5-debug        Stopping container container-00
0s (x2234 over 7d19h)       Warning   Unhealthy    Pod/keepalived-wrk-5   Liveness probe failed:
0s (x2235 over 7d19h)       Warning   Unhealthy    Pod/keepalived-wrk-5   Liveness probe failed:
schwesig commented 2 weeks ago

Trying to reach 192.168.50.98 from wrk-4; not possible:

sh-5.1# arping 192.168.50.98
ARPING 192.168.50.98 from 192.168.50.149 br-ex
^CSent 4 probes (4 broadcast(s))
Received 0 response(s)
sh-5.1# arping 192.168.50.98
ARPING 192.168.50.98 from 192.168.50.149 br-ex
^CSent 4 probes (4 broadcast(s))
Received 0 response(s)
sh-5.1# ssh -v core@192.168.50.98
OpenSSH_8.7p1, OpenSSL 3.0.7 1 Nov 2022
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Reading configuration data /etc/ssh/ssh_config.d/50-redhat.conf
debug1: Reading configuration data /etc/crypto-policies/back-ends/openssh.config
debug1: configuration requests final Match pass
debug1: re-parsing configuration
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Reading configuration data /etc/ssh/ssh_config.d/50-redhat.conf
debug1: Reading configuration data /etc/crypto-policies/back-ends/openssh.config
debug1: Connecting to 192.168.50.98 [192.168.50.98] port 22.
debug1: connect to address 192.168.50.98 port 22: No route to host
ssh: connect to host 192.168.50.98 port 22: No route to host
schwesig commented 2 weeks ago

sh-5.1# ip route
default via 192.168.50.1 dev br-ex proto dhcp src 192.168.50.149 metric 48 
default via 10.0.120.1 dev eno2 proto dhcp src 10.0.123.127 metric 102 
default via 10.85.0.10 dev vlan2076 proto dhcp src 10.85.2.117 metric 400 
10.0.120.0/22 dev eno2 proto kernel scope link src 10.0.123.127 metric 102 
10.30.9.0/24 via 10.85.0.1 dev vlan2076 proto dhcp src 10.85.2.117 metric 400 
10.85.0.0/22 dev vlan2076 proto kernel scope link src 10.85.2.117 metric 400 
10.128.0.0/14 via 10.129.2.1 dev ovn-k8s-mp0 
10.129.2.0/23 dev ovn-k8s-mp0 proto kernel scope link src 10.129.2.2 
10.247.236.0/25 via 10.0.120.1 dev eno2 proto dhcp src 10.0.123.127 metric 102 
10.255.116.0/23 via 10.0.120.1 dev eno2 proto dhcp src 10.0.123.127 metric 102 
140.247.236.0/25 via 10.0.120.1 dev eno2 proto dhcp src 10.0.123.127 metric 102 
169.254.169.0/29 dev br-ex proto kernel scope link src 169.254.169.2 
169.254.169.1 dev br-ex src 192.168.50.149 
169.254.169.3 via 10.129.2.1 dev ovn-k8s-mp0 
169.254.169.254 via 192.168.50.11 dev br-ex proto dhcp src 192.168.50.149 metric 48 
169.254.169.254 via 10.0.121.2 dev eno2 proto dhcp src 10.0.123.127 metric 102 
172.30.0.0/16 via 169.254.169.4 dev br-ex src 169.254.169.2 mtu 1400 
192.168.50.0/24 dev br-ex proto kernel scope link src 192.168.50.149 metric 48 
schwesig commented 2 weeks ago
❯ oc debug node/ctl-1
Starting pod/ctl-1-debug ...
To use host binaries, run `chroot /host`
Pod IP: 192.168.50.114
If you don't see a command prompt, try pressing enter.
sh-5.1# chroot /host
sh-5.1# ping 192.168.50.98 
PING 192.168.50.98 (192.168.50.98) 56(84) bytes of data.
From 192.168.50.114 icmp_seq=1 Destination Host Unreachable
From 192.168.50.114 icmp_seq=2 Destination Host Unreachable
From 192.168.50.114 icmp_seq=3 Destination Host Unreachable
From 192.168.50.114 icmp_seq=4 Destination Host Unreachable
From 192.168.50.114 icmp_seq=5 Destination Host Unreachable
From 192.168.50.114 icmp_seq=6 Destination Host Unreachable
From 192.168.50.114 icmp_seq=7 Destination Host Unreachable
From 192.168.50.114 icmp_seq=8 Destination Host Unreachable
From 192.168.50.114 icmp_seq=12 Destination Host Unreachable
From 192.168.50.114 icmp_seq=15 Destination Host Unreachable
From 192.168.50.114 icmp_seq=16 Destination Host Unreachable
From 192.168.50.114 icmp_seq=18 Destination Host Unreachable
From 192.168.50.114 icmp_seq=21 Destination Host Unreachable
From 192.168.50.114 icmp_seq=22 Destination Host Unreachable
From 192.168.50.114 icmp_seq=24 Destination Host Unreachable
From 192.168.50.114 icmp_seq=25 Destination Host Unreachable
From 192.168.50.114 icmp_seq=26 Destination Host Unreachable
From 192.168.50.114 icmp_seq=27 Destination Host Unreachable
From 192.168.50.114 icmp_seq=28 Destination Host Unreachable
From 192.168.50.114 icmp_seq=29 Destination Host Unreachable
From 192.168.50.114 icmp_seq=30 Destination Host Unreachable
From 192.168.50.114 icmp_seq=31 Destination Host Unreachable
^C
--- 192.168.50.98 ping statistics ---
32 packets transmitted, 0 received, +22 errors, 100% packet loss, time 31749ms
pipe 4
schwesig commented 2 weeks ago

wrk-5 reports 192.168.50.98, but its actual address is 192.168.50.93

schwesig commented 2 weeks ago
❯ oc describe node/wrk-5 | grep .93
                    k8s.ovn.org/host-cidrs: ["10.0.120.39/22","10.85.3.145/22","192.168.50.93/24"]
                      {"default":{"mode":"local","interface-id":"br-ex_wrk-5","mac-address":"08:8f:c3:a6:03:8e","ip-addresses":["192.168.50.93/24"],"ip-address"...
                    k8s.ovn.org/node-primary-ifaddr: {"ipv4":"192.168.50.93/24"}
  System UUID:                                 0d3eb5fe-aba0-11ee-baa4-0a8fc3a60393

vs

❯ oc get node -o wide wrk-5
NAME    STATUS   ROLES    AGE   VERSION           INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                                                KERNEL-VERSION                 CONTAINER-RUNTIME
wrk-5   Ready    worker   8d    v1.29.5+29c95f3   192.168.50.98   <none>        Red Hat Enterprise Linux CoreOS 416.94.202406172220-0   5.14.0-427.22.1.el9_4.x86_64   cri-o://1.29.5-5.rhaos4.16.git7032128.el9
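For reference, a hedged sketch of how the two addresses could be compared directly. The jsonpath queries in the comments are standard `oc`/`kubectl` syntax; the literal values below are copied from the output above so the check runs stand-alone.

```shell
# Compare the OVN annotation address with the node's registered InternalIP.
# On the cluster, the two values would come from:
#   oc get node wrk-5 -o jsonpath='{.metadata.annotations.k8s\.ovn\.org/node-primary-ifaddr}'
#   oc get node wrk-5 -o jsonpath='{.status.addresses[?(@.type=="InternalIP")].address}'
annotation_ip="192.168.50.93"   # from k8s.ovn.org/node-primary-ifaddr above
internal_ip="192.168.50.98"     # INTERNAL-IP column of `oc get node -o wide`
if [ "$annotation_ip" != "$internal_ip" ]; then
  echo "MISMATCH: annotation=$annotation_ip internal-ip=$internal_ip"
fi
```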
schwesig commented 1 week ago

Just tagging for easier finding /CC @jtriley @larsks @tzumainn

tzumainn commented 1 week ago

I'm actually not sure where the 192.168.50.98 came from in the first place. The Neutron port associated with MOC-R8PAC23U39 has 192.168.50.93 as its IP address; it was created on October 29th and hasn't been updated since, so I don't think it's been modified. And I don't see any port in the inventory that has the 192.168.50.98 IP. So I'm guessing that configuration came from outside of ESI?

In any case - I'm not that familiar with OpenShift configuration, but is it possible to just update the worker IP? Failing that, I could update the IP address of the port in ESI (and maybe reboot the machine). Let me know!

larsks commented 1 week ago

@schwesig I rebooted wrk-5 this morning, which caused it to re-register with the cluster. This resulted in a small number of pending certificate signing requests:

$ k get csr
NAME        AGE    SIGNERNAME                            REQUESTOR               REQUESTEDDURATION   CONDITION
csr-9zbkw   45m    kubernetes.io/kube-apiserver-client   system:multus:ctl-0     24h                 Approved,Issued
csr-btxhn   72m    kubernetes.io/kube-apiserver-client   system:multus:wrk-3     24h                 Approved,Issued
csr-dhxvc   102m   kubernetes.io/kube-apiserver-client   system:ovn-node:wrk-1   24h                 Approved,Issued
csr-l6l2g   40m    kubernetes.io/kube-apiserver-client   system:ovn-node:ctl-2   24h                 Approved,Issued
csr-q6jxs   8s     kubernetes.io/kubelet-serving         system:node:wrk-5       <none>              Pending
csr-rp8wg   15m    kubernetes.io/kubelet-serving         system:node:wrk-5       <none>              Pending
csr-tplw4   16m    kubernetes.io/kube-apiserver-client   system:multus:wrk-0     24h                 Approved,Issued
csr-v592h   30m    kubernetes.io/kubelet-serving         system:node:wrk-5       <none>              Pending

After approving these requests (oc adm certificate approve ...), the node seems healthy and I am able to successfully schedule pods on the node.
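For anyone hitting this later, a sketch of bulk-approving the pending requests. The `awk` filter below runs against a sample of the listing above so it can be shown stand-alone; on the cluster you would pipe the real `oc get csr` output instead.

```shell
# Select the names of Pending CSRs from `oc get csr`-style output (sample data here).
csr_list='csr-q6jxs   8s    kubernetes.io/kubelet-serving        system:node:wrk-5    <none>   Pending
csr-tplw4   16m   kubernetes.io/kube-apiserver-client  system:multus:wrk-0  24h      Approved,Issued'
pending=$(printf '%s\n' "$csr_list" | awk '$NF == "Pending" {print $1}')
echo "$pending"
# On the cluster, the equivalent would be:
#   oc get csr | awk '$NF == "Pending" {print $1}' | xargs -r oc adm certificate approve
```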

larsks commented 1 week ago

There appears to be some issue with PV access on wrk-5. While the pods start and volumes mount successfully, actually writing to those volumes seems to block indefinitely.

hpdempsey commented 1 week ago

Should we replace wrk-5 with a different server?

larsks commented 1 week ago

@hpdempsey I don't think we have a hardware problem, but if someone else wants to give that a shot they should feel free to have at it. Simply removing the node from the cluster and re-adding it should accomplish the same thing. Since this is a test environment, it seems like a good opportunity to figure out what's going on so that we understand better next time.

From my perspective it looks more like a networking issue.
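The remove/re-add sequence mentioned above can be sketched with standard `oc` commands. The steps are printed rather than executed here, since they need the live cluster, and how the node rejoins afterwards depends on the provisioning setup.

```shell
# Steps to evict wrk-5 and remove it from the cluster (printed, not executed).
node="wrk-5"
cat <<EOF
oc adm cordon $node
oc adm drain $node --ignore-daemonsets --delete-emptydir-data --force
oc delete node $node
EOF
# Once the machine's kubelet restarts it re-registers; the resulting Pending
# CSRs then need `oc adm certificate approve`, as in the earlier comment.
```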

larsks commented 1 week ago

The node is logging this every few seconds:

Nov 08 23:08:56 wrk-5 kernel: rbd: rbd0: encountered watch error: -107
Nov 08 23:10:59 wrk-5 kernel: rbd: rbd1: encountered watch error: -107
Nov 08 23:11:02 wrk-5 kernel: rbd: rbd0: encountered watch error: -107
tssala23 commented 1 week ago

@larsks reading through the thread, it seems the easiest solution would be to remove the node from the cluster and re-add it. @schwesig, am I clear to do that now?

schwesig commented 1 week ago

@tssala23 , yes please, proceed.

tssala23 commented 1 week ago

@schwesig I have removed the node and added it back; please check whether the problem is still happening.

schwesig commented 1 week ago

@tssala23 thanks, will check

schwesig commented 1 week ago

I was able to start a new notebook. The kruize team has been informed and will run their tests.

schwesig commented 1 week ago

I can close this issue. The CoreDNS problem is solved and the node has the correct IP again.

schwesig commented 1 week ago

[screenshot]

schwesig commented 1 week ago

[screenshot]

[screenshot]