stormshift / support

This repo should serve as a central source for reporting issues with stormshift
GNU General Public License v3.0

AAP failed/stuck job due to pod networking problem #187

Closed DanielFroehlich closed 1 month ago

DanielFroehlich commented 2 months ago

I am trying to run job template "stormshift-update-template-vms" on ISAR AAP. The job fails: the automation-job pod in namespace "ansible-automation-platform" is stuck in state "ContainerCreating".

Event log shows error messages:

addLogicalPort failed for ansible-automation-platform/automation-job-252-jcjdd: failed to assign pod addresses for pod default/ansible-automation-platform/automation-job-252-jcjdd on switch: ucs57, err: range is full

and

failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_automation-job-252-jcjdd_ansible-automation-platform_fbcce7b4-1feb-48d0-8067-21a2f69ab074_0(000b3c811b66882860ad874f24cbf77dafeca43b201ede10c99c6748000a1b5d): error adding pod ansible-automation-platform_automation-job-252-jcjdd to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): CNI request failed with status 400: '&{ContainerID:000b3c811b66882860ad874f24cbf77dafeca43b201ede10c99c6748000a1b5d Netns:/var/run/netns/b5ad18e1-7e53-4ac6-95ff-01737e7ae193 IfName:eth0 Args:IgnoreUnknown=1;K8S_POD_NAMESPACE=ansible-automation-platform;K8S_POD_NAME=automation-job-252-

@rbo , can you please advise?

DanielFroehlich commented 2 months ago

Same issue when creating a new VM with the virt launcher pod, example in ns "stormshift-microshift"

https://console-openshift-console.apps.isar.coe.muc.redhat.com/k8s/ns/stormshift-microshift/pods/virt-launcher-ushift08-cw2ww

rbo commented 2 months ago

Looks like we have a general problem with the ucs56/57 nodes:

$ oc get pods -A -o wide | grep -v Completed | grep -v Running
NAMESPACE                                          NAME                                                              READY   STATUS              RESTARTS         AGE     IP              NODE    NOMINATED NODE   READINESS GATES
openshift-cnv                                      centos-7-image-cron-7a375378-28660012-xlf6s                       0/1     ContainerCreating   0                3d13h   <none>          ucs57   <none>           <none>
openshift-cnv                                      centos-stream8-image-cron-2da55196-28660012-bccdn                 0/1     ContainerCreating   0                3d13h   <none>          ucs57   <none>           <none>
openshift-cnv                                      centos-stream9-image-cron-3832a6ff-28660012-8bbtp                 0/1     ContainerCreating   0                3d13h   <none>          ucs57   <none>           <none>
openshift-cnv                                      fedora-image-cron-2336cc39-28660012-5rkn7                         0/1     ContainerCreating   0                3d13h   <none>          ucs57   <none>           <none>
openshift-image-registry                           image-pruner-28660320-82h68                                       0/1     ContainerCreating   0                3d8h    <none>          ucs57   <none>           <none>
openshift-marketplace                              acm-custom-registry-bn72q                                         0/1     ContainerCreating   0                3d16h   <none>          ucs57   <none>           <none>
openshift-marketplace                              multiclusterengine-catalog-cqrdr                                  0/1     ContainerCreating   0                3d17h   <none>          ucs57   <none>           <none>
openshift-pipelines                                tekton-resource-pruner-r27k7-28660800-28xb6                       0/1     ContainerCreating   0                3d      <none>          ucs57   <none>           <none>
rbohne-hcp-rhods                                   certified-operators-catalog-8d57f86d6-2fktc                       0/1     ContainerCreating   0                5h37m   <none>          ucs56   <none>           <none>
rbohne-hcp-rhods                                   community-operators-catalog-b4c8fddf8-4fqws                       0/1     ContainerCreating   0                3h7m    <none>          ucs56   <none>           <none>
rbohne-hcp-rhods                                   importer-prime-aae4a260-a506-4616-921f-78c117be02a0               0/2     Init:0/1            0                13m     <none>          ucs57   <none>           <none>
rbohne-hcp-rhods                                   importer-prime-abbd9cb3-d101-4a90-93fa-97b4bc0280d5               0/2     Init:0/1            0                15m     <none>          ucs57   <none>           <none>
rbohne-hcp-rhods                                   olm-collect-profiles-28660527-l2rn8                               0/1     ContainerCreating   0                3d4h    <none>          ucs57   <none>           <none>
rbohne-hcp-rhods                                   olm-collect-profiles-28661967-xb9fx                               0/1     ContainerCreating   0                2d4h    <none>          ucs57   <none>           <none>
rbohne-hcp-rhods                                   olm-collect-profiles-28663407-7ng7c                               0/1     ContainerCreating   0                28h     <none>          ucs57   <none>           <none>
rbohne-hcp-rhods                                   olm-collect-profiles-28664847-8dbhq                               0/1     ContainerCreating   0                4h50m   <none>          ucs57   <none>           <none>
rbohne-hcp-rhods                                   redhat-marketplace-catalog-7977bb8dd7-t8bzj                       0/1     ContainerCreating   0                11h     <none>          ucs56   <none>           <none>
rbohne-hcp-rhods                                   redhat-operators-catalog-6f6575d9c4-l7lq5                         0/1     ContainerCreating   0                10h     <none>          ucs56   <none>           <none>
rbohne-hcp-sendling-ingress                        certified-operators-catalog-75fbf8f964-rwgq7                      0/1     ContainerCreating   0                5h26m   <none>          ucs56   <none>           <none>
rbohne-hcp-sendling-ingress                        community-operators-catalog-6d5c96fdd8-lgcgn                      0/1     ContainerCreating   0                176m    <none>          ucs56   <none>           <none>
rbohne-hcp-sendling-ingress                        olm-collect-profiles-28660503-s4pch                               0/1     ContainerCreating   0                3d5h    <none>          ucs57   <none>           <none>
rbohne-hcp-sendling-ingress                        olm-collect-profiles-28661943-9wj98                               0/1     ContainerCreating   0                2d5h    <none>          ucs57   <none>           <none>
rbohne-hcp-sendling-ingress                        olm-collect-profiles-28663383-jqjhr                               0/1     ContainerCreating   0                29h     <none>          ucs57   <none>           <none>
rbohne-hcp-sendling-ingress                        olm-collect-profiles-28664823-svrlq                               0/1     ContainerCreating   0                5h14m   <none>          ucs57   <none>           <none>
rbohne-hcp-sendling-ingress                        redhat-marketplace-catalog-769b96bb8c-ldzzx                       0/1     ContainerCreating   0                11h     <none>          ucs56   <none>           <none>
rbohne-hcp-sendling-ingress                        redhat-operators-catalog-6ffbd47bb6-7l9vc                         0/1     ContainerCreating   0                9h      <none>          ucs56   <none>           <none>
stormshift-microshift                              virt-launcher-ushift08-cw2ww                                      0/1     ContainerCreating   0                19h     <none>          ucs57   <none>           1/1
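
To see at a glance which nodes the stuck pods land on, the listing above can be tallied per node. This is a minimal sketch, not something used in this thread: the function name `count_stuck_by_node` is hypothetical, and it assumes the column layout of `oc get pods -A -o wide` shown above (STATUS in the fourth column, NODE third from the end).

```shell
# count_stuck_by_node: hypothetical helper, not part of oc.
# Reads `oc get pods -A -o wide` output on stdin and prints "<node> <count>"
# for every pod whose STATUS is neither Running nor Completed.
# NODE is taken as the third field from the end, so a RESTARTS value such as
# "1 (36d ago)" does not shift the result.
count_stuck_by_node() {
  awk 'NR > 1 && $4 != "Running" && $4 != "Completed" { n[$(NF-2)]++ }
       END { for (k in n) print k, n[k] }' | sort
}

# Usage (assumes a logged-in oc session):
#   oc get pods -A -o wide | count_stuck_by_node
```

A result concentrated on one or two nodes, as with ucs56/ucs57 here, points at a per-node networking problem rather than at the individual workloads.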

rbo commented 2 months ago

(combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_virt-launcher-ushift08-cw2ww_stormshift-microshift_6e292806-0748-46dc-b90e-8b8767e0c409_0(12a7dca9a96acfe3a633aec9fbc5d7093acd6cff51a85e839960b8e19d1a8a79): error adding pod stormshift-microshift_virt-launcher-ushift08-cw2ww to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): CNI request failed with status 400: '&{ContainerID:12a7dca9a96acfe3a633aec9fbc5d7093acd6cff51a85e839960b8e19d1a8a79 Netns:/var/run/netns/4fa0ce5f-0c5f-447e-826d-30ddd55763f7 IfName:eth0

=> https://access.redhat.com/solutions/7042208 (an old KCS article) did not help...

rbo commented 2 months ago

Mh https://github.com/k8snetworkplumbingwg/multus-cni/issues/1221

rbo commented 2 months ago

Let's try:

https://hackmd.io/@mjace/H1fJuv5Ap?utm_source=preview-mode&utm_medium=rec

$ oc get pods -o wide
NAME                                     READY   STATUS    RESTARTS      AGE   IP            NODE    NOMINATED NODE   READINESS GATES
ovnkube-control-plane-6c569d8d4b-5fc4q   2/2     Running   1 (36d ago)   73d   10.32.96.5    inf5    <none>           <none>
ovnkube-control-plane-6c569d8d4b-df5n9   2/2     Running   0             73d   10.32.96.4    inf4    <none>           <none>
ovnkube-node-5fnmg                       8/8     Running   17            73d   10.32.96.8    inf8    <none>           <none>
ovnkube-node-bf9kn                       8/8     Running   9 (73d ago)   73d   10.32.96.4    inf4    <none>           <none>
ovnkube-node-kftrz                       8/8     Running   9 (73d ago)   73d   10.32.96.6    inf6    <none>           <none>
ovnkube-node-nb8fs                       8/8     Running   8             73d   10.32.96.44   inf44   <none>           <none>
ovnkube-node-tx28h                       8/8     Running   8             73d   10.32.96.57   ucs57   <none>           <none>
ovnkube-node-vkdv2                       8/8     Running   9 (73d ago)   73d   10.32.96.5    inf5    <none>           <none>
ovnkube-node-vn27s                       8/8     Running   16            73d   10.32.96.7    inf7    <none>           <none>
ovnkube-node-wzfnp                       8/8     Running   8             73d   10.32.96.56   ucs56   <none>           <none>

$ oc delete pods ovnkube-control-plane-6c569d8d4b-5fc4q ovnkube-control-plane-6c569d8d4b-df5n9 ovnkube-node-tx28h ovnkube-node-wzfnp
pod "ovnkube-control-plane-6c569d8d4b-5fc4q" deleted
pod "ovnkube-control-plane-6c569d8d4b-df5n9" deleted
pod "ovnkube-node-tx28h" deleted
pod "ovnkube-node-wzfnp" deleted
$ oc get pods -o wide
NAME                                     READY   STATUS    RESTARTS      AGE   IP            NODE    NOMINATED NODE   READINESS GATES
ovnkube-control-plane-6c569d8d4b-fxvqd   2/2     Running   0             61s   10.32.96.6    inf6    <none>           <none>
ovnkube-control-plane-6c569d8d4b-j2j6g   2/2     Running   0             61s   10.32.96.5    inf5    <none>           <none>
ovnkube-node-5fnmg                       8/8     Running   17            73d   10.32.96.8    inf8    <none>           <none>
ovnkube-node-bf9kn                       8/8     Running   9 (73d ago)   73d   10.32.96.4    inf4    <none>           <none>
ovnkube-node-dzgff                       8/8     Running   0             30s   10.32.96.56   ucs56   <none>           <none>
ovnkube-node-kftrz                       8/8     Running   9 (73d ago)   73d   10.32.96.6    inf6    <none>           <none>
ovnkube-node-nb8fs                       8/8     Running   8             73d   10.32.96.44   inf44   <none>           <none>
ovnkube-node-vdvmf                       8/8     Running   0             30s   10.32.96.57   ucs57   <none>           <none>
ovnkube-node-vkdv2                       8/8     Running   9 (73d ago)   73d   10.32.96.5    inf5    <none>           <none>
ovnkube-node-vn27s                       8/8     Running   16            73d   10.32.96.7    inf7    <none>           <none>
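
The pod names passed to `oc delete pods` above were picked by hand from the listing. A minimal sketch of selecting the `ovnkube-node` pods for given nodes instead follows; the function name `ovnkube_pods_on_nodes` is hypothetical, and the namespace `openshift-ovn-kubernetes` in the usage comment is an assumption, since the thread does not state which namespace the commands were run in.

```shell
# ovnkube_pods_on_nodes NODE...: hypothetical helper, not part of oc.
# Reads `oc get pods -o wide` output (single namespace, no -A) on stdin and
# prints the names of ovnkube-node pods scheduled on any of the given nodes.
ovnkube_pods_on_nodes() {
  awk -v nodes="$*" '
    BEGIN { n = split(nodes, a, " "); for (i = 1; i <= n; i++) want[a[i]] = 1 }
    # NODE is the third field from the end, regardless of the RESTARTS format.
    NR > 1 && $1 ~ /^ovnkube-node-/ && ($(NF-2) in want) { print $1 }
  '
}

# Usage (assumption: the pods live in namespace openshift-ovn-kubernetes):
#   oc -n openshift-ovn-kubernetes get pods -o wide \
#     | ovnkube_pods_on_nodes ucs56 ucs57
```

The sketch only prints names and deliberately leaves the deletion as a manual step: later in this thread, repeating the workaround left the cluster in a degraded state.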

rbo commented 2 months ago

Solved

$ oc get pods -A -o wide | grep -v Completed | grep -v Running 
NAMESPACE                                          NAME                                                              READY   STATUS              RESTARTS         AGE     IP              NODE    NOMINATED NODE   READINESS GATES
rbohne-hcp-rhods                                   virt-launcher-rhods-4e9414fe-qdpg2-mm5rs                          0/1     ContainerCreating   0                4s      <none>          ucs56   <none>           1/1
rbohne-hcp-sendling-ingress                        olm-collect-profiles-28664823-svrlq                               0/1     Error               0                5h37m   10.130.8.10     ucs57   <none>           <none>

The pods above are from my HCP playground; we can ignore them for now.

DanielFroehlich commented 1 month ago

Same problem again today with ucs56 - trying the workaround....

DanielFroehlich commented 1 month ago

...by deleting the control plane pods AND the ovnkube-node pods on ucs56 and ucs57. Now the cluster is in a really strange state: console not working, API/control plane degraded. @rbo , HELP! Please!

DanielFroehlich commented 1 month ago

Feels like ovnk is in an inconsistent state, e.g. this event in openshift-console when trying to restart the console:

"4m42s Warning ErrorUpdatingResource pod/downloads-54777dd798-vxmhz addLogicalPort failed for openshift-console/downloads-54777dd798-vxmhz: timed out waiting for logical switch in logical switch cache "ucs57" subnet: error getting logical switch ucs57: switch not in logical switch cache"

DanielFroehlich commented 1 month ago

Trying to drain and reboot UCS56....
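
The exact commands used for the drain and reboot are not recorded in this thread. As a rough sketch of what such a sequence typically looks like with the `oc` CLI, the hypothetical helper `print_drain_reboot_plan` below only prints the commands so they can be reviewed before anything is run; the chosen flags are an assumption and may need adjusting (for example for nodes hosting VMs).

```shell
# print_drain_reboot_plan NODE: hypothetical helper; prints commands, runs nothing.
# Assumed sequence: drain the node, reboot it through a debug pod, uncordon it
# once it is Ready again.
print_drain_reboot_plan() {
  node="$1"
  printf '%s\n' \
    "oc adm drain ${node} --ignore-daemonsets --delete-emptydir-data" \
    "oc debug node/${node} -- chroot /host systemctl reboot" \
    "oc adm uncordon ${node}"
}

# Usage:
#   print_drain_reboot_plan ucs56
```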

DanielFroehlich commented 1 month ago

... that helped, the cluster looks way better now. I also needed to disable and re-enable the CNV console plugin.

DanielFroehlich commented 1 month ago

Still wondering what the root cause is/was. Might we need to reboot nodes regularly? Closing for now.