Same issue when creating a new VM: the virt-launcher pod gets stuck, for example in namespace "stormshift-microshift".
Looks like we have a general problem with the ucs56/ucs57 nodes:
oc get pods -A -o wide | grep -v Completed | grep -v Running
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
openshift-cnv centos-7-image-cron-7a375378-28660012-xlf6s 0/1 ContainerCreating 0 3d13h <none> ucs57 <none> <none>
openshift-cnv centos-stream8-image-cron-2da55196-28660012-bccdn 0/1 ContainerCreating 0 3d13h <none> ucs57 <none> <none>
openshift-cnv centos-stream9-image-cron-3832a6ff-28660012-8bbtp 0/1 ContainerCreating 0 3d13h <none> ucs57 <none> <none>
openshift-cnv fedora-image-cron-2336cc39-28660012-5rkn7 0/1 ContainerCreating 0 3d13h <none> ucs57 <none> <none>
openshift-image-registry image-pruner-28660320-82h68 0/1 ContainerCreating 0 3d8h <none> ucs57 <none> <none>
openshift-marketplace acm-custom-registry-bn72q 0/1 ContainerCreating 0 3d16h <none> ucs57 <none> <none>
openshift-marketplace multiclusterengine-catalog-cqrdr 0/1 ContainerCreating 0 3d17h <none> ucs57 <none> <none>
openshift-pipelines tekton-resource-pruner-r27k7-28660800-28xb6 0/1 ContainerCreating 0 3d <none> ucs57 <none> <none>
rbohne-hcp-rhods certified-operators-catalog-8d57f86d6-2fktc 0/1 ContainerCreating 0 5h37m <none> ucs56 <none> <none>
rbohne-hcp-rhods community-operators-catalog-b4c8fddf8-4fqws 0/1 ContainerCreating 0 3h7m <none> ucs56 <none> <none>
rbohne-hcp-rhods importer-prime-aae4a260-a506-4616-921f-78c117be02a0 0/2 Init:0/1 0 13m <none> ucs57 <none> <none>
rbohne-hcp-rhods importer-prime-abbd9cb3-d101-4a90-93fa-97b4bc0280d5 0/2 Init:0/1 0 15m <none> ucs57 <none> <none>
rbohne-hcp-rhods olm-collect-profiles-28660527-l2rn8 0/1 ContainerCreating 0 3d4h <none> ucs57 <none> <none>
rbohne-hcp-rhods olm-collect-profiles-28661967-xb9fx 0/1 ContainerCreating 0 2d4h <none> ucs57 <none> <none>
rbohne-hcp-rhods olm-collect-profiles-28663407-7ng7c 0/1 ContainerCreating 0 28h <none> ucs57 <none> <none>
rbohne-hcp-rhods olm-collect-profiles-28664847-8dbhq 0/1 ContainerCreating 0 4h50m <none> ucs57 <none> <none>
rbohne-hcp-rhods redhat-marketplace-catalog-7977bb8dd7-t8bzj 0/1 ContainerCreating 0 11h <none> ucs56 <none> <none>
rbohne-hcp-rhods redhat-operators-catalog-6f6575d9c4-l7lq5 0/1 ContainerCreating 0 10h <none> ucs56 <none> <none>
rbohne-hcp-sendling-ingress certified-operators-catalog-75fbf8f964-rwgq7 0/1 ContainerCreating 0 5h26m <none> ucs56 <none> <none>
rbohne-hcp-sendling-ingress community-operators-catalog-6d5c96fdd8-lgcgn 0/1 ContainerCreating 0 176m <none> ucs56 <none> <none>
rbohne-hcp-sendling-ingress olm-collect-profiles-28660503-s4pch 0/1 ContainerCreating 0 3d5h <none> ucs57 <none> <none>
rbohne-hcp-sendling-ingress olm-collect-profiles-28661943-9wj98 0/1 ContainerCreating 0 2d5h <none> ucs57 <none> <none>
rbohne-hcp-sendling-ingress olm-collect-profiles-28663383-jqjhr 0/1 ContainerCreating 0 29h <none> ucs57 <none> <none>
rbohne-hcp-sendling-ingress olm-collect-profiles-28664823-svrlq 0/1 ContainerCreating 0 5h14m <none> ucs57 <none> <none>
rbohne-hcp-sendling-ingress redhat-marketplace-catalog-769b96bb8c-ldzzx 0/1 ContainerCreating 0 11h <none> ucs56 <none> <none>
rbohne-hcp-sendling-ingress redhat-operators-catalog-6ffbd47bb6-7l9vc 0/1 ContainerCreating 0 9h <none> ucs56 <none> <none>
stormshift-microshift virt-launcher-ushift08-cw2ww 0/1 ContainerCreating 0 19h <none> ucs57 <none> 1/1
(combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_virt-launcher-ushift08-cw2ww_stormshift-microshift_6e292806-0748-46dc-b90e-8b8767e0c409_0(12a7dca9a96acfe3a633aec9fbc5d7093acd6cff51a85e839960b8e19d1a8a79): error adding pod stormshift-microshift_virt-launcher-ushift08-cw2ww to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): CNI request failed with status 400: '&{ContainerID:12a7dca9a96acfe3a633aec9fbc5d7093acd6cff51a85e839960b8e19d1a8a79 Netns:/var/run/netns/4fa0ce5f-0c5f-447e-826d-30ddd55763f7 IfName:eth0
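Before restarting anything, it may be worth checking whether multus itself is healthy on the affected node. A minimal sketch, assuming a default OpenShift install (the pod name is a placeholder):
$ oc -n openshift-multus get pods -o wide --field-selector spec.nodeName=ucs57
$ oc -n openshift-multus logs <multus pod on ucs57> -c kube-multus --tail=50   # container name assumed from the default multus DaemonSet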
=> https://access.redhat.com/solutions/7042208 - the old KCS article did not help...
Let's try:
https://hackmd.io/@mjace/H1fJuv5Ap?utm_source=preview-mode&utm_medium=rec
$ oc get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
ovnkube-control-plane-6c569d8d4b-5fc4q 2/2 Running 1 (36d ago) 73d 10.32.96.5 inf5 <none> <none>
ovnkube-control-plane-6c569d8d4b-df5n9 2/2 Running 0 73d 10.32.96.4 inf4 <none> <none>
ovnkube-node-5fnmg 8/8 Running 17 73d 10.32.96.8 inf8 <none> <none>
ovnkube-node-bf9kn 8/8 Running 9 (73d ago) 73d 10.32.96.4 inf4 <none> <none>
ovnkube-node-kftrz 8/8 Running 9 (73d ago) 73d 10.32.96.6 inf6 <none> <none>
ovnkube-node-nb8fs 8/8 Running 8 73d 10.32.96.44 inf44 <none> <none>
ovnkube-node-tx28h 8/8 Running 8 73d 10.32.96.57 ucs57 <none> <none>
ovnkube-node-vkdv2 8/8 Running 9 (73d ago) 73d 10.32.96.5 inf5 <none> <none>
ovnkube-node-vn27s 8/8 Running 16 73d 10.32.96.7 inf7 <none> <none>
ovnkube-node-wzfnp 8/8 Running 8 73d 10.32.96.56 ucs56 <none> <none>
$ oc delete pods ovnkube-control-plane-6c569d8d4b-5fc4q ovnkube-control-plane-6c569d8d4b-df5n9 ovnkube-node-tx28h ovnkube-node-wzfnp
pod "ovnkube-control-plane-6c569d8d4b-5fc4q" deleted
pod "ovnkube-control-plane-6c569d8d4b-df5n9" deleted
pod "ovnkube-node-tx28h" deleted
pod "ovnkube-node-wzfnp" deleted
$ oc get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
ovnkube-control-plane-6c569d8d4b-fxvqd 2/2 Running 0 61s 10.32.96.6 inf6 <none> <none>
ovnkube-control-plane-6c569d8d4b-j2j6g 2/2 Running 0 61s 10.32.96.5 inf5 <none> <none>
ovnkube-node-5fnmg 8/8 Running 17 73d 10.32.96.8 inf8 <none> <none>
ovnkube-node-bf9kn 8/8 Running 9 (73d ago) 73d 10.32.96.4 inf4 <none> <none>
ovnkube-node-dzgff 8/8 Running 0 30s 10.32.96.56 ucs56 <none> <none>
ovnkube-node-kftrz 8/8 Running 9 (73d ago) 73d 10.32.96.6 inf6 <none> <none>
ovnkube-node-nb8fs 8/8 Running 8 73d 10.32.96.44 inf44 <none> <none>
ovnkube-node-vdvmf 8/8 Running 0 30s 10.32.96.57 ucs57 <none> <none>
ovnkube-node-vkdv2 8/8 Running 9 (73d ago) 73d 10.32.96.5 inf5 <none> <none>
ovnkube-node-vn27s 8/8 Running 16 73d 10.32.96.7 inf7 <none> <none>
Solved
$ oc get pods -A -o wide | grep -v Completed | grep -v Running
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
rbohne-hcp-rhods virt-launcher-rhods-4e9414fe-qdpg2-mm5rs 0/1 ContainerCreating 0 4s <none> ucs56 <none> 1/1
rbohne-hcp-sendling-ingress olm-collect-profiles-28664823-svrlq 0/1 Error 0 5h37m 10.130.8.10 ucs57 <none> <none>
The pods above are from my HCP playground; we can ignore them for now.
Same problem again today with ucs56 - trying the workaround....
...by deleting the control plane pods AND the ovnkube-node pods on ucs56 and ucs57 (sketched below). Now the cluster is in a really strange state: the console is not working and the API/control plane is degraded. @rbo, HELP! Please!
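For the record, the deletes were along these lines (the label and field selectors are assumptions based on the default OVN-Kubernetes manifests, not a verbatim transcript):
$ oc -n openshift-ovn-kubernetes delete pods -l app=ovnkube-control-plane
$ oc -n openshift-ovn-kubernetes delete pods -l app=ovnkube-node --field-selector spec.nodeName=ucs56
$ oc -n openshift-ovn-kubernetes delete pods -l app=ovnkube-node --field-selector spec.nodeName=ucs57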
Feels like ovn-k is in an inconsistent state, e.g. this event in openshift-console when trying to restart the console:
"4m42s Warning ErrorUpdatingResource pod/downloads-54777dd798-vxmhz addLogicalPort failed for openshift-console/downloads-54777dd798-vxmhz: timed out waiting for logical switch in logical switch cache "ucs57" subnet: error getting logical switch ucs57: switch not in logical switch cache"
Trying to drain and reboot UCS56....
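For reference, the drain/reboot was roughly the following; the exact flags and the debug-pod reboot are a sketch, not a transcript:
$ oc adm drain ucs56 --ignore-daemonsets --delete-emptydir-data --force
$ oc debug node/ucs56 -- chroot /host systemctl reboot   # or reboot out-of-band via the BMC
$ oc adm uncordon ucs56   # once the node is Ready again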
Draining and rebooting ucs56 helped; the cluster looks much better now. I also needed to disable and re-enable the CNV console plugin (sketched just below).
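Toggling the plugin can be done from the console UI or by editing the console operator config; a sketch, assuming the CNV plugin is registered as "kubevirt-plugin":
$ oc get consoles.operator.openshift.io cluster -o jsonpath='{.spec.plugins}'
$ oc edit consoles.operator.openshift.io cluster   # remove "kubevirt-plugin" from spec.plugins, wait for the console to settle, then add it back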
Still wondering what the root cause is/was - do we need to reboot nodes regularly? Closing for now.
I am trying to run the job template "stormshift-update-template-vms" on ISAR AAP. The job fails: the automation-job pod in namespace "ansible-automation-platform" is stuck in state "ContainerCreating".
The event log shows these error messages:
addLogicalPort failed for ansible-automation-platform/automation-job-252-jcjdd: failed to assign pod addresses for pod default/ansible-automation-platform/automation-job-252-jcjdd on switch: ucs57, err: range is full
and
failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_automation-job-252-jcjdd_ansible-automation-platform_fbcce7b4-1feb-48d0-8067-21a2f69ab074_0(000b3c811b66882860ad874f24cbf77dafeca43b201ede10c99c6748000a1b5d): error adding pod ansible-automation-platform_automation-job-252-jcjdd to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): CNI request failed with status 400: '&{ContainerID:000b3c811b66882860ad874f24cbf77dafeca43b201ede10c99c6748000a1b5d Netns:/var/run/netns/b5ad18e1-7e53-4ac6-95ff-01737e7ae193 IfName:eth0 Args:IgnoreUnknown=1;K8S_POD_NAMESPACE=ansible-automation-platform;K8S_POD_NAME=automation-job-252-
@rbo, can you please advise?
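For anyone picking this up, two read-only checks that might help narrow down the "range is full" error: compare the pod subnet assigned to ucs57 with the number of pods actually scheduled there (the annotation key is assumed from OVN-Kubernetes defaults):
$ oc get node ucs57 -o jsonpath='{.metadata.annotations.k8s\.ovn\.org/node-subnets}'
$ oc get pods -A -o wide --field-selector spec.nodeName=ucs57 | wc -l
# If the pod count is far below the subnet size, IP addresses are probably being leaked rather than genuinely exhausted.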