Open dceara opened 1 month ago
Looks like this might be related to recent changes. We should fix this before getting the ds merge done.
It looks to me like this is fixed by https://github.com/ovn-org/ovn-kubernetes/pull/4652
I deleted the ovnk pods multiple times and am not seeing the issue. Feel free to reopen if it happens again.
I just tried on master:
# git log
commit 24108b821289b9b7ae410a9dffee8b1fcabbb24a (HEAD -> master, origin/master, origin/HEAD)
Merge: 1179e4d58 9baca6621
Author: Tim Rozet <trozet@redhat.com>
Date: Tue Aug 27 12:04:19 2024 -0400
Merge pull request #4652 from trozet/serialize_NAD_startup
Serializes Network Manager Start up
And I get the same crash.
I started kind with:
./kind.sh -ds -ic -mne -nse
Then I deleted the ovnkube-node pod corresponding to ovn-worker:
# ovnk=$(oc get pod -n ovn-kubernetes -o wide | grep ovnkube-node | grep 'ovn-worker ' | awk '{print $1}')
# oc delete pod -n ovn-kubernetes $ovnk
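The pod lookup above relies on the trailing space in `grep 'ovn-worker '` to avoid matching other nodes such as `ovn-worker2`. A minimal sketch of a stricter lookup, assuming `oc` is asked for just the pod name and node name via `-o custom-columns`; the helper name `pod_on_node` is hypothetical and not part of ovn-kubernetes:

```shell
# Hypothetical helper: read "NAME NODE" rows on stdin and print the
# ovnkube-node pod whose node column matches the argument exactly.
pod_on_node() {
    awk -v node="$1" '$1 ~ /^ovnkube-node/ && $2 == node { print $1 }'
}

# Intended usage against a live cluster (left commented out here):
# ovnk=$(oc get pod -n ovn-kubernetes --no-headers \
#     -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName \
#     | pod_on_node ovn-worker)
# oc delete pod -n ovn-kubernetes "$ovnk"
```

Comparing the node column with `==` means `ovn-worker` cannot accidentally select a pod on `ovn-worker2`.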
ovnkube fails in the same way:
F0827 18:41:40.589014 3240 ovnkube.go:137] failed to run ovnkube: failed to start node network controller: failed to start default node network controller: unable to add gateway IP route for subnet: 10.96.0.0/16, route manager: failed to add route ({Ifindex: 9 Dst: 10.96.0.0/16 Src: 169.254.0.2 Gw: 169.254.0.4 Flags: [] Table: 0 Realm: 0}): failed to apply route ({Ifindex: 9 Dst: 10.96.0.0/16 Src: 169.254.0.2 Gw: 169.254.0.4 Flags: [] Table: 254 Realm: 0}): failed to add route (gw: 169.254.0.4, subnet 10.96.0.0/16, mtu 1400, src IP 169.254.0.2): file exists
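The trailing "file exists" is the kernel's EEXIST answer to a route add, which may mean a route for that subnet was still installed on the node when the new pod started. As a rough sketch for checking this, the hypothetical helper below (`parse_route_error`, not part of ovn-kubernetes) pulls the route fields out of the last clause of the fatal log line so they can be compared with what is on the node:

```shell
# Hypothetical helper: read ovnkube log lines on stdin and print the route
# from the final "failed to add route (gw: ..., subnet ..., src IP ...)" clause.
parse_route_error() {
    sed -n 's/.*failed to add route (gw: \([^,]*\), subnet \([^,]*\), mtu \([0-9]*\), src IP \([^)]*\)).*/dst=\2 gw=\1 src=\4 mtu=\3/p'
}

# Intended usage (left commented out; assumes the attached ovnk-logs.txt):
# parse_route_error < ovnk-logs.txt
```

The reported destination can then be looked up on the node with `ip route show 10.96.0.0/16`, for example via `podman exec ovn-worker ip route show 10.96.0.0/16`, assuming the kind node container is named `ovn-worker`.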
I'm not sure it's relevant but I'm using podman on that machine.
I couldn't replicate the failure on main. Using docker.
I couldn't replicate the failure on main with docker either. Originally I was using podman, will try again.
@martinkennelly I moved back to using podman (just removed docker and installed podman and podman-docker) and now I get the same crash loop when deleting the ovnkube-node pod.
What happened?
On a freshly started kind cluster (multi-network and network segmentation enabled):
Delete an ovnkube-node pod:
The new ovnkube-node pod fails and crash loops because it fails to start the node network controller:
Logs of the ovnkube-node pod (full logs attached):
ovnk-logs.txt
What did you expect to happen?
The new ovnkube-node pod should come up without issues.
How can we reproduce it (as minimally and precisely as possible)?
Described above.
Anything else we need to know?
No response
OVN-Kubernetes version
Kubernetes version
OVN version
OVS version
Platform
OS version
Install tools
Container runtime (CRI) and version (if applicable)