ovn-org / ovn-kubernetes

A robust Kubernetes networking platform
https://ovn-kubernetes.io/
Apache License 2.0

ovnkube-node crash loops when trying to restart #4654

Open · opened by dceara 1 month ago

dceara commented 1 month ago

What happened?

On a freshly started kind cluster (multi-network and network segmentation enabled):

$ ./kind.sh -ds -ic -mne -nse

Delete an ovnkube-node pod (see below for how $ovnk is set):

$ oc delete pod -n ovn-kubernetes $ovnk
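For reference, $ovnk here holds the name of the ovnkube-node pod to delete; the reproduction later in this thread selects it like this:

```console
$ ovnk=$(oc get pod -n ovn-kubernetes -o wide | grep ovnkube-node | grep 'ovn-worker ' | awk '{print $1}')
```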

The replacement ovnkube-node pod crash loops because it fails to start the node network controller:

$ oc get pod -n ovn-kubernetes
NAME                                     READY   STATUS             RESTARTS      AGE
ovnkube-control-plane-589c64c694-p4bsw   1/1     Running            0             17h
ovnkube-identity-794d5bb9dd-9m74d        1/1     Running            0             17h
ovnkube-node-8kkdv                       6/6     Running            0             17h
ovnkube-node-nq2m7                       6/6     Running            0             17h
ovnkube-node-qvvcq                       5/6     CrashLoopBackOff   2 (23s ago)   3m28s
ovs-node-6fkft                           1/1     Running            0             17h
ovs-node-h4l5k                           1/1     Running            0             17h
ovs-node-qxqzz                           1/1     Running 

Logs of the ovnkube-node pod (full logs attached):

F0827 09:07:04.311697  375355 ovnkube.go:137] failed to run ovnkube: failed to start node network controller: failed to start default node network controller: unable to add gateway IP route for subnet: 10.96.0.0/16, route manager: failed to add route ({Ifindex: 12 Dst: 10.96.0.0/16 Src: 169.254.0.2 Gw: 169.254.0.4 Flags: [] Table: 0 Realm: 0}): failed to apply route ({Ifindex: 12 Dst: 10.96.0.0/16 Src: 169.254.0.2 Gw: 169.254.0.4 Flags: [] Table: 254 Realm: 0}): failed to add route (gw: 169.254.0.4, subnet 10.96.0.0/16, mtu 1400, src IP 169.254.0.2): file exists

ovnk-logs.txt
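The trailing "file exists" is the kernel's EEXIST: a route for 10.96.0.0/16 via 169.254.0.4 is already installed in the main table (254), presumably left behind by the previous ovnkube-node instance. The Route string in the log matches vishvananda/netlink's formatting, so below is a minimal Go sketch of the failure mode and one idempotent alternative; this is an illustration under that assumption, not the actual ovn-kubernetes route-manager code.

```go
// Minimal sketch: re-adding a route that survived a pod restart returns
// EEXIST ("file exists"), while RouteReplace is idempotent. Assumes the
// vishvananda/netlink API; the field values are taken from the log above.
package main

import (
	"errors"
	"fmt"
	"net"

	"github.com/vishvananda/netlink"
	"golang.org/x/sys/unix"
)

func main() {
	_, dst, _ := net.ParseCIDR("10.96.0.0/16")
	route := &netlink.Route{
		LinkIndex: 12,                         // Ifindex from the log
		Dst:       dst,                        // service subnet
		Src:       net.ParseIP("169.254.0.2"), // Src from the log
		Gw:        net.ParseIP("169.254.0.4"), // Gw from the log
		Table:     unix.RT_TABLE_MAIN,         // table 254, as in the log
	}

	if err := netlink.RouteAdd(route); err != nil {
		// A leftover route from the previous pod instance makes the
		// kernel answer with EEXIST, which prints as "file exists".
		if errors.Is(err, unix.EEXIST) {
			// RouteReplace sends the request with NLM_F_REPLACE and
			// succeeds whether or not the route already exists.
			if err := netlink.RouteReplace(route); err != nil {
				fmt.Println("replace failed:", err)
			}
			return
		}
		fmt.Println("add failed:", err)
	}
}
```

RouteReplace is one way to make the add idempotent; cleaning up stale routes during startup would be another.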

What did you expect to happen?

The new ovnkube-node pod should come up without issues.

How can we reproduce it (as minimally and precisely as possible)?

Described above.

Anything else we need to know?

No response

OVN-Kubernetes version

```console
$ ovnkube --version
# paste output here
```

Kubernetes version

```console
$ kubectl version
# paste output here
```

OVN version

```console
$ oc rsh -n ovn-kubernetes ovnkube-node-xxxxx  # pick any ovnkube-node pod on your cluster
$ rpm -q ovn
# paste output here
```

OVS version

```console
$ oc rsh -n ovn-kubernetes ovs-node-xxxxx  # pick any ovs pod on your cluster
$ rpm -q openvswitch
# paste output here
```

Platform

Is it baremetal? GCP? AWS? Azure?

OS version

```console
# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here
```

Install tools

Container runtime (CRI) and version (if applicable)

tssurya commented 1 month ago

Looks like it might be related to recent changes that were done. We should fix this before getting the ds merge done.

trozet commented 1 month ago

It looks to me like this is fixed by https://github.com/ovn-org/ovn-kubernetes/pull/4652

I deleted ovnk pods multiple times and am not seeing the issue. Feel free to reopen if it happens again.

dceara commented 1 month ago

I just tried on master:

# git log
commit 24108b821289b9b7ae410a9dffee8b1fcabbb24a (HEAD -> master, origin/master, origin/HEAD)
Merge: 1179e4d58 9baca6621
Author: Tim Rozet <trozet@redhat.com>
Date:   Tue Aug 27 12:04:19 2024 -0400

    Merge pull request #4652 from trozet/serialize_NAD_startup

    Serializes Network Manager Start up

And I get the same crash.

I started kind with:

./kind.sh -ds -ic -mne -nse

Then I deleted the ovnkube-node pod corresponding to ovn-worker:

# ovnk=$(oc get pod -n ovn-kubernetes -o wide | grep ovnkube-node | grep 'ovn-worker ' | awk '{print $1}')
# oc delete pod -n ovn-kubernetes $ovnk

ovnkube fails in the same way:

F0827 18:41:40.589014    3240 ovnkube.go:137] failed to run ovnkube: failed to start node network controller: failed to start default node network controller: unable to add gateway IP route for subnet: 10.96.0.0/16, route manager: failed to add route ({Ifindex: 9 Dst: 10.96.0.0/16 Src: 169.254.0.2 Gw: 169.254.0.4 Flags: [] Table: 0 Realm: 0}): failed to apply route ({Ifindex: 9 Dst: 10.96.0.0/16 Src: 169.254.0.2 Gw: 169.254.0.4 Flags: [] Table: 254 Realm: 0}): failed to add route (gw: 169.254.0.4, subnet 10.96.0.0/16, mtu 1400, src IP 169.254.0.2): file exists

I'm not sure it's relevant, but I'm using podman on that machine.
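As a hypothetical way to confirm the stale state (commands assumed, not from the thread), one could inspect the node's main routing table for the service subnet before the new pod starts:

```console
$ ip route show table main | grep 10.96.0.0/16
# an entry surviving from the previous pod instance would explain the
# EEXIST ("file exists") on restart
```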

martinkennelly commented 1 month ago

I couldn't replicate the failure on main. Using docker.

dceara commented 1 month ago

> I couldn't replicate the failure on main. Using docker.

I couldn't replicate the failure on main with docker either. Originally I was using podman; I'll try again.

dceara commented 1 month ago

@martinkennelly I moved back to using podman (just removed docker and installed podman and podman-docker) and now I get the same crash loop when deleting the ovnkube-node pod.