kubeovn / kube-ovn

A Bridge between SDN and Cloud Native (Project under CNCF)
https://kubeovn.github.io/docs/stable/en/
Apache License 2.0

[BUG] gonetworkmanager.NewNetworkManager() gets stuck on boot #4643

Open kvaps opened 5 days ago

kvaps commented 5 days ago

Kube-OVN Version

v1.12.19

Kubernetes Version

v1.30.3

Operation-system/Kernel Version

OS-IMAGE         KERNEL-VERSION
Talos (v1.8.1)   6.6.54-talos

Description

Hey, I ran into an issue where kube-ovn gets stuck on boot. I spent a few hours debugging it and wanted to report what I found.

It gets stuck when calling this function: https://github.com/kubeovn/kube-ovn/blob/c414e6b967a3b985365f2e546b2afbd51d258f0c/pkg/daemon/nm_linux.go#L34

Steps To Reproduce

This may occur on Talos Linux.

Current Behavior

The kube-ovn-cni pods never become ready because the readiness probe fails. The logs show:

I1021 21:17:51.672195    7025 ovs_linux.go:591] wait ovn0 gw ready
I1021 21:17:51.675748    7025 ovs.go:50] 100.64.0.3 network ready after 1 ping, gw 100.64.0.1

Expected Behavior

I expect to see this message:

I1021 21:17:51.788534    7025 cniserver.go:83] start daemon controller

and the kube-ovn-cni pods to become ready.

dosubot[bot] commented 5 days ago

Hey @kvaps! I'm here to help you with any bugs, questions, or contributions you have. Let's tackle this issue together!

The issue you're experiencing with gonetworkmanager.NewNetworkManager() getting stuck during boot on Talos Linux could be related to dependencies such as the DBus connection or the NetworkManager service. The function relies on a DBus connection to communicate with the NetworkManager service, and if there are issues with the DBus service or if the NetworkManager service is not running, it could cause the function to get stuck [1].

Additionally, the readiness probe for kube-ovn-cni pods checks the operational state of the ovn0 network interface and network connectivity to the gateway. If these checks fail, the pods will not reach a ready state [2].

To address this, ensure that the NetworkManager service is running and that there are no issues with the DBus connection on your Talos Linux setup. You might also want to verify the configuration specific to Talos Linux, such as declaring the openvswitch module and using the correct Helm chart options [3].


zhangzujian commented 3 days ago

Could you please provide the Talos OS image?

kvaps commented 2 days ago

Sure, we tested it with: