k8snetworkplumbingwg / sriov-cni

DPDK & SR-IOV CNI plugin
Apache License 2.0
307 stars 146 forks source link

pod start failed when node reboot #178

Closed arikachen closed 3 years ago

arikachen commented 3 years ago

What happened?

When node restart, pod is Error status. CNIADD failed with netdevice not exist and CNIDEL failed too. Events:

Normal  SandboxChanged  79s (x1057 over 3h56m)  kubelet  Pod sandbox changed, it will be killed and re-created.

What did you expect to happen?

pod will start success.

What are the minimal steps needed to reproduce the bug?

sriov with mlnx card reboot node

Component Versions

Please fill in the below table with the version numbers of applicable components used.

Component Version
SR-IOV CNI Plugin 4.9
Multus stable
SR-IOV Network Device Plugin 4.9
Kubernetes v1.19.8
OS 4.19.0-14-amd64 #1 SMP Debian 4.19.171-2 (2021-01-30) x86_64 GNU/Linux

Logs

SR-IOV Network Config Daemon Logs
E0412 03:17:29.008268    4361 daemon.go:824] tryEnableRdma(): fail to enable rdma exit status 1:
I0412 03:17:29.041798    4361 daemon.go:434] Set log verbose level to: 2
I0412 03:17:29.054061    4361 daemon.go:422] nodeUpdateHandler(): no other node is draining
I0412 03:17:29.054088    4361 daemon.go:422] nodeUpdateHandler(): no other node is draining
I0412 03:17:29.054113    4361 daemon.go:422] nodeUpdateHandler(): no other node is draining
I0412 03:17:29.640623    4361 daemon.go:422] nodeUpdateHandler(): no other node is draining
I0412 03:17:33.301952    4361 daemon.go:422] nodeUpdateHandler(): no other node is draining
I0412 03:17:33.315884    4361 daemon.go:422] nodeUpdateHandler(): no other node is draining
I0412 03:17:34.135984    4361 daemon.go:309] Starting workers
I0412 03:17:34.136015    4361 daemon.go:312] Started workers
I0412 03:17:34.136035    4361 daemon.go:352] worker queue size: 1
I0412 03:17:34.136043    4361 daemon.go:354] get item: 11
I0412 03:17:34.136056    4361 daemon.go:446] nodeStateSyncHandler(): new generation is 11
I0412 03:17:34.142049    4361 daemon.go:808] registerPlugins(): [intel_plugin mellanox_plugin]
I0412 03:17:34.142073    4361 daemon.go:666] loadVendorPlugins(): try to load plugin intel_plugin
I0412 03:17:34.142100    4361 plugin.go:39] loadPlugin(): load plugin from /plugins/intel_plugin.so
I0412 03:17:34.142129    4361 writer.go:61] Run(): refresh trigger
I0412 03:17:34.142160    4361 writer.go:80] pollNicStatus()
I0412 03:17:34.142193    4361 utils.go:49] DiscoverSriovDevices
I0412 03:17:34.177095    4361 utils.go:421] getNetdevMTU(): get MTU for device 0000:01:00.0
I0412 03:17:34.177149    4361 utils.go:416] tryGetInterfaceName(): name is eth4
I0412 03:17:34.177262    4361 utils.go:416] tryGetInterfaceName(): name is eth4
I0412 03:17:34.177269    4361 utils.go:442] getNetDevMac(): get Mac for device eth4
I0412 03:17:34.177291    4361 utils.go:454] getNetDevLinkSpeed(): get LinkSpeed for device eth4
W0412 03:17:34.177313    4361 utils.go:458] getNetDevLinkSpeed(): fail to read Link Speed file /sys/class/net/eth4/speed
I0412 03:17:34.177319    4361 utils.go:614] getLinkType(): Device 0000:01:00.0
I0412 03:17:34.177517    4361 utils.go:681] GetNicSriovMode(): device 0000:01:00.0
W0412 03:17:34.177585    4361 utils.go:116] DiscoverSriovDevices(): unable to get device mode 0000:01:00.0 "no such device"
I0412 03:17:34.177661    4361 utils.go:421] getNetdevMTU(): get MTU for device 0000:01:00.1
I0412 03:17:34.177703    4361 utils.go:416] tryGetInterfaceName(): name is eth5
I0412 03:17:34.177760    4361 utils.go:416] tryGetInterfaceName(): name is eth5
I0412 03:17:34.177765    4361 utils.go:442] getNetDevMac(): get Mac for device eth5
I0412 03:17:34.177789    4361 utils.go:454] getNetDevLinkSpeed(): get LinkSpeed for device eth5
W0412 03:17:34.177817    4361 utils.go:458] getNetDevLinkSpeed(): fail to read Link Speed file /sys/class/net/eth5/speed
I0412 03:17:34.177822    4361 utils.go:614] getLinkType(): Device 0000:01:00.1
I0412 03:17:34.177929    4361 utils.go:681] GetNicSriovMode(): device 0000:01:00.1
W0412 03:17:34.177981    4361 utils.go:116] DiscoverSriovDevices(): unable to get device mode 0000:01:00.1 "no such device"
I0412 03:17:34.178041    4361 utils.go:421] getNetdevMTU(): get MTU for device 0000:03:00.0
I0412 03:17:34.178089    4361 utils.go:416] tryGetInterfaceName(): name is eth0
I0412 03:17:34.178155    4361 utils.go:416] tryGetInterfaceName(): name is eth0
I0412 03:17:34.178160    4361 utils.go:442] getNetDevMac(): get Mac for device eth0
I0412 03:17:34.178183    4361 utils.go:454] getNetDevLinkSpeed(): get LinkSpeed for device eth0
W0412 03:17:34.178205    4361 utils.go:458] getNetDevLinkSpeed(): fail to read Link Speed file /sys/class/net/eth0/speed
Kubelet logs (journalctl -u kubelet)
Apr 12 11:17:30 pubt2-k8s75.dg.163.org kubelet[2560]: E0412 11:17:30.473164    2560 cni.go:366] Error adding default_fortioserver-6b66b7bbd8-6zklm/b5310cd203d1ff907f1dd7b3ef58bc28b0cb1495828017903125
5497bd213dc4 to network multus/multus-cni-network: [default/fortioserver-6b66b7bbd8-6zklm:sriov-network-static]: error adding container to network "sriov-network-static": SRIOV-CNI failed to load net
conf: LoadConf(): failed to get VF information: "lstat /sys/bus/pci/devices/0000:03:01.6/physfn/net: no such file or directory"
Apr 12 11:17:30 pubt2-k8s75.dg.163.org kubelet[2560]: E0412 11:17:30.676484    2560 cni.go:387] Error deleting default_fortioserver-6b66b7bbd8-6zklm/b5310cd203d1ff907f1dd7b3ef58bc28b0cb14958280179031
255497bd213dc4 from network multus/multus-cni-network: delegateDel: error invoking DelegateDel - "sriov": error in getting result from DelNetwork: error reading cached NetConf in /var/lib/cni/sriov w
ith name b5310cd203d1ff907f1dd7b3ef58bc28b0cb14958280179031255497bd213dc4-net1
Apr 12 11:17:30 pubt2-k8s75.dg.163.org kubelet[2560]: E0412 11:17:30.870750    2560 remote_runtime.go:113] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = [failed to set u
p sandbox container "b5310cd203d1ff907f1dd7b3ef58bc28b0cb14958280179031255497bd213dc4" network for pod "fortioserver-6b66b7bbd8-6zklm": networkPlugin cni failed to set up pod "fortioserver-6b66b7bbd8
-6zklm_default" network: [default/fortioserver-6b66b7bbd8-6zklm:sriov-network-static]: error adding container to network "sriov-network-static": SRIOV-CNI failed to load netconf: LoadConf(): failed t
o get VF information: "lstat /sys/bus/pci/devices/0000:03:01.6/physfn/net: no such file or directory", failed to clean up sandbox container "b5310cd203d1ff907f1dd7b3ef58bc28b0cb14958280179031255497bd
213dc4" network for pod "fortioserver-6b66b7bbd8-6zklm": networkPlugin cni failed to teardown pod "fortioserver-6b66b7bbd8-6zklm_default" network: delegateDel: error invoking DelegateDel - "sriov": e
rror in getting result from DelNetwork: error reading cached NetConf in /var/lib/cni/sriov with name b5310cd203d1ff907f1dd7b3ef58bc28b0cb14958280179031255497bd213dc4-net1]
Apr 12 11:17:30 pubt2-k8s75.dg.163.org kubelet[2560]: E0412 11:17:30.870814    2560 kuberuntime_sandbox.go:69] CreatePodSandbox for pod "fortioserver-6b66b7bbd8-6zklm_default(3056ffc7-a2c5-4172-ac96-
b141a9b74586)" failed: rpc error: code = Unknown desc = [failed to set up sandbox container "b5310cd203d1ff907f1dd7b3ef58bc28b0cb14958280179031255497bd213dc4" network for pod "fortioserver-6b66b7bbd8
-6zklm": networkPlugin cni failed to set up pod "fortioserver-6b66b7bbd8-6zklm_default" network: [default/fortioserver-6b66b7bbd8-6zklm:sriov-network-static]: error adding container to network "sriov
-network-static": SRIOV-CNI failed to load netconf: LoadConf(): failed to get VF information: "lstat /sys/bus/pci/devices/0000:03:01.6/physfn/net: no such file or directory", failed to clean up sandb
ox container "b5310cd203d1ff907f1dd7b3ef58bc28b0cb14958280179031255497bd213dc4" network for pod "fortioserver-6b66b7bbd8-6zklm": networkPlugin cni failed to teardown pod "fortioserver-6b66b7bbd8-6zkl
m_default" network: delegateDel: error invoking DelegateDel - "sriov": error in getting result from DelNetwork: error reading cached NetConf in /var/lib/cni/sriov with name b5310cd203d1ff907f1dd7b3ef
58bc28b0cb14958280179031255497bd213dc4-net1]
Apr 12 11:17:30 pubt2-k8s75.dg.163.org kubelet[2560]: E0412 11:17:30.870826    2560 kuberuntime_manager.go:741] createPodSandbox for pod "fortioserver-6b66b7bbd8-6zklm_default(3056ffc7-a2c5-4172-ac96
-b141a9b74586)" failed: rpc error: code = Unknown desc = [failed to set up sandbox container "b5310cd203d1ff907f1dd7b3ef58bc28b0cb14958280179031255497bd213dc4" network for pod "fortioserver-6b66b7bbd
8-6zklm": networkPlugin cni failed to set up pod "fortioserver-6b66b7bbd8-6zklm_default" network: [default/fortioserver-6b66b7bbd8-6zklm:sriov-network-static]: error adding container to network "srio
v-network-static": SRIOV-CNI failed to load netconf: LoadConf(): failed to get VF information: "lstat /sys/bus/pci/devices/0000:03:01.6/physfn/net: no such file or directory", failed to clean up sand
box container "b5310cd203d1ff907f1dd7b3ef58bc28b0cb14958280179031255497bd213dc4" network for pod "fortioserver-6b66b7bbd8-6zklm": networkPlugin cni failed to teardown pod "fortioserver-6b66b7bbd8-6zk
lm_default" network: delegateDel: error invoking DelegateDel - "sriov": error in getting result from DelNetwork: error reading cached NetConf in /var/lib/cni/sriov with name b5310cd203d1ff907f1dd7b3e
f58bc28b0cb14958280179031255497bd213dc4-net1]
Apr 12 11:17:30 pubt2-k8s75.dg.163.org kubelet[2560]: E0412 11:17:30.870888    2560 pod_workers.go:191] Error syncing pod 3056ffc7-a2c5-4172-ac96-b141a9b74586 ("fortioserver-6b66b7bbd8-6zklm_default(
3056ffc7-a2c5-4172-ac96-b141a9b74586)"), skipping: failed to "CreatePodSandbox" for "fortioserver-6b66b7bbd8-6zklm_default(3056ffc7-a2c5-4172-ac96-b141a9b74586)" with CreatePodSandboxError: "CreatePo
dSandbox for pod \"fortioserver-6b66b7bbd8-6zklm_default(3056ffc7-a2c5-4172-ac96-b141a9b74586)\" failed: rpc error: code = Unknown desc = [failed to set up sandbox container \"b5310cd203d1ff907f1dd7b
3ef58bc28b0cb14958280179031255497bd213dc4\" network for pod \"fortioserver-6b66b7bbd8-6zklm\": networkPlugin cni failed to set up pod \"fortioserver-6b66b7bbd8-6zklm_default\" network: [default/forti
oserver-6b66b7bbd8-6zklm:sriov-network-static]: error adding container to network \"sriov-network-static\": SRIOV-CNI failed to load netconf: LoadConf(): failed to get VF information: \"lstat /sys/bu
s/pci/devices/0000:03:01.6/physfn/net: no such file or directory\", failed to clean up sandbox container \"b5310cd203d1ff907f1dd7b3ef58bc28b0cb14958280179031255497bd213dc4\" network for pod \"fortios
erver-6b66b7bbd8-6zklm\": networkPlugin cni failed to teardown pod \"fortioserver-6b66b7bbd8-6zklm_default\" network: delegateDel: error invoking DelegateDel - \"sriov\": error in getting result from
 DelNetwork: error reading cached NetConf in /var/lib/cni/sriov with name b5310cd203d1ff907f1dd7b3ef58bc28b0cb14958280179031255497bd213dc4-net1]"
Apr 12 11:17:56 pubt2-k8s75.dg.163.org kubelet[2560]: W0412 11:17:56.059141    2560 docker_sandbox.go:402] failed to read pod IP from plugin/docker: networkPlugin cni failed on the status hook for po
d "fortioserver-6b66b7bbd8-6zklm_default": CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "b5310cd203d1ff907f1dd7b3ef58bc28b0cb1495828017903
1255497bd213dc4"
Apr 12 11:17:56 pubt2-k8s75.dg.163.org kubelet[2560]: W0412 11:17:56.066390    2560 pod_container_deletor.go:79] Container "b5310cd203d1ff907f1dd7b3ef58bc28b0cb14958280179031255497bd213dc4" not found
 in pod's containers
Apr 12 11:17:56 pubt2-k8s75.dg.163.org kubelet[2560]: W0412 11:17:56.068199    2560 cni.go:333] CNI failed to retrieve network namespace path: cannot find network namespace for the terminated contain
er "b5310cd203d1ff907f1dd7b3ef58bc28b0cb14958280179031255497bd213dc4"
Apr 12 11:17:56 pubt2-k8s75.dg.163.org kubelet[2560]: E0412 11:17:56.386410    2560 cni.go:387] Error deleting default_fortioserver-6b66b7bbd8-6zklm/b5310cd203d1ff907f1dd7b3ef58bc28b0cb14958280179031
255497bd213dc4 from network multus/multus-cni-network: delegateDel: error invoking DelegateDel - "sriov": error in getting result from DelNetwork: error reading cached NetConf in /var/lib/cni/sriov w
ith name b5310cd203d1ff907f1dd7b3ef58bc28b0cb14958280179031255497bd213dc4-net1
Apr 12 11:17:56 pubt2-k8s75.dg.163.org kubelet[2560]: E0412 11:17:56.386935    2560 remote_runtime.go:140] StopPodSandbox "b5310cd203d1ff907f1dd7b3ef58bc28b0cb14958280179031255497bd213dc4" from runti
me service failed: rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod "fortioserver-6b66b7bbd8-6zklm_default" network: delegateDel: error invoking DelegateDel - "sriov": error
in getting result from DelNetwork: error reading cached NetConf in /var/lib/cni/sriov with name b5310cd203d1ff907f1dd7b3ef58bc28b0cb14958280179031255497bd213dc4-net1
Apr 12 11:17:56 pubt2-k8s75.dg.163.org kubelet[2560]: E0412 11:17:56.386977    2560 kuberuntime_manager.go:909] Failed to stop sandbox {"docker" "b5310cd203d1ff907f1dd7b3ef58bc28b0cb14958280179031255
497bd213dc4"}
Apr 12 11:17:56 pubt2-k8s75.dg.163.org kubelet[2560]: E0412 11:17:56.388154    2560 kuberuntime_manager.go:688] killPodWithSyncResult failed: failed to "KillPodSandbox" for "3056ffc7-a2c5-4172-ac96-b
141a9b74586" with KillPodSandboxError: "rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod \"fortioserver-6b66b7bbd8-6zklm_default\" network: delegateDel: error invoking Delega
teDel - \"sriov\": error in getting result from DelNetwork: error reading cached NetConf in /var/lib/cni/sriov with name b5310cd203d1ff907f1dd7b3ef58bc28b0cb14958280179031255497bd213dc4-net1"
Apr 12 11:17:56 pubt2-k8s75.dg.163.org kubelet[2560]: E0412 11:17:56.388184    2560 pod_workers.go:191] Error syncing pod 3056ffc7-a2c5-4172-ac96-b141a9b74586 ("fortioserver-6b66b7bbd8-6zklm_default(
3056ffc7-a2c5-4172-ac96-b141a9b74586)"), skipping: failed to "KillPodSandbox" for "3056ffc7-a2c5-4172-ac96-b141a9b74586" with KillPodSandboxError: "rpc error: code = Unknown desc = networkPlugin cni
failed to teardown pod \"fortioserver-6b66b7bbd8-6zklm_default\" network: delegateDel: error invoking DelegateDel - \"sriov\": error in getting result from DelNetwork: error reading cached NetConf in
 /var/lib/cni/sriov with name b5310cd203d1ff907f1dd7b3ef58bc28b0cb14958280179031255497bd213dc4-net1"
Apr 12 11:17:57 pubt2-k8s75.dg.163.org kubelet[2560]: W0412 11:17:57.260159    2560 cni.go:333] CNI failed to retrieve network namespace path: cannot find network namespace for the terminated contain
er "b5310cd203d1ff907f1dd7b3ef58bc28b0cb14958280179031255497bd213dc4"
repeat
zshi-redhat commented 3 years ago

I think the issue is when kubelet tries to restart pod after node rebooting, the sriov device is not ready (created yet).

I saw the log from sriov-config-daemon, so assuming sriov-operator is used. In which case, sriov-config-daemon is responsible for creating sriov numVfs, but sriov-config-daemon only starts after kubelet becomes ready, so there is chance that when kubelet starts to resume sriov pods, sriov-confg-daemon has not yet provisioned sriov numVfs.

I think the fix is to use systemd service for provisioning numVfs which could happen before kubelet becomes ready.

adrianchiris commented 3 years ago

I think the fix is to use systemd service for provisioning numVfs which could happen before kubelet becomes ready.

+1

adrianchiris commented 3 years ago

I do not think this issue is related to SR-IOV CNI as @zshi-redhat mentioned.

@arikachen you mind opening an issue in sriov-network-operator so it can be properly tracked ?

arikachen commented 3 years ago

Thanks for your suggestion. I will retest later.