Closed: arikachen closed this issue 3 years ago
I think the issue is that when kubelet tries to restart the pod after the node reboots, the SR-IOV device is not ready (not created yet).
I saw the log from sriov-config-daemon, so I am assuming sriov-operator is used. In that case, sriov-config-daemon is responsible for creating the SR-IOV numVfs, but sriov-config-daemon only starts after kubelet becomes ready, so there is a chance that when kubelet starts to resume SR-IOV pods, sriov-config-daemon has not yet provisioned the numVfs.
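One way to check for this race on an affected node is to read the VF counters that the kernel exposes in sysfs right after boot. The sketch below is only an illustration, not part of sriov-network-operator: `vf_status` is a hypothetical helper name, and the `sysfs_root` parameter exists only so the function can be pointed at a test directory.

```python
from pathlib import Path


def vf_status(iface: str, sysfs_root: str = "/sys/class/net") -> dict:
    """Report how many SR-IOV VFs are currently created on a PF.

    Hypothetical helper: reads the standard kernel sysfs attributes
    sriov_numvfs (VFs currently created) and sriov_totalvfs (maximum
    the device supports) for the given physical function interface.
    """
    dev = Path(sysfs_root) / iface / "device"
    current = int((dev / "sriov_numvfs").read_text().strip())
    total = int((dev / "sriov_totalvfs").read_text().strip())
    # If current is 0 after a reboot, the VFs have not been provisioned yet,
    # so any pod that expects a VF netdevice would fail at CNI ADD.
    return {"current": current, "total": total, "provisioned": current > 0}
```

If this reports `provisioned: False` at the moment kubelet starts resuming pods, that would be consistent with the ordering problem described above.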
I think the fix is to use a systemd service to provision numVfs, which could run before kubelet becomes ready.
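A rough sketch of what such a service could run is shown here. It is an assumption about one possible approach, not the operator's actual implementation: `set_num_vfs` is a hypothetical helper that writes the kernel's `sriov_numvfs` sysfs attribute, and the idea would be to call it from a oneshot systemd unit ordered with `Before=kubelet.service`.

```python
from pathlib import Path


def set_num_vfs(iface: str, num_vfs: int, sysfs_root: str = "/sys/class/net") -> bool:
    """Create num_vfs SR-IOV VFs on the PF `iface` via sysfs.

    Hypothetical helper. Returns True if the VF count was changed,
    False if it already matched. sysfs_root is overridable for testing.
    """
    dev = Path(sysfs_root) / iface / "device"
    numvfs_file = dev / "sriov_numvfs"
    total = int((dev / "sriov_totalvfs").read_text().strip())
    if not 0 <= num_vfs <= total:
        raise ValueError(f"num_vfs must be between 0 and {total}, got {num_vfs}")

    current = int(numvfs_file.read_text().strip())
    if current == num_vfs:
        return False  # already provisioned, nothing to do

    # The kernel does not allow going from one nonzero VF count directly
    # to another, so reset to 0 first when VFs already exist.
    if current != 0 and num_vfs != 0:
        numvfs_file.write_text("0")
    numvfs_file.write_text(str(num_vfs))
    return True
```

For example, `set_num_vfs("ens1f0", 4)` would create four VFs, where `ens1f0` is a placeholder interface name. Writing to the real sysfs path requires root.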
+1
As @zshi-redhat mentioned, I do not think this issue is related to SR-IOV CNI.
@arikachen would you mind opening an issue in sriov-network-operator so it can be properly tracked?
Thanks for your suggestion. I will retest later.
What happened?
When the node restarts, the pod ends up in Error status. CNI ADD fails because the netdevice does not exist, and CNI DEL fails too. Events:
What did you expect to happen?
The pod should start successfully.
What are the minimal steps needed to reproduce the bug?
Use SR-IOV with a Mellanox (mlnx) card, then reboot the node.
Component Versions
Please fill in the table below with the version numbers of the applicable components used.
Logs
SR-IOV Network Config Daemon Logs
Kubelet logs (journalctl -u kubelet)