Closed joelcheison closed 6 months ago
Hello @joelcheison, thanks for reporting this. CNI DEL should never fail, so the current behavior is clearly wrong. Since you were already able to pinpoint the source of the problem, would you be willing to post a fix?
Should be fixed by https://github.com/k8snetworkplumbingwg/ovs-cni/pull/305
Thanks @ykulazhenkov! Closing this
Issue: On IPAM failure while using OVS-CNI, the pod goes into a delete loop [Stuck in either ContainerCreating or Init phase]. On closer inspection, the cause is the fact that cmdDel keeps failing and the CRI retries the delete, which ends up in a loop. There is no other way than to force delete the pod.
Logs: Warning FailedKillPod 3m47s (x38 over 24m) kubelet error killing pod: failed to "KillPodSandbox" for "c5c138c8-3f9c-4b99-a9b2-29e8a5e19158" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to destroy network for sandbox \"2b7fd1a2d31f0568b8d11866b1d0274248210c4b429914a6a1ea02ab362a4d11\": plugin type=\"multus\" name=\"multus-cni-network\" failed (delete): delegateDel: error invoking DelegateDel - \"ovs\": error in getting result from DelNetwork: Failed to obtain OVS port for given connection: failed to find object from table Port"
What is happening: During CmdAdd, IPAM fails and the defer function to clean up ports is invoked which deletes the OVS port. The sandbox will be killed and recreated. During the CmdDel, the port cleanup function tries to find the port for cleanup. The port is not found[as it was deleted by the defer function in CmdAdd]. The issue arises at this point. The function that tries to find the port has 3 returns: port, portfound and error. The function however returns error when port is not found, it should only return portfound as false and error as nil. Error is to be returned only when the function was not able to determine the existence of the port.
There is no scenario where the " func (ovsd *OvsDriver) GetOvsPortForContIface(contIface, contNetnsPath string) (string, bool, error) " function returns ["", false,nil], to indicate port not found [source: https://github.com/k8snetworkplumbingwg/ovs-cni/blob/main/pkg/ovsdb/ovsdb.go]
The following code for CmdDel in [https://github.com/k8snetworkplumbingwg/ovs-cni/blob/main/pkg/plugin/plugin.go] expects otherwise
Returns error above
Due to this behavior of "func (ovsd *OvsDriver) GetOvsPortForContIface", if the port does not exist during CmdDel[one way is the defer function in CmdAdd deleting the port], CmdDel fails. The CRI will iteratively try to Kill the sandbox, each attempt calls the CmdDel which fails and this causes a delete loop.