k8snetworkplumbingwg / ovs-cni

Open vSwitch CNI plugin
Apache License 2.0
224 stars 71 forks source link

IPAM failure causes Delete Loop after the introduction of OVS-Port Cleanup on IPAM failure in ADD workflow #276

Closed joelcheison closed 6 months ago

joelcheison commented 1 year ago

Issue: On IPAM failure while using OVS-CNI, the pod goes into a delete loop [Stuck in either ContainerCreating or Init phase]. On closer inspection, the cause is the fact that cmdDel keeps failing and the CRI retries the delete, which ends up in a loop. There is no other way than to force delete the pod.

Logs: Warning FailedKillPod 3m47s (x38 over 24m) kubelet error killing pod: failed to "KillPodSandbox" for "c5c138c8-3f9c-4b99-a9b2-29e8a5e19158" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to destroy network for sandbox \"2b7fd1a2d31f0568b8d11866b1d0274248210c4b429914a6a1ea02ab362a4d11\": plugin type=\"multus\" name=\"multus-cni-network\" failed (delete): delegateDel: error invoking DelegateDel - \"ovs\": error in getting result from DelNetwork: Failed to obtain OVS port for given connection: failed to find object from table Port"

What is happening: During CmdAdd, IPAM fails and the defer function to clean up ports is invoked which deletes the OVS port. The sandbox will be killed and recreated. During the CmdDel, the port cleanup function tries to find the port for cleanup. The port is not found[as it was deleted by the defer function in CmdAdd]. The issue arises at this point. The function that tries to find the port has 3 returns: port, portfound and error. The function however returns error when port is not found, it should only return portfound as false and error as nil. Error is to be returned only when the function was not able to determine the existence of the port.

There is no scenario where the " func (ovsd *OvsDriver) GetOvsPortForContIface(contIface, contNetnsPath string) (string, bool, error) " function returns ["", false,nil], to indicate port not found [source: https://github.com/k8snetworkplumbingwg/ovs-cni/blob/main/pkg/ovsdb/ovsdb.go]

The following code for CmdDel in [https://github.com/k8snetworkplumbingwg/ovs-cni/blob/main/pkg/plugin/plugin.go] expects otherwise

    portName, portFound, err := getOvsPortForContIface(ovsDriver, args.IfName, args.Netns)
    if err != nil {
        return fmt.Errorf("Failed to obtain OVS port for given connection: %v", err)
    }

Returns error above

    // Do not return an error if the port was not found, it may have been
    // already removed by someone
    if portFound {
        if err := removeOvsPort(ovsDriver, portName); err != nil {
            return err
        }
    }

Due to this behavior of "func (ovsd *OvsDriver) GetOvsPortForContIface", if the port does not exist during CmdDel[one way is the defer function in CmdAdd deleting the port], CmdDel fails. The CRI will iteratively try to Kill the sandbox, each attempt calls the CmdDel which fails and this causes a delete loop.

phoracek commented 1 year ago

Hello @joelcheison, thanks for reporting this. CNI DEL should never fail, so the current behavior is clearly wrong. Since you were already able to pinpoint the source of the problem, would you be willing to post a fix?

ykulazhenkov commented 6 months ago

Should be fixed by https://github.com/k8snetworkplumbingwg/ovs-cni/pull/305

phoracek commented 6 months ago

Thanks @ykulazhenkov! Closing this