Closed nayihz closed 2 months ago
Mostly -- I think during cmdDel we want to ignore errors so that the pod can be deleted by the API. Otherwise, we retry on delete, and then the pod can hang around and crashloop. So, with CNI mentality in mind, we're very sensitive about successes on ADD but on DEL, we're very lenient with letting CNI DELs fail so that pods aren't stuck in a crashloop
we're very lenient with letting CNI DELs fail so that pods aren't stuck in a crashloop
But that leads to resource leaks (such as IP leaks) if cmdDel fails. Kubelet could retry deleting the sandbox if it knew that cmdDel had failed.
The CNI spec says: https://github.com/containernetworking/cni/blob/main/SPEC.md#del-remove-container-from-network-or-un-apply-modifications
Plugins should generally complete a DEL action without error even if some resources are missing.
So while you have a point that sometimes there will be resources left behind, this is the suggested behavior for CNI plugins on a CNI DEL
If we don't use multus, the pod gets stuck after cmdDel fails. But with multus we get completely different behavior, which is confusing for users.
I think there should be a field in multus' CNI config to describe whether to tolerate errors returned by CNI Del. Giving control of the process to the user helps improve the usability of multus
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.
What happened: In https://github.com/k8snetworkplumbingwg/multus-cni/pull/1084, this PR ignores the common errors raised by CNI. I don't think this is what we expected. For example, if a custom CNI's cmdDel returns an error, multus-shim will also only log the error and then ignore it. So containerd believes the deletion succeeded, and we see "TearDown network for sandbox xxx successfully" even though cmdDel actually failed.
Here are some containerd logs:
What you expected to happen: multus should wrap and return the error raised by CNI, so that kubelet knows cmdDel failed and can prevent the pod from being deleted.
How to reproduce it (as minimally and precisely as possible): write a fake CNI whose cmdDel always returns an error.
Anything else we need to know?:
Environment:
kubectl version
kubectl get net-attach-def -o yaml
kubectl get pod <podname> -o yaml