Open SchSeba opened 5 months ago
[root@virtual-worker-0 centos]# ps -ef | grep 942
root 942 1 5 17:07 ? 00:00:00 /usr/bin/crio
root 1246 942 0 17:07 ? 00:00:00 /opt/cni/bin/multus-shim
root 2745 2395 0 17:08 pts/0 00:00:00 grep --color=auto 942
from crio:
from CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (delete): netplugin failed with no error message: signal: killed
Just an update: adding -f to the copy command looks like it fixes the issue.
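For anyone wondering why -f matters here: on Linux, opening a running executable for writing fails with ETXTBSY, while cp -f removes the destination and retries, creating a fresh inode. A minimal demo of both behaviors, using /bin/sleep and /tmp paths as stand-ins (these are illustrative, not the real multus paths):

```shell
# Stand-in for a busy CNI binary: copy a real executable and keep it running.
cp /bin/sleep /tmp/busy-bin
/tmp/busy-bin 30 &
pid=$!
sleep 1   # give the child time to exec, so the text segment is in use

# Plain cp opens the destination for writing and hits ETXTBSY.
if cp /bin/sleep /tmp/busy-bin 2>/dev/null; then
  echo "unexpected: plain cp succeeded"
else
  echo "plain cp failed (Text file busy)"
fi

# cp -f unlinks the destination when it cannot open it, then retries.
# The running process keeps the old (now anonymous) inode and is unaffected.
cp -f /bin/sleep /tmp/busy-bin && echo "cp -f succeeded"

kill "$pid"
```

This matches the symptom in the init container: a stuck multus-shim keeps the binary busy, so a plain cp over it fails.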
Coincidentally, we also saw this error crop up yesterday with one of our edge clusters after rebooting.
As an FYI, I see that different deployment YAMLs use different ways to copy the CNI binary in the init container: the first one [1] uses install_multus, which copies files in an atomic manner, while the latter [2] just uses cp. (install_multus supports both thick and thin plugin types.)
Although I'm not sure that copying the file atomically will solve the above issue.
Also, deployments/multus-daemonset-crio.yml does not use an init container at all.
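For context, the atomic pattern mentioned above is usually "copy to a temp file, then rename into place". A sketch in the spirit of install_multus (assumption: the real script differs in detail; SRC and DEST_DIR are illustrative stand-ins, not the actual multus paths):

```shell
SRC=/bin/sleep                 # stand-in for the shim binary in the image
DEST_DIR=/tmp/cni-bin          # stand-in for /opt/cni/bin
mkdir -p "$DEST_DIR"

# Copy into a temp file on the same filesystem first...
tmp=$(mktemp "$DEST_DIR/.multus-shim.XXXXXX")
cp "$SRC" "$tmp"
chmod 0755 "$tmp"

# ...then rename it into place. rename(2) atomically replaces the old
# directory entry, so readers never see a partially written binary.
mv -f "$tmp" "$DEST_DIR/multus-shim"
```

Incidentally, replacing via rename also sidesteps ETXTBSY: a running process keeps the old inode alive, while the new name points at a fresh inode, so the install never has to open the busy file for writing.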
This should hopefully be addressed with #1213
Saw this in minikube today. No rebooting, just starting up a new minikube cluster.
I also got a reproduction after rebooting a node and having multus restart.
I mitigated it by deleting /opt/cni/bin/multus-shim, but, yeah, I'll retest with the above patch.
[fedora@labkubedualhost-master-1 whereabouts]$ watch -n1 kubectl get pods -A -o wide
[fedora@labkubedualhost-master-1 whereabouts]$ kubectl apply -f https://raw.githubusercontent.com/k8snetworkplumbingwg/multus-cni/master/deployments/multus-daemonset-thick.yml
customresourcedefinition.apiextensions.k8s.io/network-attachment-definitions.k8s.cni.cncf.io created
clusterrole.rbac.authorization.k8s.io/multus created
clusterrolebinding.rbac.authorization.k8s.io/multus created
serviceaccount/multus created
configmap/multus-daemon-config created
daemonset.apps/kube-multus-ds created
[fedora@labkubedualhost-master-1 whereabouts]$ watch -n1 kubectl get pods -A -o wide
[fedora@labkubedualhost-master-1 whereabouts]$ kubectl logs kube-multus-ds-fzdcr -n kube-system
Defaulted container "kube-multus" out of: kube-multus, install-multus-binary (init)
Error from server (BadRequest): container "kube-multus" in pod "kube-multus-ds-fzdcr" is waiting to start: PodInitializing
Seems I can make this happen anytime I ungracefully restart a node, worker or master: it triggers this error and stops pod network sandbox recreation completely on that node.
The fix mentioned above does work, but it likely means a power outage of a node will require manual intervention, whereas without multus none would be required. This error should be handled properly.
+1. This seems like a pretty serious issue. Can we get a fix merged for it soon, please?
I can additionally confirm this behavior. As @dougbtv mentioned, removing /opt/cni/bin/multus-shim works as a workaround.
+1, happened to me as well; the cluster did not come up. Any chance of fixing this soon?
Same here, kubespray cluster, 1.29.
This certainly needs to be fixed right away.
Hi, it looks like there is an issue after a node reboot where a race in multus prevents the pod from starting.
The problem is mainly that, after a reboot, crio calls the multus-shim to start pods, but the multus pod is not able to start because its init container fails to cp the shim. The copy fails because crio has already invoked the shim, which is stuck waiting to communicate with the pod.
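One way for the init container to break that cycle is to unlink the destination before copying, so the stuck shim keeps its old inode while the copy creates a fresh one. A hypothetical sketch of that direction only, not the actual patch (SRC is a stand-in binary and DEST an illustrative path):

```shell
SRC=/bin/sleep                      # stand-in for the shim in the image
DEST=/tmp/cni-bin-2/multus-shim     # illustrative destination
mkdir -p "$(dirname "$DEST")"

# Unlink first: a shim process stuck waiting on its socket keeps the old
# inode alive, while the copy below writes a brand-new inode, so the
# open for writing can never fail with ETXTBSY.
rm -f "$DEST"
cp "$SRC" "$DEST"
chmod 0755 "$DEST"
```

The stuck shim still needs to time out or be killed eventually, but at least the multus pod can come up and unblock new sandbox creation.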