Closed: seastco closed this issue 3 months ago.
OK, still not totally sure about the root cause of the config being a mess, but I've learned more since raising this issue and can work around it.

First, the startup failures: Multus looks for 10-aws.conflist. Because this file takes a second to be created, and these daemonset pods are starting up at the same time on a new node, Multus will fail and restart. OK, that's fine. Red herring.

As for the 00-multus.conf result highlighted above: I changed resource requests == limits on the init container to give the Multus pod Guaranteed QoS and to stop it from being evicted. I also upgraded to v4.1.0 and set --cleanup-config-on-exit=true, so now 00-multus.conf is removed on pod teardown.

So again, not sure why the 00-multus.conf file isn't resilient to restarts, but if you're running into this issue, consider giving your pods Guaranteed QoS and/or setting --cleanup-config-on-exit=true (see the sketch below).
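For reference, a minimal sketch of the workaround, assuming the upstream thin-plugin daemonset layout. The daemonset, container, and image names, the resource values, and the args shown are placeholders/assumptions to illustrate requests == limits plus the cleanup flag; merge the idea into your actual manifest:

```yaml
# Sketch only: just the fields relevant to the workaround.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kube-multus-ds                   # assumed daemonset name
  namespace: kube-system
spec:
  template:
    spec:
      initContainers:
        - name: install-multus-binary    # assumed init container name
          image: ghcr.io/k8snetworkplumbingwg/multus-cni:v4.1.0
          # requests == limits on every container gives the pod Guaranteed QoS
          resources:
            requests:
              cpu: 10m
              memory: 15Mi
            limits:
              cpu: 10m
              memory: 15Mi
      containers:
        - name: kube-multus
          image: ghcr.io/k8snetworkplumbingwg/multus-cni:v4.1.0
          args:
            # other args from your manifest stay as-is
            - "--cleanup-config-on-exit=true"   # remove 00-multus.conf on pod teardown
          resources:
            requests:
              cpu: 100m
              memory: 50Mi
            limits:
              cpu: 100m
              memory: 50Mi
```

Note that Guaranteed QoS requires requests == limits on every container in the pod, init containers included, which is why the init container is the one that needed changing here.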
EDIT - I've dumped a lot of irrelevant info into this thread so I'm going to close this issue and create a new one about 00-multus.conf not being resilient to restarts.
What happened: Pod stuck in ContainerCreating. Events show a "failed to setup network for sandbox" loop.
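A quick way to see this state (pod and namespace names are placeholders):

```bash
# Pod sits in ContainerCreating; describe output shows the repeating
# "failed to setup network for sandbox" events
kubectl get pod <stuck-pod> -n <namespace>
kubectl describe pod <stuck-pod> -n <namespace>
```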
Manual resolution: Deleting the /etc/cni/net.d/00-multus.conf file and the /etc/cni/net.d/multus.d directory and restarting the daemonset pod resolves this issue for subsequent pods being scheduled (a sketch of those steps follows below). Existing pods finish creating but leave the networking in a bad state (e.g., separate from the above example, after resolving I've seen "interface pod6c270ef2f25 not found: route ip+net: no such network interface").
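A sketch of that manual cleanup, assuming shell access to the affected node and the upstream daemonset's labels and namespace (app=multus in kube-system); adjust the names to your deployment:

```bash
# On the affected node (SSH/SSM): remove the generated Multus config and its
# multus.d directory
sudo rm -f /etc/cni/net.d/00-multus.conf
sudo rm -rf /etc/cni/net.d/multus.d

# From a machine with kubectl access: delete the Multus pod on that node so the
# daemonset recreates it and regenerates the config (label, namespace, and node
# name are placeholders)
kubectl -n kube-system delete pod -l app=multus \
  --field-selector spec.nodeName=<node-name>
```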
How to reproduce it (as minimally and precisely as possible): Not sure how to reproduce, unfortunately. I believe it's a race condition happening less than 1% of the time. My EKS cluster has instances constantly scaling up and down throughout the day, but I've only seen this 2-3x in the past few months.
Possibly related? https://github.com/k8snetworkplumbingwg/multus-cni/issues/1221, though this isn't happening after a node reboot, and I'm not using the thick plugin.
Anything else we need to know?: Below is the /etc/cni/net.d/00-multus.conf on a bad node, which I suspect is wrong. delegates is nested within delegates, and all the top-level fields are repeated again. It's like a bad merge happened. This is not the same as what's on a working node. (Sorry, I trimmed out some of the config when sharing with my team and the node doesn't exist anymore, so this is all I have):
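To illustrate the shape described above (this is not the actual config from the node, just a minimal sketch of delegates nested within delegates with the top-level fields repeated; all field values are placeholders):

```json
{
  "cniVersion": "0.4.0",
  "name": "multus-cni-network",
  "type": "multus",
  "kubeconfig": "/etc/cni/net.d/multus.d/multus.kubeconfig",
  "delegates": [
    {
      "cniVersion": "0.4.0",
      "name": "multus-cni-network",
      "type": "multus",
      "kubeconfig": "/etc/cni/net.d/multus.d/multus.kubeconfig",
      "delegates": [
        {
          "name": "aws-cni",
          "plugins": [
            { "type": "aws-cni" }
          ]
        }
      ]
    }
  ]
}
```

With the auto-generated config you would normally expect a single layer: one multus object at the top with the primary CNI's conflist as its delegate, not another multus object wrapped inside.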
/etc/cni/net.d/00-multus.conf on a WORKING node:

Potentially another clue: it's the norm for multus daemonset pods to fail 2x on startup with:
Happens to almost every new pod:
Found an issue related to this: https://github.com/k8snetworkplumbingwg/multus-cni/issues/1092. I am not using OVN-Kind, but I am running ovn-kubernetes as a secondary CNI. Not sure why ovn-kubernetes would affect this.
Environment:
Kubernetes version (use kubectl version): v1.25.16-eks-3af4770