Closed: seastco closed this issue 3 months ago.
OK, still not totally sure about the root cause of the config being a mess, but I've learned more since raising this issue and can work around it.

First, the startup failures: Multus looks for 10-aws.conflist. Because this file takes a second to be created, and these daemonset pods are starting up at the same time on a new node, Multus will fail and restart. OK, that's fine. Red herring.

As for the 00-multus.conf result highlighted above: I changed resource requests == limits on the init container to give the Multus pod Guaranteed QoS and to stop it from being evicted. I also upgraded to v4.1.0 and set --cleanup-config-on-exit=true, so now 00-multus.conf is removed on pod teardown.

So again, not sure why the 00-multus.conf file isn't resilient to restarts, but if you're running into this issue, consider giving your pods Guaranteed QoS and/or setting --cleanup-config-on-exit=true (see the sketch below).
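For reference, a minimal sketch of the workaround, assuming the upstream thin-plugin daemonset layout. The daemonset, container, and image names, the resource values, and the args shown are placeholders/assumptions to illustrate requests == limits plus the cleanup flag; merge the idea into your actual manifest:

```yaml
# Sketch only: just the fields relevant to the workaround.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kube-multus-ds                   # assumed daemonset name
  namespace: kube-system
spec:
  template:
    spec:
      initContainers:
        - name: install-multus-binary    # assumed init container name
          image: ghcr.io/k8snetworkplumbingwg/multus-cni:v4.1.0
          # requests == limits on every container gives the pod Guaranteed QoS
          resources:
            requests:
              cpu: 10m
              memory: 15Mi
            limits:
              cpu: 10m
              memory: 15Mi
      containers:
        - name: kube-multus
          image: ghcr.io/k8snetworkplumbingwg/multus-cni:v4.1.0
          args:
            # other args from your manifest stay as-is
            - "--cleanup-config-on-exit=true"   # remove 00-multus.conf on pod teardown
          resources:
            requests:
              cpu: 100m
              memory: 50Mi
            limits:
              cpu: 100m
              memory: 50Mi
```

Note that Guaranteed QoS requires requests == limits on every container in the pod, init containers included, which is why the init container is the one that needed changing here.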
EDIT - I've dumped a lot of irrelevant info into this thread so I'm going to close this issue and create a new one about 00-multus.conf not being resilient to restarts.
What happened: Pod stuck in ContainerCreating. Events show a "failed to setup network for sandbox" loop.
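A quick way to see this state (pod and namespace names are placeholders):

```bash
# Pod sits in ContainerCreating; describe output shows the repeating
# "failed to setup network for sandbox" events
kubectl get pod <stuck-pod> -n <namespace>
kubectl describe pod <stuck-pod> -n <namespace>
```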
Manual resolution: Deleting the /etc/cni/net.d/00-multus.conf file and the /etc/cni/net.d/multus.d directory and restarting the daemonset pod resolves this issue for subsequent pods being scheduled (a sketch of those steps follows below). Existing pods finish creating but leave the networking in a bad state (e.g., separate from the above example, after resolving I've seen "interface pod6c270ef2f25 not found: route ip+net: no such network interface").
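A sketch of that manual cleanup, assuming shell access to the affected node and the upstream daemonset's labels and namespace (app=multus in kube-system); adjust the names to your deployment:

```bash
# On the affected node (SSH/SSM): remove the generated Multus config and its
# multus.d directory
sudo rm -f /etc/cni/net.d/00-multus.conf
sudo rm -rf /etc/cni/net.d/multus.d

# From a machine with kubectl access: delete the Multus pod on that node so the
# daemonset recreates it and regenerates the config (label, namespace, and node
# name are placeholders)
kubectl -n kube-system delete pod -l app=multus \
  --field-selector spec.nodeName=<node-name>
```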
How to reproduce it (as minimally and precisely as possible): Not sure how to reproduce, unfortunately. I believe it's a race condition happening less than 1% of the time. My EKS cluster has instances constantly scaling up and down throughout the day, but I've only seen this 2-3x in the past few months.
Possibly related? https://github.com/k8snetworkplumbingwg/multus-cni/issues/1221, though this isn't happening after a node reboot, and I'm not using the thick plugin.
Anything else we need to know?: Below is the /etc/cni/net.d/00-multus.conf on a bad node, which I suspect is wrong. delegates is nested within delegates, and all the top-level fields are repeated again. It's like a bad merge happened. This is not the same as what's on a working node. (Sorry, I trimmed out some of the config when sharing with my team and the node doesn't exist anymore, so this is all I have):
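To illustrate the shape described above (this is not the actual config from the node, just a minimal sketch of delegates nested within delegates with the top-level fields repeated; all field values are placeholders):

```json
{
  "cniVersion": "0.4.0",
  "name": "multus-cni-network",
  "type": "multus",
  "kubeconfig": "/etc/cni/net.d/multus.d/multus.kubeconfig",
  "delegates": [
    {
      "cniVersion": "0.4.0",
      "name": "multus-cni-network",
      "type": "multus",
      "kubeconfig": "/etc/cni/net.d/multus.d/multus.kubeconfig",
      "delegates": [
        {
          "name": "aws-cni",
          "plugins": [
            { "type": "aws-cni" }
          ]
        }
      ]
    }
  ]
}
```

With the auto-generated config you would normally expect a single layer: one multus object at the top with the primary CNI's conflist as its delegate, not another multus object wrapped inside.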
/etc/cni/net.d/00-multus.conf on a WORKING node:

Potentially another clue: it's the norm for multus daemonset pods to fail 2x on startup with:
Happens to almost every new pod:
Found an issue related to this: https://github.com/k8snetworkplumbingwg/multus-cni/issues/1092. I am not using OVN-Kind, but I am running ovn-kubernetes as a secondary CNI. Not sure why ovn-kubernetes would affect this.
Environment:
Kubernetes version (use kubectl version): v1.25.16-eks-3af4770