k8snetworkplumbingwg / multus-cni

A CNI meta-plugin for multi-homed pods in Kubernetes
Apache License 2.0

[BUG] CNIs attach very slowly when deploying a large-scale Deployment #1344

Open jslouisyou opened 1 month ago

jslouisyou commented 1 month ago

What happened: CNI attachments are very slow when deploying at large scale, such as 2,400 Pods across 300 nodes. I configured 8 Pods per node, and each Pod uses 1 GPU and 5 SR-IOV VFs.

What you expected to happen: CNI attachments complete promptly, without errors.

How to reproduce it (as minimally and precisely as possible): Deploy a Deployment or DaemonSet whose Pods request SR-IOV VFs, as in the sketch below.
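
For illustration, a minimal sketch of the workload shape (the image and the SR-IOV resource name are hypothetical placeholders; the network names match the NetworkAttachmentDefinitions that appear in the events further down):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: mpi-test
spec:
  replicas: 2400            # 8 Pods per node across 300 nodes
  selector:
    matchLabels:
      app: mpi-test
  template:
    metadata:
      labels:
        app: mpi-test
      annotations:
        # One Multus attachment per SR-IOV VF; NAD names as in the events below
        k8s.v1.cni.cncf.io/networks: sriov-gpu-ib0,sriov-gpu-ib1,sriov-gpu-ib2,sriov-gpu-ib3,sriov-gpu-ib4
    spec:
      containers:
      - name: mpi
        image: mpi-test:latest           # hypothetical image
        resources:
          limits:
            nvidia.com/gpu: 1            # 1 GPU per Pod
            example.com/sriov_ib_vf: 5   # hypothetical VF resource name; in practice there may be one resource per VF pool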

Anything else we need to know?: I'm using sriov-network-operator v1.2.0 (note: I upgraded ib-sriov-cni from v1.0.2 to v1.0.3 to fix an error), multus-cni v3.8 (deployed by kubespray), and whereabouts v0.7, in order to use InfiniBand VFs in Kubernetes.
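
To gather more signal on where the time goes, Multus v3.x supports logFile and logLevel keys in its top-level config (commonly /etc/cni/net.d/00-multus.conf when deployed by kubespray). The sketch below is illustrative; only the two logging keys are the addition, and the kubespray-generated delegate layout should otherwise be kept as-is:

{
  "cniVersion": "0.3.1",
  "name": "multus-cni-network",
  "type": "multus",
  "logFile": "/var/log/multus.log",
  "logLevel": "debug",
  "kubeconfig": "/etc/cni/net.d/multus.d/multus.kubeconfig",
  "clusterNetwork": "k8s-pod-network"
}

With debug logging enabled, the per-delegate timestamps should show whether the time is spent in the ib-sriov plugin itself or in whereabouts IPAM.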

Here are the events from one of the Pods where the error occurs:

Events:
  Type     Reason                  Age                  From               Message
  ----     ------                  ----                 ----               -------
  Normal   Scheduled               48m                  default-scheduler  Successfully assigned default/mpi-test-d55775bb6-ccc5r to srh100-570
  Normal   AddedInterface          47m                  multus             Add eth0 [10.11.244.50/32] from k8s-pod-network
  Normal   AddedInterface          47m                  multus             Add net1 [192.168.192.96/20] from default/sriov-gpu-ib0
  Warning  FailedCreatePodSandBox  43m (x5 over 43m)    kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to reserve sandbox name "mpi-test-d55775bb6-ccc5r_default_148c0437-446b-4a18-98ee-0f140509ed75_0": name "mpi-test-d55775bb6-ccc5r_default_148c0437-446b-4a18-98ee-0f140509ed75_0" is reserved for "a4861aef5ea2610c7701b895918a10d823195ec858daece333f9654523d58142"
  Normal   AddedInterface          41m                  multus             Add eth0 [10.11.244.55/32] from k8s-pod-network
  Normal   AddedInterface          37m                  multus             Add eth0 [10.11.244.56/32] from k8s-pod-network
  Normal   AddedInterface          33m                  multus             Add net1 [192.168.193.101/20] from default/sriov-gpu-ib0
  Warning  FailedCreatePodSandBox  33m (x3 over 43m)    kubelet            Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Warning  FailedCreatePodSandBox  32m (x5 over 32m)    kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to reserve sandbox name "mpi-test-d55775bb6-ccc5r_default_148c0437-446b-4a18-98ee-0f140509ed75_1": name "mpi-test-d55775bb6-ccc5r_default_148c0437-446b-4a18-98ee-0f140509ed75_1" is reserved for "41c1b5157dec676c45f030ee123da5bbd9f26a1660efac5e6d2b89516d0baee3"
  Normal   SandboxChanged          29m (x9 over 42m)    kubelet            Pod sandbox changed, it will be killed and re-created.
  Normal   AddedInterface          25m                  multus             Add eth0 [10.11.244.49/32] from k8s-pod-network
  Normal   AddedInterface          21m                  multus             Add net1 [192.168.194.14/20] from default/sriov-gpu-ib0
  Normal   AddedInterface          17m                  multus             Add net2 [192.168.211.217/20] from default/sriov-gpu-ib1
  Warning  FailedCreatePodSandBox  15m                  kubelet            Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Warning  FailedCreatePodSandBox  14m (x5 over 15m)    kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to reserve sandbox name "mpi-test-d55775bb6-ccc5r_default_148c0437-446b-4a18-98ee-0f140509ed75_2": name "mpi-test-d55775bb6-ccc5r_default_148c0437-446b-4a18-98ee-0f140509ed75_2" is reserved for "ed53bee3044f76c889eaf8054222818fa31508ca52ed37292fdb5873d4afec50"
  Normal   SandboxChanged          9m14s (x8 over 29m)  kubelet            Pod sandbox changed, it will be killed and re-created.
  Normal   AddedInterface          4m37s                multus             Add eth0 [10.11.244.55/32] from k8s-pod-network
  Normal   AddedInterface          2m47s                multus             Add net1 [192.168.200.21/20] from default/sriov-gpu-ib0
  Normal   AddedInterface          112s                 multus             Add net2 [192.168.212.133/20] from default/sriov-gpu-ib1
  Normal   AddedInterface          74s                  multus             Add net3 [192.168.226.3/20] from default/sriov-gpu-ib2
  Normal   AddedInterface          68s                  multus             Add net4 [192.168.240.214/20] from default/sriov-gpu-ib3
  Normal   AddedInterface          58s                  multus             Add net5 [192.169.0.105/20] from default/sriov-gpu-ib4

When I deployed, the first attempt at each Multus network attachment took a very long time, around 4 minutes per attachment. A timeout then appeared to fire, so kubelet killed and re-created the container sandbox, Multus attached the networks again, and this cycle repeated several times.
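
For what it's worth, the repeated DeadlineExceeded errors look like kubelet's CRI runtime request timeout expiring while the sandbox's CNI ADDs are still in flight, which would explain the kill/re-create loop. A minimal sketch of raising it, assuming the kubelet is configured via a KubeletConfiguration file (whether this actually helps at this scale is untested):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Default is 2m0s; if all five SR-IOV attachments plus IPAM take longer,
# kubelet aborts the sandbox and re-creates it, as in the events above.
runtimeRequestTimeout: 10m0s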

On the final attempt the networks were attached successfully, but by then I had deleted the Deployment, so almost all Pods were already in the Terminating phase.

And here is one of the NetworkAttachmentDefinitions I use (there are several similar ones, sriov-gpu-ib0 through sriov-gpu-ib4):

apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-gpu-ib2
  namespace: default
spec:
  config: |-
    {
      "cniVersion": "0.3.1",
      "name": "sriov_gpu_ib2",
      "plugins": [
        {
          "type": "ib-sriov",
          "link_state": "enable",
          "rdmaIsolation": true,
          "ibKubernetesEnabled": false,
          "ipam": {
            "datastore": "kubernetes",
            "kubernetes": {
              "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
            },
            "log_file": "/tmp/whereabouts.log",
            "log_level": "debug",
            "type": "whereabouts",
            "enable_overlapping_ranges": false,
            "range": "192.168.224.0/20"
          }
        }
      ]
    }
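
One hedged suspicion about where the time goes: whereabouts serializes allocations within a range via a cluster-wide leader-election lock, so 2,400 Pods each requesting 5 IPs can queue behind one another. If your whereabouts version exposes them, the leader-election knobs in the sketch below (an assumption based on the whereabouts IPAMConfig fields, values in milliseconds) can be tuned inside the same ipam block; checking /tmp/whereabouts.log on a slow node should confirm whether allocation is the bottleneck:

{
  "type": "whereabouts",
  "datastore": "kubernetes",
  "kubernetes": {
    "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
  },
  "range": "192.168.224.0/20",
  "enable_overlapping_ranges": false,
  "leader_lease_duration": 3000,
  "leader_renew_deadline": 2500,
  "leader_retry_period": 1000
}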

Does anyone have an idea why Multus takes so long to attach the networks to Pods?

Thanks.

Environment: