k8snetworkplumbingwg / sriov-network-device-plugin

SRIOV network device plugin for Kubernetes
Apache License 2.0

pod with SRIOV net device cannot be created #261

Closed: peng-isi closed this issue 4 years ago

peng-isi commented 4 years ago

What issue would you like to bring attention to?

I tried to create a pod with an SRIOV net device (e.g. Mellanox IB), but the pod is stuck in ContainerCreating. I configured 4 VFs on the IB interface of the host, and I am running the device plugin pod and the Multus CNI meta-plugin.

1. The device plugin detects the SRIOV net device on the host (hp6 in my experiment). The node's allocatable resources are:

    # kubectl get node hp6 -o json | jq '.status.allocatable'
    {
      "cpu": "16",
      "devices.kubevirt.io/kvm": "110",
      "devices.kubevirt.io/tun": "110",
      "devices.kubevirt.io/vhost-net": "110",
      "ephemeral-storage": "530610004728",
      "hugepages-1Gi": "0",
      "hugepages-2Mi": "0",
      "mellanox.com/mlnx_sriov_netdevice": "4",
      "memory": "49382120Ki",
      "pods": "110"
    }

2. The NetworkAttachmentDefinition I used is:

    apiVersion: "k8s.cni.cncf.io/v1"
    kind: NetworkAttachmentDefinition
    metadata:
      name: sriov-net1
      annotations:
        k8s.v1.cni.cncf.io/resourceName: mellanox.com/mlnx_sriov_netdevice
    spec:
      config: '{
        "type": "sriov",
        "cniVersion": "0.3.1",
        "name": "sriov-network",
        "ipam": {
          "type": "host-local",
          "subnet": "10.56.217.0/24",
          "routes": [{ "dst": "0.0.0.0/0" }],
          "gateway": "10.56.217.1"
        }
      }'

3. The pod is created from this spec:

    apiVersion: v1
    kind: Pod
    metadata:
      name: testpod1
      annotations:
        k8s.v1.cni.cncf.io/networks: sriov-net1
    spec:
      containers:
      - name: appcntr1
        image: centos/tools
        imagePullPolicy: IfNotPresent
        command: [ "/bin/bash", "-c", "--" ]
        args: [ "while true; do sleep 300000; done;" ]
        resources:
          requests:
            mellanox.com/mlnx_sriov_netdevice: '1'
          limits:
            mellanox.com/mlnx_sriov_netdevice: '1'

4. The `kubectl describe` output for the pod (testpod1) is:

    [hp]# kubectl describe pod testpod
    Name:         testpod1
    Namespace:    default
    Priority:     0
    Node:         hp6/10.13.206.66
    Start Time:   Thu, 06 Aug 2020 23:32:03 -0400
    Labels:
    Annotations:  k8s.v1.cni.cncf.io/networks: sriov-net1
    Status:       Pending
    IP:
    IPs:
    Containers:
      appcntr1:
        Container ID:
        Image:          centos/tools
        Image ID:
        Port:
        Host Port:
        Command:
          /bin/bash
          -c
        Args:
          while true; do sleep 300000; done;
        State:          Waiting
          Reason:       ContainerCreating
        Ready:          False
        Restart Count:  0
        Limits:
          mellanox.com/mlnx_sriov_netdevice:  1
        Requests:
          mellanox.com/mlnx_sriov_netdevice:  1
        Environment:                          <none>
        Mounts:
          /var/run/secrets/kubernetes.io/serviceaccount from default-token-p7lqf (ro)
    Conditions:
      Type              Status
      Initialized       True
      Ready             False
      ContainersReady   False
      PodScheduled      True
    Volumes:
      default-token-p7lqf:
        Type:        Secret (a volume populated by a Secret)
        SecretName:  default-token-p7lqf
        Optional:    false
    QoS Class:       BestEffort
    Node-Selectors:
    Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                     node.kubernetes.io/unreachable:NoExecute for 300s
    Events:
      Type     Reason                  Age  From               Message
      Normal   Scheduled               5s   default-scheduler  Successfully assigned default/testpod1 to hp6
      Warning  FailedCreatePodSandBox       kubelet, hp6       Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "32e2882a38d1fd4bfbc80effa237a4f3fed390597fd1148524db7775f0036fd7" network for pod "testpod1": networkPlugin cni failed to set up pod "testpod1_default" network: Multus: [default/testpod1]: error adding container to network "sriov-network": delegateAdd: error invoking DelegateAdd - "sriov": error in getting result from AddNetwork: SRIOV-CNI failed to configure VF "failed to find vf 3", failed to clean up sandbox container "32e2882a38d1fd4bfbc80effa237a4f3fed390597fd1148524db7775f0036fd7" network for pod "testpod1": networkPlugin cni failed to teardown pod "testpod1_default" network: delegateDel: error invoking DelegateDel - "sriov": error in getting result from DelNetwork: error reading cached NetConf in /var/lib/cni/sriov with name 32e2882a38d1fd4bfbc80effa237a4f3fed390597fd1148524db7775f0036fd7-net1]
      Warning  FailedCreatePodSandBox       kubelet, hp6       Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "a45d61595967926346dc3853301d6f5a1aeaffbe2ecb9bbd026063bd13fff643" network for pod "testpod1": networkPlugin cni failed to set up pod "testpod1_default" network: Multus: [default/testpod1]: error adding container to network "sriov-network": delegateAdd: error invoking DelegateAdd - "sriov": error in getting result from AddNetwork: SRIOV-CNI failed to configure VF "failed to find vf 3", failed to clean up sandbox container "a45d61595967926346dc3853301d6f5a1aeaffbe2ecb9bbd026063bd13fff643" network for pod "testpod1": networkPlugin cni failed to teardown pod "testpod1_default" network: delegateDel: error invoking DelegateDel - "sriov": error in getting result from DelNetwork: error reading cached NetConf in /var/lib/cni/sriov with name a45d61595967926346dc3853301d6f5a1aeaffbe2ecb9bbd026063bd13fff643-net1]
      Normal   SandboxChanged          (x2 over )  kubelet, hp6  Pod sandbox changed, it will be killed and re-created.

What is the impact of this issue?

Do you have a proposed response or remediation for the issue?

zshi-redhat commented 4 years ago

@peng-isi You might want to try ib-sriov-cni (the CNI plugin for InfiniBand SRIOV devices) instead of sriov-cni.
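
For readers hitting the same error, here is a minimal sketch of the NetworkAttachmentDefinition from the report with only the delegate type changed. It assumes the ib-sriov-cni binary is installed in the node's CNI bin directory and that it registers under the type name `ib-sriov`, as that project's README describes. Neither detail is stated in this thread, so check both against the version you deploy.

```yaml
# Sketch only: same resource name and IPAM settings as the original report.
# Assumption: the InfiniBand SRIOV CNI plugin is invoked with type "ib-sriov".
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-net1
  annotations:
    k8s.v1.cni.cncf.io/resourceName: mellanox.com/mlnx_sriov_netdevice
spec:
  config: '{
    "type": "ib-sriov",
    "cniVersion": "0.3.1",
    "name": "sriov-network",
    "ipam": {
      "type": "host-local",
      "subnet": "10.56.217.0/24",
      "routes": [{ "dst": "0.0.0.0/0" }],
      "gateway": "10.56.217.1"
    }
  }'
```

The pod spec itself should not need to change, since it refers to the network by name (`sriov-net1`) and to the device by resource name.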

peng-isi commented 4 years ago

@zshi-redhat thanks. Your suggestion works!

zshi-redhat commented 4 years ago

> @zshi-redhat thanks. Your suggestion works!

Cool, glad to hear it worked!