k8snetworkplumbingwg / ib-sriov-cni

InfiniBand SR-IOV CNI

mellanox SRIOV demo pod cannot be created #66

Open jason-gideon opened 2 years ago

jason-gideon commented 2 years ago

I tried to create a pod with an SR-IOV net device (Mellanox InfiniBand), but the pod is stuck in ContainerCreating. I configured 4 VFs on the host's IB interface, and I am running the SR-IOV device plugin pod and the Multus CNI meta-plugin, but the SR-IOV demo pod shows the error below.

multus

./multus-daemonset-thick-plugin.yml:125: image: ghcr.io/k8snetworkplumbingwg/multus-cni:v3.9.2-thick-amd64

ERROR

n-MacBookPro:~/20-k8s-rdma-sriov/ib-sriov-cni/deployment/examples$ kubectl describe po my-test-pod-fnjk7
Name:         my-test-pod-fnjk7
Namespace:    default
Priority:     0
Node:         s-113-2-35/10.113.2.35
Start Time:   Tue, 22 Nov 2022 20:22:33 +0800
Labels:       <none>
Annotations:  cni.projectcalico.org/containerID: 848157aeb2b3549aa8e2fce419c8353989ecb98ad62b1c6513f46423492f6cfd
              cni.projectcalico.org/podIP:
              cni.projectcalico.org/podIPs:
              k8s.v1.cni.cncf.io/networks: [{"name": "ib-sriov-network"}]
Status:       Pending
IP:
IPs:          <none>
Containers:
  my-test-ctr:
    Container ID:
    Image:         mellanox/rping-test
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
      sleep 1000000

    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Limits:
      mellanox.com/mlnx_sriov_rdma_ib:  1
    Requests:
      mellanox.com/mlnx_sriov_rdma_ib:  1
    Environment:                        <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-2clfq (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  kube-api-access-2clfq:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                  Age                From               Message
  ----     ------                  ----               ----               -------
  Normal   Scheduled               21s                default-scheduler  Successfully assigned default/my-test-pod-fnjk7 to s-113-2-35
  Normal   AddedInterface          21s                multus             Add eth0 [10.42.0.21/32] from k8s-pod-network
  Warning  FailedCreatePodSandBox  21s                kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "ef4b067661534edfacd217cb1ea3cb1b2cdd44f65ffc1067a59091a2ae6490be" network for pod "my-test-pod-fnjk7": networkPlugin cni failed to set up pod "my-test-pod-fnjk7_default" network: [default/my-test-pod-fnjk7/:sriov-network]: error adding container to network "sriov-network": infiniBand SRI-OV CNI failed to configure VF "VF ib4 GUID is not valid", failed to clean up sandbox container "ef4b067661534edfacd217cb1ea3cb1b2cdd44f65ffc1067a59091a2ae6490be" network for pod "my-test-pod-fnjk7": networkPlugin cni failed to teardown pod "my-test-pod-fnjk7_default" network: delegateDel: error invoking DelegateDel - "ib-sriov": error in getting result from DelNetwork: error reading cached NetConf in /var/lib/cni/ib-sriov with name ef4b067661534edfacd217cb1ea3cb1b2cdd44f65ffc1067a59091a2ae6490be-net1]
  Normal   AddedInterface          20s                multus             Add eth0 [10.42.0.22/32] from k8s-pod-network
  Normal   AddedInterface          19s                multus             Add eth0 [10.42.0.23/32] from k8s-pod-network
  Warning  FailedCreatePodSandBox  19s                kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "3573ba2407bcf6bacb171e5e8b32980ff549a59de1bd8b119d89f6304ae69b7c" network for pod "my-test-pod-fnjk7": networkPlugin cni failed to set up pod "my-test-pod-fnjk7_default" network: [default/my-test-pod-fnjk7/:sriov-network]: error adding container to network "sriov-network": infiniBand SRI-OV CNI failed to configure VF "VF ib4 GUID is not valid", failed to clean up sandbox container "3573ba2407bcf6bacb171e5e8b32980ff549a59de1bd8b119d89f6304ae69b7c" network for pod "my-test-pod-fnjk7": networkPlugin cni failed to teardown pod "my-test-pod-fnjk7_default" network: delegateDel: error invoking DelegateDel - "ib-sriov": error in getting result from DelNetwork: error reading cached NetConf in /var/lib/cni/ib-sriov with name 3573ba2407bcf6bacb171e5e8b32980ff549a59de1bd8b119d89f6304ae69b7c-net1]
  Warning  FailedCreatePodSandBox  18s                kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "0cdbf8cb322a3156d88f04a52c2bea0fc51511ffa6d21b4db9aa4ae44dc858e2" network for pod "my-test-pod-fnjk7": networkPlugin cni failed to set up pod "my-test-pod-fnjk7_default" network: [default/my-test-pod-fnjk7/:sriov-network]: error adding container to network "sriov-network": infiniBand SRI-OV CNI failed to configure VF "VF ib4 GUID is not valid", failed to clean up sandbox container "0cdbf8cb322a3156d88f04a52c2bea0fc51511ffa6d21b4db9aa4ae44dc858e2" network for pod "my-test-pod-fnjk7": networkPlugin cni failed to teardown pod "my-test-pod-fnjk7_default" network: delegateDel: error invoking DelegateDel - "ib-sriov": error in getting result from DelNetwork: error reading cached NetConf in /var/lib/cni/ib-sriov with name 0cdbf8cb322a3156d88f04a52c2bea0fc51511ffa6d21b4db9aa4ae44dc858e2-net1]
  Normal   AddedInterface          18s                multus             Add eth0 [10.42.0.24/32] from k8s-pod-network
  Warning  FailedCreatePodSandBox  17s                kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "93bbd85125dc93d15558f34aa2693d13781db6d38905925814151160ef405dc9" network for pod "my-test-pod-fnjk7": networkPlugin cni failed to set up pod "my-test-pod-fnjk7_default" network: [default/my-test-pod-fnjk7/:sriov-network]: error adding container to network "sriov-network": infiniBand SRI-OV CNI failed to configure VF "VF ib4 GUID is not valid", failed to clean up sandbox container "93bbd85125dc93d15558f34aa2693d13781db6d38905925814151160ef405dc9" network for pod "my-test-pod-fnjk7": networkPlugin cni failed to teardown pod "my-test-pod-fnjk7_default" network: delegateDel: error invoking DelegateDel - "ib-sriov": error in getting result from DelNetwork: error reading cached NetConf in /var/lib/cni/ib-sriov with name 93bbd85125dc93d15558f34aa2693d13781db6d38905925814151160ef405dc9-net1]
  Normal   AddedInterface          17s                multus             Add eth0 [10.42.0.25/32] from k8s-pod-network
  Warning  FailedCreatePodSandBox  16s                kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "8ea7b9cda5014ae0e8a3f335903e83c542156c4ec8de84c80a627ef3c3473cb1" network for pod "my-test-pod-fnjk7": networkPlugin cni failed to set up pod "my-test-pod-fnjk7_default" network: [default/my-test-pod-fnjk7/:sriov-network]: error adding container to network "sriov-network": infiniBand SRI-OV CNI failed to configure VF "VF ib4 GUID is not valid", failed to clean up sandbox container "8ea7b9cda5014ae0e8a3f335903e83c542156c4ec8de84c80a627ef3c3473cb1" network for pod "my-test-pod-fnjk7": networkPlugin cni failed to teardown pod "my-test-pod-fnjk7_default" network: delegateDel: error invoking DelegateDel - "ib-sriov": error in getting result from DelNetwork: error reading cached NetConf in /var/lib/cni/ib-sriov with name 8ea7b9cda5014ae0e8a3f335903e83c542156c4ec8de84c80a627ef3c3473cb1-net1]
  Normal   AddedInterface          16s                multus             Add eth0 [10.42.0.26/32] from k8s-pod-network
  Warning  FailedCreatePodSandBox  15s                kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "922f59df03433b78b31201f685867ac475fcb96c5b4791eecd642fe87b5ae365" network for pod "my-test-pod-fnjk7": networkPlugin cni failed to set up pod "my-test-pod-fnjk7_default" network: [default/my-test-pod-fnjk7/:sriov-network]: error adding container to network "sriov-network": infiniBand SRI-OV CNI failed to configure VF "VF ib4 GUID is not valid", failed to clean up sandbox container "922f59df03433b78b31201f685867ac475fcb96c5b4791eecd642fe87b5ae365" network for pod "my-test-pod-fnjk7": networkPlugin cni failed to teardown pod "my-test-pod-fnjk7_default" network: delegateDel: error invoking DelegateDel - "ib-sriov": error in getting result from DelNetwork: error reading cached NetConf in /var/lib/cni/ib-sriov with name 922f59df03433b78b31201f685867ac475fcb96c5b4791eecd642fe87b5ae365-net1]
  Normal   AddedInterface          15s                multus             Add eth0 [10.42.0.27/32] from k8s-pod-network
  Normal   AddedInterface          14s                multus             Add eth0 [10.42.0.28/32] from k8s-pod-network
  Warning  FailedCreatePodSandBox  14s                kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "68c5c26e73706571b562dfa035e6b53e848f7cc18c85b8a3995f0a2a3c338b97" network for pod "my-test-pod-fnjk7": networkPlugin cni failed to set up pod "my-test-pod-fnjk7_default" network: [default/my-test-pod-fnjk7/:sriov-network]: error adding container to network "sriov-network": infiniBand SRI-OV CNI failed to configure VF "VF ib4 GUID is not valid", failed to clean up sandbox container "68c5c26e73706571b562dfa035e6b53e848f7cc18c85b8a3995f0a2a3c338b97" network for pod "my-test-pod-fnjk7": networkPlugin cni failed to teardown pod "my-test-pod-fnjk7_default" network: delegateDel: error invoking DelegateDel - "ib-sriov": error in getting result from DelNetwork: error reading cached NetConf in /var/lib/cni/ib-sriov with name 68c5c26e73706571b562dfa035e6b53e848f7cc18c85b8a3995f0a2a3c338b97-net1]
  Warning  FailedCreatePodSandBox  13s                kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "60fda0e94bf41698460e2406a00d6443299a9b176da7ed8004f39adfc2bb16e0" network for pod "my-test-pod-fnjk7": networkPlugin cni failed to set up pod "my-test-pod-fnjk7_default" network: [default/my-test-pod-fnjk7/:sriov-network]: error adding container to network "sriov-network": infiniBand SRI-OV CNI failed to configure VF "VF ib4 GUID is not valid", failed to clean up sandbox container "60fda0e94bf41698460e2406a00d6443299a9b176da7ed8004f39adfc2bb16e0" network for pod "my-test-pod-fnjk7": networkPlugin cni failed to teardown pod "my-test-pod-fnjk7_default" network: delegateDel: error invoking DelegateDel - "ib-sriov": error in getting result from DelNetwork: error reading cached NetConf in /var/lib/cni/ib-sriov with name 60fda0e94bf41698460e2406a00d6443299a9b176da7ed8004f39adfc2bb16e0-net1]
  Normal   AddedInterface          12s                multus             Add eth0 [10.42.0.29/32] from k8s-pod-network
  Warning  FailedCreatePodSandBox  12s                kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "777d1178ca6d8681b1f0f43780fb357c0dce74a6905c94337c2f07ef9a5c9c36" network for pod "my-test-pod-fnjk7": networkPlugin cni failed to set up pod "my-test-pod-fnjk7_default" network: [default/my-test-pod-fnjk7/:sriov-network]: error adding container to network "sriov-network": infiniBand SRI-OV CNI failed to configure VF "VF ib4 GUID is not valid", failed to clean up sandbox container "777d1178ca6d8681b1f0f43780fb357c0dce74a6905c94337c2f07ef9a5c9c36" network for pod "my-test-pod-fnjk7": networkPlugin cni failed to teardown pod "my-test-pod-fnjk7_default" network: delegateDel: error invoking DelegateDel - "ib-sriov": error in getting result from DelNetwork: error reading cached NetConf in /var/lib/cni/ib-sriov with name 777d1178ca6d8681b1f0f43780fb357c0dce74a6905c94337c2f07ef9a5c9c36-net1]
  Normal   AddedInterface          11s                multus             Add eth0 [10.42.0.30/32] from k8s-pod-network

The device plugin does detect the SR-IOV net devices on the host (node s-113-2-35 in my experiment), as the following output shows:

-MacBookPro:~/20-k8s-rdma-sriov/multus-cni/deployments$ kubectl get node s-113-2-35 -o json | jq '.status.allocatable'
{
  "cpu": "128",
  "ephemeral-storage": "5169411933432",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "mellanox.com/mlnx_sriov_rdma_ib": "4",
  "memory": "528110968Ki",
  "pods": "110"
}

NAD

apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: ib-sriov-network
  annotations:
    k8s.v1.cni.cncf.io/resourceName: mellanox.com/mlnx_sriov_rdma_ib
spec:
  config: '{
  "type": "ib-sriov",
  "cniVersion": "0.3.1",
  "name": "sriov-network",
  "ipam": {
    "type": "host-local",
    "subnet": "192.168.217.0/24",
    "routes": [{
      "dst": "0.0.0.0/0"
    }],
    "gateway": "192.168.217.1"
  }
}'

SR-IOV device plugin configmap

apiVersion: v1
kind: ConfigMap
metadata:
  name: sriovdp-config
  namespace: kube-system
data:
  config.json: |
    {
        "resourceList": [{
                "resourcePrefix": "mellanox.com",
                "resourceName": "mlnx_sriov_rdma_ib",
                "selectors": {
                    "isRdma": true,
                    "vendors": ["15b3"],
                    "devices": ["101c"],
                    "drivers": ["mlx5_core"]
                }
            }
        ]
    }

sriov device plugin

n-MacBookPro:~/20-k8s-rdma-sriov/multus-cni/deployments$ kubectl -n kube-system logs kube-sriov-device-plugin-amd64-bpwlk
I1122 11:59:59.507695       1 manager.go:51] Using Kubelet Plugin Registry Mode
I1122 11:59:59.508691       1 main.go:44] resource manager reading configs
I1122 11:59:59.508739       1 manager.go:79] raw ResourceList: {
    "resourceList": [{
            "resourcePrefix": "mellanox.com",
            "resourceName": "mlnx_sriov_rdma_ib",
            "selectors": {
                "isRdma": true,
                "vendors": ["15b3"],
                "devices": ["101c"],
                "drivers": ["mlx5_core"]
            }
        }
    ]
}
I1122 11:59:59.508875       1 factory.go:166] net device selector for resource mlnx_sriov_rdma_ib is &{DeviceSelectors:{Vendors:[15b3] Devices:[101c] Drivers:[mlx5_core] PciAddresses:[]} PfNames:[] RootDevices:[] LinkTypes:[] DDPProfiles:[] IsRdma:true NeedVhostNet:false}
I1122 11:59:59.508902       1 manager.go:99] unmarshalled ResourceList: [{ResourcePrefix:mellanox.com ResourceName:mlnx_sriov_rdma_ib DeviceType:netDevice Selectors:0xc00000cd38 SelectorObj:0xc000375380}]
I1122 11:59:59.508960       1 manager.go:200] validating resource name "mellanox.com/mlnx_sriov_rdma_ib"
I1122 11:59:59.508968       1 main.go:60] Discovering host devices
I1122 11:59:59.589424       1 netDeviceProvider.go:84] netdevice AddTargetDevices(): device found: 0000:c2:00.0 02              Intel Corporation    Ethernet Controller X710 for 10GbE SFP+
I1122 11:59:59.589938       1 netDeviceProvider.go:84] netdevice AddTargetDevices(): device found: 0000:c2:00.1 02              Intel Corporation    Ethernet Controller X710 for 10GbE SFP+
I1122 11:59:59.590256       1 netDeviceProvider.go:84] netdevice AddTargetDevices(): device found: 0000:c3:00.0 02              Mellanox Technolo... MT28908 Family [ConnectX-6]
I1122 11:59:59.591462       1 netDeviceProvider.go:84] netdevice AddTargetDevices(): device found: 0000:c3:00.1 02              Mellanox Technolo... MT28908 Family [ConnectX-6]
I1122 11:59:59.591704       1 netDeviceProvider.go:84] netdevice AddTargetDevices(): device found: 0000:c3:00.2 02              Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I1122 11:59:59.591894       1 netDeviceProvider.go:84] netdevice AddTargetDevices(): device found: 0000:c3:00.3 02              Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I1122 11:59:59.592053       1 netDeviceProvider.go:84] netdevice AddTargetDevices(): device found: 0000:c3:00.4 02              Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I1122 11:59:59.592203       1 netDeviceProvider.go:84] netdevice AddTargetDevices(): device found: 0000:c3:00.5 02              Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I1122 11:59:59.592383       1 accelDeviceProvider.go:82] accelerator AddTargetDevices(): device found: 0000:01:00.0     12              unknown              unknown
I1122 11:59:59.592392       1 accelDeviceProvider.go:82] accelerator AddTargetDevices(): device found: 0000:22:00.0     12              unknown              unknown
I1122 11:59:59.592397       1 accelDeviceProvider.go:82] accelerator AddTargetDevices(): device found: 0000:41:00.0     12              unknown              unknown
I1122 11:59:59.592403       1 accelDeviceProvider.go:82] accelerator AddTargetDevices(): device found: 0000:61:00.0     12              unknown              unknown
I1122 11:59:59.592407       1 accelDeviceProvider.go:82] accelerator AddTargetDevices(): device found: 0000:81:00.0     12              unknown              unknown
I1122 11:59:59.592412       1 accelDeviceProvider.go:82] accelerator AddTargetDevices(): device found: 0000:a1:00.0     12              unknown              unknown
I1122 11:59:59.592417       1 accelDeviceProvider.go:82] accelerator AddTargetDevices(): device found: 0000:c1:00.0     12              unknown              unknown
I1122 11:59:59.592421       1 accelDeviceProvider.go:82] accelerator AddTargetDevices(): device found: 0000:e1:00.0     12              unknown              unknown
I1122 11:59:59.592429       1 main.go:66] Initializing resource servers
I1122 11:59:59.592731       1 manager.go:105] number of config: 1
I1122 11:59:59.592739       1 manager.go:109]
I1122 11:59:59.592742       1 manager.go:110] Creating new ResourcePool: mlnx_sriov_rdma_ib
I1122 11:59:59.592746       1 manager.go:111] DeviceType: netDevice
W1122 11:59:59.592779       1 pciNetDevice.go:55] RDMA resources for 0000:c2:00.0 not found. Are RDMA modules loaded?
I1122 11:59:59.593104       1 utils.go:71] Devlink query for eswitch mode is not supported for device 0000:c2:00.0. error getting devlink device attributes for net device 0000:c2:00.0 no such device
W1122 11:59:59.593215       1 pciNetDevice.go:55] RDMA resources for 0000:c2:00.1 not found. Are RDMA modules loaded?
I1122 11:59:59.593362       1 utils.go:71] Devlink query for eswitch mode is not supported for device 0000:c2:00.1. error getting devlink device attributes for net device 0000:c2:00.1 no such device
I1122 11:59:59.594005       1 utils.go:71] Devlink query for eswitch mode is not supported for device 0000:c3:00.1. <nil>
I1122 11:59:59.596385       1 utils.go:71] Devlink query for eswitch mode is not supported for device 0000:c3:00.2. <nil>
I1122 11:59:59.597465       1 utils.go:71] Devlink query for eswitch mode is not supported for device 0000:c3:00.3. <nil>
I1122 11:59:59.598273       1 utils.go:71] Devlink query for eswitch mode is not supported for device 0000:c3:00.4. <nil>
I1122 11:59:59.599262       1 utils.go:71] Devlink query for eswitch mode is not supported for device 0000:c3:00.5. <nil>
I1122 11:59:59.599408       1 factory.go:106] device added: [pciAddr: 0000:c3:00.2, vendor: 15b3, device: 101c, driver: mlx5_core]
I1122 11:59:59.599417       1 factory.go:106] device added: [pciAddr: 0000:c3:00.3, vendor: 15b3, device: 101c, driver: mlx5_core]
I1122 11:59:59.599423       1 factory.go:106] device added: [pciAddr: 0000:c3:00.4, vendor: 15b3, device: 101c, driver: mlx5_core]
I1122 11:59:59.599428       1 factory.go:106] device added: [pciAddr: 0000:c3:00.5, vendor: 15b3, device: 101c, driver: mlx5_core]
I1122 11:59:59.599446       1 manager.go:139] New resource server is created for mlnx_sriov_rdma_ib ResourcePool
I1122 11:59:59.599454       1 main.go:72] Starting all servers...
I1122 11:59:59.599885       1 server.go:199] starting mlnx_sriov_rdma_ib device plugin endpoint at: mellanox.com_mlnx_sriov_rdma_ib.sock
I1122 11:59:59.602783       1 server.go:226] mlnx_sriov_rdma_ib device plugin endpoint started serving
I1122 11:59:59.602805       1 main.go:77] All servers started.
I1122 11:59:59.602811       1 main.go:78] Listening for term signals
I1122 12:00:00.175755       1 server.go:110] Plugin: mellanox.com_mlnx_sriov_rdma_ib.sock gets registered successfully at Kubelet
I1122 12:00:00.175875       1 server.go:134] ListAndWatch(mlnx_sriov_rdma_ib) invoked
I1122 12:00:00.175890       1 server.go:142] ListAndWatch(mlnx_sriov_rdma_ib): send devices &ListAndWatchResponse{Devices:[]*Device{&Device{ID:0000:c3:00.4,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:c3:00.5,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:c3:00.2,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:c3:00.3,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},},}
I1122 12:04:42.983933       1 server.go:119] Allocate() called with &AllocateRequest{ContainerRequests:[]*ContainerAllocateRequest{&ContainerAllocateRequest{DevicesIDs:[0000:c3:00.3],},},}
I1122 12:04:42.984024       1 netResourcePool.go:51] GetDeviceSpecs(): for devices: [0000:c3:00.3]
I1122 12:04:42.984044       1 pool_stub.go:97] GetEnvs(): for devices: [0000:c3:00.3]
I1122 12:04:42.984052       1 pool_stub.go:113] GetMounts(): for devices: [0000:c3:00.3]
I1122 12:04:42.984059       1 server.go:128] AllocateResponse send: &AllocateResponse{ContainerResponses:[]*ContainerAllocateResponse{&ContainerAllocateResponse{Envs:map[string]string{PCIDEVICE_MELLANOX_COM_MLNX_SRIOV_RDMA_IB: 0000:c3:00.3,},Mounts:[]*Mount{},Devices:[]*DeviceSpec{&DeviceSpec{ContainerPath:/dev/infiniband/issm3,HostPath:/dev/infiniband/issm3,Permissions:rwm,},&DeviceSpec{ContainerPath:/dev/infiniband/umad3,HostPath:/dev/infiniband/umad3,Permissions:rwm,},&DeviceSpec{ContainerPath:/dev/infiniband/uverbs3,HostPath:/dev/infiniband/uverbs3,Permissions:rwm,},&DeviceSpec{ContainerPath:/dev/infiniband/rdma_cm,HostPath:/dev/infiniband/rdma_cm,Permissions:rwm,},},Annotations:map[string]string{},},},}
I1122 12:22:33.340229       1 server.go:119] Allocate() called with &AllocateRequest{ContainerRequests:[]*ContainerAllocateRequest{&ContainerAllocateRequest{DevicesIDs:[0000:c3:00.4],},},}
I1122 12:22:33.340326       1 netResourcePool.go:51] GetDeviceSpecs(): for devices: [0000:c3:00.4]
I1122 12:22:33.340347       1 pool_stub.go:97] GetEnvs(): for devices: [0000:c3:00.4]
I1122 12:22:33.340355       1 pool_stub.go:113] GetMounts(): for devices: [0000:c3:00.4]
I1122 12:22:33.340362       1 server.go:128] AllocateResponse send: &AllocateResponse{ContainerResponses:[]*ContainerAllocateResponse{&ContainerAllocateResponse{Envs:map[string]string{PCIDEVICE_MELLANOX_COM_MLNX_SRIOV_RDMA_IB: 0000:c3:00.4,},Mounts:[]*Mount{},Devices:[]*DeviceSpec{&DeviceSpec{ContainerPath:/dev/infiniband/issm4,HostPath:/dev/infiniband/issm4,Permissions:rwm,},&DeviceSpec{ContainerPath:/dev/infiniband/umad4,HostPath:/dev/infiniband/umad4,Permissions:rwm,},&DeviceSpec{ContainerPath:/dev/infiniband/uverbs4,HostPath:/dev/infiniband/uverbs4,Permissions:rwm,},&DeviceSpec{ContainerPath:/dev/infiniband/rdma_cm,HostPath:/dev/infiniband/rdma_cm,Permissions:rwm,},},Annotations:map[string]string{},},},}
jason-gideon commented 2 years ago

I printed the GUID, and it is all zeros. How can I fix this?

n-MacBookPro:~/20-k8s-rdma-sriov/ib-sriov-cni/deployment/examples$ kubectl describe pod my-test-pod
Name:         my-test-pod
Namespace:    default
Priority:     0
Node:         s-113-2-35/10.113.2.35
Start Time:   Tue, 22 Nov 2022 22:02:12 +0800
Labels:       <none>
Annotations:  cni.projectcalico.org/containerID: dc4a26cafbe5e8d9ab86f863ec42735061cf67593330b8cdf54eac56451f3bfd
              cni.projectcalico.org/podIP:
              cni.projectcalico.org/podIPs:
              k8s.v1.cni.cncf.io/networks: [{"name": "ib-sriov-network"}]
Status:       Pending
IP:
IPs:          <none>
Containers:
  my-test-ctr:
    Container ID:
    Image:         mellanox/rping-test
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
      sleep 1000000

    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Limits:
      mellanox.com/mlnx_sriov_rdma_ib:  1
    Requests:
      mellanox.com/mlnx_sriov_rdma_ib:  1
    Environment:                        <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jw2sr (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  kube-api-access-jw2sr:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                  Age        From               Message
  ----     ------                  ----       ----               -------
  Normal   Scheduled               <invalid>  default-scheduler  Successfully assigned default/my-test-pod to s-113-2-35
  Normal   AddedInterface          <invalid>  multus             Add eth0 [10.42.0.219/32] from k8s-pod-network
  Warning  FailedCreatePodSandBox  <invalid>  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "dc4a26cafbe5e8d9ab86f863ec42735061cf67593330b8cdf54eac56451f3bfd" network for pod "my-test-pod": networkPlugin cni failed to set up pod "my-test-pod_default" network: [default/my-test-pod/:sriov-network]: error adding container to network "sriov-network": infiniBand SRI-OV CNI failed to configure VF "VF ib2 GUID is not valid, HardwareAddr:00:00:00:e7:fe:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00, guid:00:00:00:00:00:00:00:00", failed to clean up sandbox container "dc4a26cafbe5e8d9ab86f863ec42735061cf67593330b8cdf54eac56451f3bfd" network for pod "my-test-pod": networkPlugin cni failed to teardown pod "my-test-pod_default" network: delegateDel: error invoking DelegateDel - "ib-sriov": error in getting result from DelNetwork: error reading cached NetConf in /var/lib/cni/ib-sriov with name dc4a26cafbe5e8d9ab86f863ec42735061cf67593330b8cdf54eac56451f3bfd-net1]
  Normal   SandboxChanged          <invalid>  kubelet            Pod sandbox changed, it will be killed and re-created.
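
For reference, the 20-octet HardwareAddr in the error above is an IPoIB link-layer address, and its last 8 octets carry the port GUID, so an all-zero tail means the VF has no GUID assigned yet. Below is a minimal shell sketch of that extraction. The function names `guid_from_hwaddr` and `guid_is_usable` are hypothetical helpers, not part of ib-sriov-cni, and treating all-zero or all-ones GUIDs as "not valid" is an assumption about what the plugin checks.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of ib-sriov-cni.

# Print the last 8 of the 20 colon-separated octets of an IPoIB hardware address.
guid_from_hwaddr() {
  n=$(printf '%s\n' "$1" | awk -F: '{ print NF }')
  [ "$n" -eq 20 ] || return 1
  printf '%s\n' "$1" | cut -d: -f13-20
}

# Succeed unless the GUID is empty, all zeros, or all ones
# (assumed meaning of "GUID is not valid").
guid_is_usable() {
  case "$(printf '%s' "$1" | tr 'A-F' 'a-f')" in
    '' | 00:00:00:00:00:00:00:00 | ff:ff:ff:ff:ff:ff:ff:ff) return 1 ;;
    *) return 0 ;;
  esac
}

# HardwareAddr copied from the error message above.
hw=00:00:00:e7:fe:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
guid=$(guid_from_hwaddr "$hw") || guid=""
if guid_is_usable "$guid"; then
  echo "GUID $guid looks usable"
else
  echo "GUID '$guid' is unset or malformed"
fi
```

On the node itself, the same address can be read with `ip link show` on the VF interface (ib2 in the error above).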
zhutong196 commented 12 months ago

I hit the same problem. You first need to configure the VF's node GUID and port GUID, then run ibdev2netdev -v to check that the VF's status is Up. After that you can use the VF normally.
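
One way this is commonly done is with the `node_guid` and `port_guid` options of `ip link set ... vf`, followed by rebinding the VF to its driver. The sketch below only prints the commands so they can be reviewed first. The function name `vf_guid_cmds`, the PF name `ib0`, the VF index, and the GUID value are all made-up placeholders. The PCI address is one of the VFs from the device plugin log above, but which VF index it maps to is an assumption. The rebind step reflects that an mlx5 VF usually has to be reloaded before it picks up a new GUID.

```shell
#!/bin/sh
# Print (do not run) the commands that would assign a node and port GUID to one VF.
# Arguments: PF netdev name, VF index, VF PCI address, GUID (all placeholders here).
vf_guid_cmds() {
  pf="$1"; vf="$2"; pci="$3"; guid="$4"
  cat <<EOF
ip link set dev $pf vf $vf node_guid $guid
ip link set dev $pf vf $vf port_guid $guid
echo $pci > /sys/bus/pci/drivers/mlx5_core/unbind
echo $pci > /sys/bus/pci/drivers/mlx5_core/bind
ibdev2netdev -v
EOF
}

# Example with made-up values; adjust them to the host before use.
vf_guid_cmds ib0 0 0000:c3:00.2 11:22:33:44:55:66:77:01
```

After checking the printed commands, they can be run as root on the node, for example by piping the output to `sh`. Each VF needs its own unique GUID.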

cyclinder commented 10 months ago

Hey @zhutong196, could you tell me how to configure the VF node GUID and port GUID?