k8snetworkplumbingwg / sriov-network-operator

Operator for provisioning and configuring SR-IOV CNI plugin and device plugin
Apache License 2.0

infiniBand SRI-OV CNI failed to configure VF "VF ib2 GUID is not valid" #307

Open seb-835 opened 2 years ago

seb-835 commented 2 years ago

Hi Team,

I think I am really close to getting this working, but I get the following when describing my test pod:

 Normal   AddedInterface          2s    multus             Add eth0 [10.233.117.195/32] from cni0
  Warning  FailedCreatePodSandBox  1s    kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "88aee8b5e04d60a0fdfe9437888521e2f49a170b67ce574adf571f05f644bf74": [default/test-sriov-ib-pod:example-sriov-ib-network]: error adding container to network "example-sriov-ib-network": infiniBand SRI-OV CNI failed to configure VF "VF ib2 GUID is not valid"

I use the following manifests to test:

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: infiniband-sriov
  namespace: cattle-sriov-system 
spec:
  deviceType: netdevice
  mtu: 1500
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  nicSelector:
    vendor: "15b3"
    deviceID: "101c"
  linkType: ib
  isRdma: true
  numVfs: 4 
  priority: 90
  resourceName: mlnxnics
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovIBNetwork
metadata:
  name: example-sriov-ib-network
  namespace: cattle-sriov-system
spec:
  ipam: |
    {
     "type": "whereabouts",
     "range": "192.168.5.225/28"
    }
  resourceName: mlnxnics
  linkState: enable
  networkNamespace: default
---
apiVersion: v1
kind: Pod
metadata:
  name: test-sriov-ib-pod
  annotations:
    k8s.v1.cni.cncf.io/networks: example-sriov-ib-network
spec:
  containers:
    - name: test-sriov-ib-pod
      image: centos/tools
      imagePullPolicy: IfNotPresent
      command:
        - sh
        - -c
        - sleep inf
      securityContext:
        capabilities:
          add: [ "IPC_LOCK" ]
      resources:
        requests:
          rancher.io/mlnxnics: "1"
        limits:
          rancher.io/mlnxnics: "1"

Can you give me some advice on how to fix it? Thanks a lot.

SchSeba commented 1 year ago

Hi @e0ne @seb-835, any update on this issue, or can we close it?

seb-835 commented 1 year ago

No update on this case; I am still having the issue. Any help solving it would be appreciated.

adrianchiris commented 1 year ago

Greetings!

After the node's SR-IOV configuration has been applied by the config daemon, and before an IB workload is scheduled on the node, what are the VFs' hardware addresses?

According to the CNI failure, it seems they are all zeroes or all ones:

https://github.com/k8snetworkplumbingwg/ib-sriov-cni/blob/5473e6b97fa532233221a5e2ee06aa182457ffc0/pkg/sriov/sriov.go#L259

What OS and kernel are you using? Maybe the kernel does not support getting/setting the VF port and node GUID.
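The condition described above ("all zeroes or ones") can be sketched as a small check. This is only an illustration, not the CNI's actual code. It assumes the VF shows up as an IPoIB netdevice with a 20-byte hardware address whose last 8 bytes carry the GUID. The function names are hypothetical, and the sample addresses in the usage note are made up.

```python
def parse_hw_addr(addr: str) -> bytes:
    """Convert a colon-separated hex string (e.g. from `ip link show`) to bytes."""
    return bytes(int(part, 16) for part in addr.strip().split(":"))


def guid_from_hw_addr(addr: str) -> bytes:
    """Return the trailing 8 bytes of the hardware address.

    Assumption: for an IPoIB device these 8 bytes are the GUID.
    """
    raw = parse_hw_addr(addr)
    if len(raw) < 8:
        raise ValueError(f"hardware address too short: {addr!r}")
    return raw[-8:]


def is_all_zeroes_or_ones(guid: bytes) -> bool:
    """True when every byte is 0x00 or every byte is 0xFF."""
    return all(b == 0x00 for b in guid) or all(b == 0xFF for b in guid)


def looks_unconfigured(addr: str) -> bool:
    """True when the address suggests the VF GUID was never set."""
    return is_all_zeroes_or_ones(guid_from_hw_addr(addr))
```

For example, `looks_unconfigured(":".join(["00"] * 20))` flags an all-zero address, while an address ending in a made-up GUID such as `02:11:22:33:44:55:66:77` does not. Running this against each VF's address on the node would answer the question asked above.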

adrianchiris commented 1 year ago

In the sriov-network-config-daemon logs, do you see an error after the "setVfGuid()" log message?

Can you attach the sriov-network-config-daemon logs from when it tries to configure SR-IOV for the node?

fu7100 commented 1 year ago

I have the same problem and don't know how to solve it. If anyone knows how, please contact me; my email is fu7100@gmail.com

SchSeba commented 1 year ago

Hi @seb-835 @fu7100, any update on this issue? We are waiting for some logs. If you manage to make it work, let me know and I will close this issue. Thanks!

frye233 commented 1 year ago

I have the same problem: "infiniBand SRI-OV CNI failed to configure VF "VF ib9 GUID is not valid"". I worked around it by manually configuring the node, port, and policy of the VF. However, I am puzzled: I expected the plugin to configure the VF's information automatically rather than requiring me to do it by hand. What is the reason for this? Can you help me solve it? Thank you very much.

cumulus-joeyyang commented 11 months ago

Hi @frye233

Could you tell me how you manually configured the node/port GUID of the VF? I have the same issue with a raw ib-sriov CNI and device plugin deployment. In addition, the VFs I created all remain DOWN, and I don't know how to bring them up even though the IB PF is UP. Thanks.
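The exact commands used for the manual workaround were not posted in this thread. As a hedged sketch, one way to set a VF's node and port GUID by hand is iproute2's `ip link set dev <PF> vf <N> node_guid ... / port_guid ... / state ...`. The helper below only builds those command lines and does not run them. The helper name, the PF name, and the GUID values are placeholders, and whether `state auto` brings a DOWN VF up on a given setup is an assumption.

```python
import re

# Eight colon-separated hex bytes, e.g. 02:11:22:33:44:55:66:77
_GUID_RE = re.compile(r"^([0-9a-fA-F]{2}:){7}[0-9a-fA-F]{2}$")


def build_vf_guid_commands(pf: str, vf_index: int,
                           node_guid: str, port_guid: str) -> list[list[str]]:
    """Build (but do not execute) `ip link` commands for one VF.

    Hypothetical helper: it validates the inputs and returns argv lists
    that could be passed to subprocess.run() as root on the node.
    """
    if vf_index < 0:
        raise ValueError("vf_index must be non-negative")
    for guid in (node_guid, port_guid):
        if not _GUID_RE.match(guid):
            raise ValueError(f"not an 8-byte colon-separated GUID: {guid!r}")

    base = ["ip", "link", "set", "dev", pf, "vf", str(vf_index)]
    return [
        base + ["node_guid", node_guid],
        base + ["port_guid", port_guid],
        # Assumption: let the VF link state follow the PF's state.
        base + ["state", "auto"],
    ]


if __name__ == "__main__":
    # Placeholder PF name and GUIDs, for illustration only.
    for cmd in build_vf_guid_commands(
            "ib2", 0, "02:11:22:33:44:55:66:77", "02:11:22:33:44:55:66:78"):
        print(" ".join(cmd))
```

Printing the commands first makes it easy to review them before running anything on the node. GUIDs should be unique within the fabric, and depending on the setup the subnet manager may also need to accept the new GUIDs before the VF link becomes active.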

SchSeba commented 6 months ago

Any update on this issue, or can we close it?