k8snetworkplumbingwg / sriov-network-operator

Operator for provisioning and configuring SR-IOV CNI plugin and device plugin
Apache License 2.0

SR-IOV Network Operator 4.15.0-202410010035 | when setting linkType: IB the NIC gets filtered out #795

Open bbenshab opened 4 weeks ago

bbenshab commented 4 weeks ago

When setting linkType: IB on a SriovNetworkNodePolicy, as in this example:

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: mlnx-port-1
  namespace: openshift-sriov-network-operator
spec:
  deviceType: netdevice
  ibverbs: true
  isRdma: true
  linkType: IB
  nicSelector:
    pfNames:
    - ibs3f0
    vendor: 15b3
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: 'true'
  numVfs: 1
  priority: 99
  rdma: true
  resourceName: port1

the NIC gets filtered out: the node reports openshift.io/port1 = 0 in its allocatable resources, as shown below.

oc get node intel-perf-27.perf.eng.bos2.dc.redhat.com -o json | jq .status.allocatable
{
  "cpu": "127500m",
  "ephemeral-storage": "213881594729",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "526694060Ki",
  "nvidia.com/gpu": "2",
  "openshift.io/port1": "0",
  "pods": "250",
  "rdma/rdma_shared_device_a": "63"
}
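A quick way to spot this symptom is to scan the allocatable map for SR-IOV resources that report zero. The following is a minimal Python sketch using only the standard library. The function name `zero_resources` and the `openshift.io/` prefix filter are assumptions made for illustration; the sample data is copied from the output above.

```python
import json

# Allocatable map copied from the `oc get node ... | jq .status.allocatable` output above.
ALLOCATABLE_JSON = """
{
  "cpu": "127500m",
  "ephemeral-storage": "213881594729",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "526694060Ki",
  "nvidia.com/gpu": "2",
  "openshift.io/port1": "0",
  "pods": "250",
  "rdma/rdma_shared_device_a": "63"
}
"""


def zero_resources(allocatable, prefix="openshift.io/"):
    """Return the names of resources under `prefix` whose allocatable count is "0"."""
    return sorted(
        name
        for name, quantity in allocatable.items()
        if name.startswith(prefix) and quantity == "0"
    )


if __name__ == "__main__":
    allocatable = json.loads(ALLOCATABLE_JSON)
    print(zero_resources(allocatable))  # prints ['openshift.io/port1']
```

The prefix filter keeps unrelated zero-valued entries such as the hugepages counters out of the result.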

The only workaround I found is to edit the ConfigMap:

oc edit configmap -n openshift-sriov-network-operator device-plugin-config

and then remove "linkTypes":["infiniband"], from:

apiVersion: v1
data:
  intel-perf-27.perf.eng.bos2.dc.redhat.com: '{"resourceList":[{"resourceName":"port1","selectors":{"vendors":["15b3"],"pfNames":["ibs3f0"],"linkTypes":["infiniband"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null},{"resourceName":"port2","selectors":{"vendors":["15b3"],"pfNames":["ibs3f1"],"linkTypes":["infiniband"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null}]}'
  perf-intel-6.perf.eng.bos2.dc.redhat.com: '{"resourceList":[{"resourceName":"port1","selectors":{"vendors":["15b3"],"pfNames":["ibs3f0"],"linkTypes":["infiniband"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null},{"resourceName":"port2","selectors":{"vendors":["15b3"],"pfNames":["ibs3f1"],"linkTypes":["infiniband"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null}]}'
kind: ConfigMap

However, the edit gets reset every 300 seconds.
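To make the workaround concrete, here is a Python sketch of the same edit applied to one node's value from the ConfigMap: parse the JSON, drop the `linkTypes` selector from each resource, and serialize it again. The helper name `strip_link_types` is hypothetical, and the sample value is reduced to the `port1` resource for brevity. This only illustrates the manual edit described above; it is not part of the operator or the device plugin.

```python
import json

# One node's value from device-plugin-config, reduced to the port1 resource for brevity.
NODE_VALUE = (
    '{"resourceList":[{"resourceName":"port1","selectors":{"vendors":["15b3"],'
    '"pfNames":["ibs3f0"],"linkTypes":["infiniband"],"IsRdma":true,'
    '"NeedVhostNet":false},"SelectorObj":null}]}'
)


def strip_link_types(node_value):
    """Return the node's JSON config with every "linkTypes" selector removed."""
    config = json.loads(node_value)
    for resource in config["resourceList"]:
        # pop() with a default is a no-op for resources that have no linkTypes.
        resource["selectors"].pop("linkTypes", None)
    # Compact separators match the single-line style used in the ConfigMap.
    return json.dumps(config, separators=(",", ":"))


if __name__ == "__main__":
    print(strip_link_types(NODE_VALUE))
```

Since the ConfigMap is reset periodically, as reported above, the result of this edit only lasts until the next reset.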

For reference:
NetworkAttachmentDefinition:

apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  annotations:
    k8s.v1.cni.cncf.io/resourceName: openshift.io/port1
  name: network-port-1
  namespace: default
spec:
  config: |-
    {
      "cniVersion": "0.3.1",
      "name": "network-port-1",
      "type": "ib-sriov",
      "logLevel": "info",
      "ipam": {
        "type": "whereabouts",
        "range": "192.168.1.2/24",
        "exclude": [
          "192.168.1.1",
          "192.168.1.2",
          "192.168.1.254",
          "192.168.1.255"
        ],
        "routes": [
          {
            "dst": "192.168.1.0/24"
          }
        ]
      }
    }


SriovIBNetwork:

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovIBNetwork
metadata:
  name: sriov-ib-network-port-1
  namespace: openshift-sriov-network-operator
spec:
  pfNames:

zeeke commented 4 weeks ago

Hi @bbenshab, can you please attach the SR-IOV logs and resources? Since this is an OpenShift cluster, you can collect them with:

oc adm must-gather -- /usr/bin/gather_sriov

Also, you stated this happens on 4.15.0. Is there any chance you can reproduce this issue with the latest sriov-network-operator version?

adrianchiris commented 3 weeks ago

Let's try to reproduce this with the sriov-network-operator from this repo.

It may also be related to: https://github.com/k8snetworkplumbingwg/sriov-network-operator/pull/797