k8snetworkplumbingwg / sriov-network-operator

Operator for provisioning and configuring SR-IOV CNI plugin and device plugin
Apache License 2.0

SRIOV PF got unbind instead of VF in case of IB link type #797

Closed heyvister1 closed 3 weeks ago

heyvister1 commented 4 weeks ago

Fixes the daemon's SR-IOV VF configuration, where the PF PCI address was unbound instead of the intended VF address when the IB link type is used.

When the SR-IOV configuration from the SriovNetworkNodePolicy below is applied to the VF devices, the PF PCI address is unbound instead of the VF address. This causes SR-IOV initialization to fail.
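
For context, the VF addresses that should have been targeted can be read from the PF's `virtfn*` symlinks in sysfs. The following is a minimal sketch of that lookup, not the code changed by this PR. The helper names `addrFromLink` and `listVFAddresses` and the VF addresses `0000:03:00.2` and `0000:03:00.3` are hypothetical; the demo runs against a temporary directory that mimics the sysfs layout rather than the real `/sys`.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
)

// addrFromLink returns the PCI address a sysfs symlink target points at,
// e.g. "../0000:03:00.2" -> "0000:03:00.2". (Hypothetical helper.)
func addrFromLink(target string) string {
	return filepath.Base(target)
}

// listVFAddresses resolves the virtfn* symlinks inside a PF's sysfs device
// directory and returns the PCI addresses of its VFs. (Hypothetical helper.)
func listVFAddresses(pfDir string) ([]string, error) {
	links, err := filepath.Glob(filepath.Join(pfDir, "virtfn*"))
	if err != nil {
		return nil, err
	}
	addrs := make([]string, 0, len(links))
	for _, link := range links {
		target, err := os.Readlink(link)
		if err != nil {
			return nil, err
		}
		addrs = append(addrs, addrFromLink(target))
	}
	sort.Strings(addrs)
	return addrs, nil
}

func main() {
	// Build a throwaway tree that mimics a PF with two VFs.
	root, err := os.MkdirTemp("", "sysfs-sketch")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(root)

	pfDir := filepath.Join(root, "0000:03:00.0")
	if err := os.MkdirAll(pfDir, 0o755); err != nil {
		panic(err)
	}
	// Assumed VF addresses, for illustration only.
	for i, vf := range []string{"0000:03:00.2", "0000:03:00.3"} {
		link := filepath.Join(pfDir, fmt.Sprintf("virtfn%d", i))
		if err := os.Symlink("../"+vf, link); err != nil {
			panic(err)
		}
	}

	vfs, err := listVFAddresses(pfDir)
	if err != nil {
		panic(err)
	}
	fmt.Println("VFs to unbind (the PF itself is not in this list):", vfs)
}
```

Any unbind step driven from such a list would never see the PF address `0000:03:00.0`, which is the address that shows up in the daemon log.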

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: ib-policy
  namespace: nvidia-network-operator
spec:
  nodeSelector:
    node-role.kubernetes.io/worker: ""
    feature.node.kubernetes.io/network-sriov.capable: "true"
  resourceName: ib_vfs
  priority: 10
  numVfs: 2
  nicSelector:
    rootDevices:
      - '0000:03:00.0'
    vendor: 15b3
  deviceType: netdevice
  eSwitchMode: legacy
  linkType: ib

sriov-config-daemon log snippet:

sriov/sriov.go:487  Unbind(): unbind device driver for device   {"device": "0000:03:00.0"}
2024-10-25T07:32:37.001173398Z  LEVEL(-2)   kernel/kernel.go:116    UnbindDriverByBusAndDevice(): unbind device driver for device   {"bus": "pci", "device": "0000:03:00.0"}
2024-10-25T07:32:37.00122385Z   LEVEL(-2)   kernel/kernel.go:228    getDriverByBusAndDevice(): driver for device    {"bus": "pci", "device": "0000:03:00.0", "driver": "../../../../bus/pci/drivers/mlx5_core"}
2024-10-25T07:32:37.001263521Z  LEVEL(-2)   kernel/kernel.go:236    unbindDriver(): unbind from driver  {"bus": "pci", "device": "0000:03:00.0", "driver": "mlx5_core"}
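
The log shows the driver being resolved from a sysfs symlink (`../../../../bus/pci/drivers/mlx5_core`) and then reduced to the driver name `mlx5_core` before the unbind. A rough sketch of those two steps follows, assuming the standard sysfs convention that a device is unbound by writing its address to the driver's `unbind` file. The helper names `driverNameFromLink` and `unbindFilePath` are hypothetical and are not the operator's functions; the sketch only computes the path and does not write to sysfs.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// driverNameFromLink reduces the target of a device's "driver" symlink to the
// bare driver name. (Hypothetical helper, not the operator's implementation.)
func driverNameFromLink(target string) string {
	return filepath.Base(target)
}

// unbindFilePath returns the sysfs file a device address is written to in
// order to unbind that device from the given driver. (Hypothetical helper.)
func unbindFilePath(sysRoot, bus, driver string) string {
	return filepath.Join(sysRoot, "bus", bus, "drivers", driver, "unbind")
}

func main() {
	// Symlink target and device address taken from the log snippet above.
	link := "../../../../bus/pci/drivers/mlx5_core"
	device := "0000:03:00.0"

	driver := driverNameFromLink(link)
	path := unbindFilePath("/sys", "pci", driver)

	// Printing only: writing `device` to `path` is what detaches the device.
	fmt.Printf("would write %q to %s\n", device, path)
}
```

Because the address written here is the PF's, the PF is detached from `mlx5_core`.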

IB interfaces after the MOFED container has installed the mlx5 drivers (lshw output):

Bus info          Device     Class          Description
=======================================================
pci@0000:01:00.0  eno1       network        I350 Gigabit Network Connection
pci@0000:01:00.1  eno2       network        I350 Gigabit Network Connection
pci@0000:03:00.0  ibp3s0f0   network        MT27800 Family [ConnectX-5]
pci@0000:03:00.1  ibp3s0f1   network        MT27800 Family [ConnectX-5]
pci@0000:81:00.0             network        MT27520 Family [ConnectX-3 Pro]

The ibp3s0f0 IB interface is gone after the node restart:

Bus info          Device     Class          Description
=======================================================
pci@0000:01:00.0  eno1       network        I350 Gigabit Network Connection
pci@0000:01:00.1  eno2       network        I350 Gigabit Network Connection
pci@0000:03:00.0             network        MT27800 Family [ConnectX-5]
pci@0000:03:00.1  ibp3s0f1   network        MT27800 Family [ConnectX-5]
pci@0000:81:00.0             network        MT27520 Family [ConnectX-3 Pro]
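
The same symptom can be checked without lshw: a PCI network device that is bound to its driver exposes its interface names under a `net/` subdirectory of its sysfs device directory, and that subdirectory is absent once the device is unbound. A minimal sketch of such a check follows; the helper name `netdevNames` is hypothetical, and the demo uses a temporary directory that mimics the state shown in the second listing rather than the real `/sys`.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// netdevNames lists the network interface names exposed under a PCI device's
// sysfs directory. A missing "net" directory yields an empty result, which is
// what an unbound device looks like. (Hypothetical helper.)
func netdevNames(deviceDir string) ([]string, error) {
	entries, err := os.ReadDir(filepath.Join(deviceDir, "net"))
	if errors.Is(err, fs.ErrNotExist) {
		return nil, nil
	}
	if err != nil {
		return nil, err
	}
	names := make([]string, 0, len(entries))
	for _, e := range entries {
		names = append(names, e.Name())
	}
	return names, nil
}

func main() {
	// Mimic the post-restart state: 03:00.0 has no netdev, 03:00.1 still does.
	root, err := os.MkdirTemp("", "sysfs-sketch")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(root)

	unbound := filepath.Join(root, "0000:03:00.0")
	bound := filepath.Join(root, "0000:03:00.1")
	if err := os.MkdirAll(unbound, 0o755); err != nil {
		panic(err)
	}
	if err := os.MkdirAll(filepath.Join(bound, "net", "ibp3s0f1"), 0o755); err != nil {
		panic(err)
	}

	for _, dir := range []string{unbound, bound} {
		names, err := netdevNames(dir)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s netdevs: %v\n", filepath.Base(dir), names)
	}
}
```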

github-actions[bot] commented 4 weeks ago

Thanks for your PR. To run vendor CIs, maintainers can use one of:

coveralls commented 4 weeks ago

Pull Request Test Coverage Report for Build 11541227557

Details

Files with Coverage Reduction              New Missed Lines  %
controllers/generic_network_controller.go  5                 74.38%

Totals
Change from base Build 11441395281: -0.03%
Covered Lines: 6656
Relevant Lines: 14799
