k8snetworkplumbingwg / sriov-network-operator

Operator for provisioning and configuring SR-IOV CNI plugin and device plugin

VFs are not created in v1.4.0 #786

Open jslouisyou opened 1 month ago

jslouisyou commented 1 month ago

Hi, I'm facing an issue when creating VFs with v1.4.0 - the IB devices disappear at the end of VF creation (it works in v1.3.0, btw).

I used the same configuration (e.g. SriovNetworkNodePolicy) for creating VFs in both versions.

Here are the SriovNetworkNodePolicy resources that I used:

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-gpu2-ib2
  namespace: sriov-network-operator
spec:
  isRdma: true
  linkType: ib
  nicSelector:
    deviceID: "1021"
    pfNames:
    - ibp157s0
    vendor: 15b3
  nodeSelector:
    node-role.kubernetes.io/gpu: ""
  numVfs: 8
  priority: 10
  resourceName: gpu2_mlnx_ib2
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-gpu2-ib3
  namespace: sriov-network-operator
spec:
  isRdma: true
  linkType: ib
  nicSelector:
    deviceID: "1021"
    pfNames:
    - ibp211s0
    vendor: 15b3
  nodeSelector:
    node-role.kubernetes.io/gpu: ""
  numVfs: 8
  priority: 10
  resourceName: gpu2_mlnx_ib3
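
One way to confirm the operator picked up these policies is to inspect the node's SriovNetworkNodeState (a sketch; <node-name> is a placeholder for one of my H100 nodes):

$ kubectl -n sriov-network-operator get sriovnetworknodestates
$ kubectl -n sriov-network-operator get sriovnetworknodestates <node-name> -o yaml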

And I'm using H100 nodes with ConnectX-7 IB adapters:

$ mst status -v
MST modules:
------------
    MST PCI module is not loaded
    MST PCI configuration module loaded
PCI devices:
------------
DEVICE_TYPE             MST                           PCI       RDMA            NET                       NUMA  
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf5      e5:00.0   mlx5_5          net-ibp229s0              1  
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf4      d3:00.0   mlx5_4          net-ibp211s0              1  
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf3      c1:00.0   mlx5_3          net-ibp193s0              1  
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf2      9d:00.0   mlx5_2          net-ibp157s0              1  
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf1      54:00.0   mlx5_1          net-ibp84s0               0   
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf0      41:00.0   mlx5_0          net-ibp65s0               0

$ lspci -s 41:00.0 -vvn
41:00.0 0207: 15b3:1021
    Subsystem: 15b3:0041
    Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0
    Interrupt: pin A routed to IRQ 18
    NUMA node: 0
    Region 0: Memory at 23e044000000 (64-bit, prefetchable) [size=32M]
    Expansion ROM at <ignored> [disabled]
    Capabilities: [60] Express (v2) Endpoint, MSI 00
        DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 25.000W
        DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
            RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
            MaxPayload 256 bytes, MaxReadReq 4096 bytes
        DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
        LnkCap: Port #0, Speed 32GT/s, Width x16, ASPM not supported
            ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
        LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 32GT/s (ok), Width x16 (ok)
            TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range ABC, TimeoutDis+ NROPrPrP- LTR-
             10BitTagComp+ 10BitTagReq+ OBFF Not Supported, ExtFmt- EETLPPrefix-
             EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
             FRS- TPHComp- ExtTPHComp-
             AtomicOpsCap: 32bit+ 64bit+ 128bitCAS+
        DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis- LTR- OBFF Disabled,
             AtomicOpsCtl: ReqEn+
        LnkCap2: Supported Link Speeds: 2.5-32GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
        LnkCtl2: Target Link Speed: 32GT/s, EnterCompliance- SpeedDis-
             Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
             Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
             EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
             Retimer- 2Retimers- CrosslinkRes: unsupported
    Capabilities: [48] Vital Product Data
        Product Name: Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter
        Read-only fields:
            [PN] Part number: 0RYMTY
            [EC] Engineering changes: A02
            [MN] Manufacture ID: 1028
            [SN] Serial number: IN0RYMTYJBNM43BRJ4KF
            [VA] Vendor specific: DSV1028VPDR.VER2.1
            [VB] Vendor specific: FFV28.39.10.02
            [VC] Vendor specific: NPY1
            [VD] Vendor specific: PMTD
            [VE] Vendor specific: NMVNvidia, Inc.
            [VH] Vendor specific: L1D0
            [VU] Vendor specific: IN0RYMTYJBNM43BRJ4KFMLNXS0D0F0 
            [RV] Reserved: checksum good, 0 byte(s) reserved
        End
    Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
        Vector table: BAR=0 offset=00002000
        PBA: BAR=0 offset=00003000
    Capabilities: [c0] Vendor Specific Information: Len=18 <?>
    Capabilities: [40] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0-,D1-,D2-,D3hot-,D3cold+)
        Status: D0 NoSoftRst+ PME-Enable+ DSel=0 DScale=0 PME-
    Capabilities: [100 v1] Advanced Error Reporting
        UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt: DLP+ SDES- TLP+ FCP+ CmpltTO+ CmpltAbrt+ UnxCmplt- RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
        CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
        CEMsk:  RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ AdvNonFatalErr+
        AERCap: First Error Pointer: 04, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
            MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
        HeaderLog: 00000000 00000000 00000000 00000000
    Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
        ARICap: MFVC- ACS-, Next Function: 0
        ARICtl: MFVC- ACS-, Function Group: 0
    Capabilities: [1c0 v1] Secondary PCI Express
        LnkCtl3: LnkEquIntrruptEn- PerformEqu-
        LaneErrStat: 0
    Capabilities: [320 v1] Lane Margining at the Receiver <?>
    Capabilities: [370 v1] Physical Layer 16.0 GT/s <?>
    Capabilities: [3b0 v1] Extended Capability ID 0x2a
    Capabilities: [420 v1] Data Link Feature <?>
    Kernel driver in use: mlx5_core
    Kernel modules: mlx5_core
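
Side note: the vendor/deviceID pair in the nicSelector above matches what lspci reports for these cards; listing every function the selector should match is a quick sanity check (plain lspci, not operator behavior):

$ lspci -nn -d 15b3:1021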

And I pulled the v1.3.0 and v1.4.0 Helm charts from oci://ghcr.io/k8snetworkplumbingwg/sriov-network-operator-chart; the image tags differ between them:

  1. v1.3.0

     images:
       operator: ghcr.io/k8snetworkplumbingwg/sriov-network-operator:v1.3.0
       sriovConfigDaemon: ghcr.io/k8snetworkplumbingwg/sriov-network-operator-config-daemon:v1.3.0
       sriovCni: ghcr.io/k8snetworkplumbingwg/sriov-cni:v2.8.0
       ibSriovCni: ghcr.io/k8snetworkplumbingwg/ib-sriov-cni:v1.1.1
       ovsCni: ghcr.io/k8snetworkplumbingwg/ovs-cni-plugin:v0.34.0
       sriovDevicePlugin: ghcr.io/k8snetworkplumbingwg/sriov-network-device-plugin:v3.7.0
       resourcesInjector: ghcr.io/k8snetworkplumbingwg/network-resources-injector:v1.6.0
       webhook: ghcr.io/k8snetworkplumbingwg/sriov-network-operator-webhook:v1.3.0

  2. v1.4.0

     images:
       operator: ghcr.io/k8snetworkplumbingwg/sriov-network-operator:v1.4.0
       sriovConfigDaemon: ghcr.io/k8snetworkplumbingwg/sriov-network-operator-config-daemon:v1.4.0
       sriovCni: ghcr.io/k8snetworkplumbingwg/sriov-cni:v2.8.1
       ibSriovCni: ghcr.io/k8snetworkplumbingwg/ib-sriov-cni:v1.1.1
       ovsCni: ghcr.io/k8snetworkplumbingwg/ovs-cni-plugin:v0.34.2
       rdmaCni: ghcr.io/k8snetworkplumbingwg/rdma-cni:v1.2.0
       sriovDevicePlugin: ghcr.io/k8snetworkplumbingwg/sriov-network-device-plugin:v3.7.0
       resourcesInjector: ghcr.io/k8snetworkplumbingwg/network-resources-injector:v1.6.0
       webhook: ghcr.io/k8snetworkplumbingwg/sriov-network-operator-webhook:v1.4.0
       metricsExporter: ghcr.io/k8snetworkplumbingwg/sriov-network-metrics-exporter:v1.1.0
       metricsExporterKubeRbacProxy: gcr.io/kubebuilder/kube-rbac-proxy:v0.15.0
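
(For completeness, the image lists above come from the chart defaults; assuming the chart versions track the release tags, they can be inspected with:)

$ helm show values oci://ghcr.io/k8snetworkplumbingwg/sriov-network-operator-chart --version 1.3.0
$ helm show values oci://ghcr.io/k8snetworkplumbingwg/sriov-network-operator-chart --version 1.4.0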

As you know, sriov-device-plugin pods are created when a SriovNetworkNodePolicy is deployed. After that, my H100 nodes' state annotation changed from sriovnetwork.openshift.io/state: Idle to sriovnetwork.openshift.io/state: Reboot_Required, and the nodes rebooted after some time.
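
(A quick way to watch that annotation flip, if it helps - <node-name> is a placeholder:)

$ kubectl get node <node-name> -o jsonpath='{.metadata.annotations.sriovnetwork\.openshift\.io/state}'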

But in v1.4.0, it seems the VFs were created at first, yet eventually they were not shown and even the PF disappeared. Here are the logs from dmesg:

[  115.692158] pci 0000:41:00.1: [15b3:101e] type 00 class 0x020700
[  115.692321] pci 0000:41:00.1: enabling Extended Tags
[  115.694112] mlx5_core 0000:41:00.1: enabling device (0000 -> 0002)
[  115.694789] mlx5_core 0000:41:00.1: firmware version: 28.39.1002
[  115.867939] mlx5_core 0000:41:00.1: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
[  115.867943] mlx5_core 0000:41:00.1: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
[  115.892812] pci 0000:41:00.2: [15b3:101e] type 00 class 0x020700
[  115.892967] pci 0000:41:00.2: enabling Extended Tags
[  115.894706] mlx5_core 0000:41:00.2: enabling device (0000 -> 0002)
[  115.895344] mlx5_core 0000:41:00.2: firmware version: 28.39.1002
[  115.895423] mlx5_core 0000:41:00.1 ibp65s0v0: renamed from ib0
[  116.065557] mlx5_core 0000:41:00.2: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
[  116.065561] mlx5_core 0000:41:00.2: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
[  116.090478] pci 0000:41:00.3: [15b3:101e] type 00 class 0x020700
[  116.090634] pci 0000:41:00.3: enabling Extended Tags
[  116.093559] mlx5_core 0000:41:00.3: enabling device (0000 -> 0002)
[  116.093993] mlx5_core 0000:41:00.2 ibp65s0v1: renamed from ib0
[  116.094189] mlx5_core 0000:41:00.3: firmware version: 28.39.1002
[  116.293582] mlx5_core 0000:41:00.3: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
[  116.293587] mlx5_core 0000:41:00.3: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
[  116.318209] pci 0000:41:00.4: [15b3:101e] type 00 class 0x020700
[  116.318368] pci 0000:41:00.4: enabling Extended Tags
[  116.320079] mlx5_core 0000:41:00.4: enabling device (0000 -> 0002)
[  116.320712] mlx5_core 0000:41:00.4: firmware version: 28.39.1002
[  116.320871] mlx5_core 0000:41:00.3 ibp65s0v2: renamed from ib0
.....
[  446.036867] mlx5_core 0000:41:01.0 ibp65s0v7: renamed from ib0
[  446.464555] mlx5_core 0000:41:00.0: mlx5_wait_for_pages:898:(pid 6868): Skipping wait for vf pages stage
[  448.848149] mlx5_core 0000:41:00.0: driver left SR-IOV enabled after remove                                               <----------- weird
[  449.108562] mlx5_core 0000:41:00.2: poll_health:955:(pid 0): Fatal error 3 detected
[  449.108602] mlx5_core 0000:41:00.4: poll_health:955:(pid 0): Fatal error 3 detected
[  449.108620] mlx5_core 0000:41:00.2: mlx5_health_try_recover:375:(pid 1478): handling bad device here
[  449.108627] mlx5_core 0000:41:00.2: mlx5_handle_bad_state:326:(pid 1478): starting teardown
[  449.108629] mlx5_core 0000:41:00.2: mlx5_error_sw_reset:277:(pid 1478): start
[  449.108646] mlx5_core 0000:41:00.4: mlx5_health_try_recover:375:(pid 2283): handling bad device here
[  449.108660] mlx5_core 0000:41:00.4: mlx5_handle_bad_state:326:(pid 2283): starting teardown
[  449.108661] mlx5_core 0000:41:00.4: mlx5_error_sw_reset:277:(pid 2283): start
[  449.108672] mlx5_core 0000:41:00.2: mlx5_error_sw_reset:310:(pid 1478): end
[  449.108694] mlx5_core 0000:41:00.4: mlx5_error_sw_reset:310:(pid 2283): end
[  449.876577] mlx5_core 0000:41:00.5: poll_health:955:(pid 0): Fatal error 3 detected
[  449.876642] mlx5_core 0000:41:00.5: mlx5_health_try_recover:375:(pid 1000): handling bad device here
[  449.876649] mlx5_core 0000:41:00.5: mlx5_handle_bad_state:326:(pid 1000): starting teardown
[  449.876651] mlx5_core 0000:41:00.5: mlx5_error_sw_reset:277:(pid 1000): start
[  449.877266] mlx5_core 0000:41:00.5: mlx5_error_sw_reset:310:(pid 1000): end
[  450.381036] mlx5_core 0000:41:00.2: mlx5_health_try_recover:381:(pid 1478): starting health recovery flow

** The messages above are from when I pointed at ibp65s0 to create VFs - sorry for the confusion. This behavior happens regardless of the PF name.
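
To double-check from the host side, the VF count on a PF can be read straight from sysfs (standard kernel interface, nothing operator-specific):

$ cat /sys/class/net/ibp65s0/device/sriov_numvfs
$ ls -d /sys/class/net/ibp65s0/device/virtfn* 2>/dev/null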

After that, when I ran mst status -v again, the node couldn't even find the PF itself:

$ mst status -v
MST modules:
------------
    MST PCI module is not loaded
    MST PCI configuration module loaded
PCI devices:
------------
DEVICE_TYPE             MST                           PCI       RDMA            NET                       NUMA  
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf5      e5:00.0   mlx5_5          net-ibp229s0              1     
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf4      d3:00.0                                             1     <---- it goes empty
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf3      c1:00.0   mlx5_3          net-ibp193s0              1     
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf2      9d:00.0                                             1     <---- it goes empty
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf1      54:00.0   mlx5_1          net-ibp84s0               0     
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf0      41:00.0   mlx5_0          net-ibp65s0               0 
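
The missing PFs can also be checked directly against the PCI bus, and a rescan attempted (standard PCI sysfs; I'm not sure a rescan can recover a device from this fatal-error state):

$ lspci -s 9d:00.0
$ lspci -s d3:00.0
$ echo 1 | sudo tee /sys/bus/pci/rescan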

Do you know anything about this situation? Any pointers would be very helpful.

Thanks.

adrianchiris commented 3 weeks ago

Maybe it's related to: https://github.com/k8snetworkplumbingwg/sriov-network-operator/pull/797 ?