k8snetworkplumbingwg / sriov-network-operator

Operator for provisioning and configuring SR-IOV CNI plugin and device plugin
Apache License 2.0

All images use `latest` when installing v1.4.0 #784

Closed jslouisyou closed 1 month ago

jslouisyou commented 1 month ago

Hello,

When I try to deploy the latest sriov-network-operator version (e.g. v1.4.0), the operator creates the sriov-network-config-daemon Pods as expected, but all images within those Pods use the latest tag. (The tag is omitted, and as I understand it, an omitted tag defaults to latest; please correct me if I'm wrong.)
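For anyone reproducing this, the resolved image references on the daemon Pods can be listed with something like the following (the namespace and label selector are assumptions based on a default install):

kubectl -n sriov-network-operator get pods -l app=sriov-network-config-daemon \
  -o jsonpath='{range .items[*].spec.containers[*]}{.image}{"\n"}{end}'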

I found that image tags aren't assigned in either of these places:

  1. Helm chart https://github.com/k8snetworkplumbingwg/sriov-network-operator/blob/9dbf2b1b5fd1ff7e836aff169b8aabf020a2840e/deployment/sriov-network-operator-chart/values.yaml#L104-L115

  2. Shell script https://github.com/k8snetworkplumbingwg/sriov-network-operator/blob/9dbf2b1b5fd1ff7e836aff169b8aabf020a2840e/hack/env.sh#L1-L14

But in v1.2.0, image tags were set: https://github.com/k8snetworkplumbingwg/sriov-network-operator/blob/815fd134ba8000756791051fca60179ec66ddb46/hack/env.sh#L1-L20

In this case, is it intended that all containers use the latest image? If not, could you please set proper tags for all images?

Thanks.

jslouisyou commented 1 month ago

Hi, I was able to pull the Helm chart with helm pull oci://ghcr.io/k8snetworkplumbingwg/sriov-network-operator-chart --version 1.4.0, and it seems that image tags for all containers are set:

images:
  operator: ghcr.io/k8snetworkplumbingwg/sriov-network-operator:v1.4.0
  sriovConfigDaemon: ghcr.io/k8snetworkplumbingwg/sriov-network-operator-config-daemon:v1.4.0
  sriovCni: ghcr.io/k8snetworkplumbingwg/sriov-cni:v2.8.1
  ibSriovCni: ghcr.io/k8snetworkplumbingwg/ib-sriov-cni:v1.1.1
  ovsCni: ghcr.io/k8snetworkplumbingwg/ovs-cni-plugin:v0.34.2
  rdmaCni: ghcr.io/k8snetworkplumbingwg/rdma-cni:v1.2.0
  sriovDevicePlugin: ghcr.io/k8snetworkplumbingwg/sriov-network-device-plugin:v3.7.0
  resourcesInjector: ghcr.io/k8snetworkplumbingwg/network-resources-injector:v1.6.0
  webhook: ghcr.io/k8snetworkplumbingwg/sriov-network-operator-webhook:v1.4.0
  metricsExporter: ghcr.io/k8snetworkplumbingwg/sriov-network-metrics-exporter:v1.1.0
  metricsExporterKubeRbacProxy: gcr.io/kubebuilder/kube-rbac-proxy:v0.15.0

I think I can use these images for v1.4.0. Could you please confirm that the above images are set up correctly?
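For reference, the chart values can also be inspected without pulling the chart, assuming a Helm version with OCI support (3.8 or later):

helm show values oci://ghcr.io/k8snetworkplumbingwg/sriov-network-operator-chart --version 1.4.0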

zeeke commented 1 month ago

Hi @jslouisyou, I think the point here is that we can no longer deploy a tagged release by checking out the source code. Since the Helm package is published when tagging, I think the helm pull ... command is enough for the job.
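As a rough sketch (untested), installing straight from the OCI registry should also work with a recent Helm, so the intermediate pull step is optional:

helm install -n sriov-network-operator --create-namespace sriov-network-operator \
  oci://ghcr.io/k8snetworkplumbingwg/sriov-network-operator-chart --version 1.4.0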

Images look correct to me.

Are you experiencing any other issues during the deployment?

jslouisyou commented 1 month ago

Hi @zeeke, I'm facing an issue while creating VFs in v1.4.0: IB devices disappear at the end of VF creation (it works in v1.3.0, by the way). First of all, I think the issue below is quite different from this thread, so please let me know if I should create a separate issue.

I used the same configuration (e.g. SriovNetworkNodePolicy) for creating VFs in both versions.

Here's SriovNetworkNodePolicy that I used:

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-gpu2-ib2
  namespace: sriov-network-operator
spec:
  isRdma: true
  linkType: ib
  nicSelector:
    deviceID: "1021"
    pfNames:
    - ibp157s0
    vendor: 15b3
  nodeSelector:
    node-role.kubernetes.io/gpu: ""
  numVfs: 8
  priority: 10
  resourceName: gpu2_mlnx_ib2
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-gpu2-ib3
  namespace: sriov-network-operator
spec:
  isRdma: true
  linkType: ib
  nicSelector:
    deviceID: "1021"
    pfNames:
    - ibp211s0
    vendor: 15b3
  nodeSelector:
    node-role.kubernetes.io/gpu: ""
  numVfs: 8
  priority: 10
  resourceName: gpu2_mlnx_ib3
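
If it helps, the per-node result of applying these policies can be inspected through the SriovNetworkNodeState object that the operator maintains (the node name below is a placeholder):

kubectl -n sriov-network-operator get sriovnetworknodestates <node-name> -o yaml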

And I'm using an H100 node with ConnectX-7 IB adapters:

$ mst status -v
MST modules:
------------
    MST PCI module is not loaded
    MST PCI configuration module loaded
PCI devices:
------------
DEVICE_TYPE             MST                           PCI       RDMA            NET                       NUMA  
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf5      e5:00.0   mlx5_5          net-ibp229s0              1  
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf4      d3:00.0   mlx5_4          net-ibp211s0              1  
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf3      c1:00.0   mlx5_3          net-ibp193s0              1  
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf2      9d:00.0   mlx5_2          net-ibp157s0              1  
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf1      54:00.0   mlx5_1          net-ibp84s0               0   
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf0      41:00.0   mlx5_0          net-ibp65s0               0

$ lspci -s 41:00.0 -vvn
41:00.0 0207: 15b3:1021
    Subsystem: 15b3:0041
    Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0
    Interrupt: pin A routed to IRQ 18
    NUMA node: 0
    Region 0: Memory at 23e044000000 (64-bit, prefetchable) [size=32M]
    Expansion ROM at <ignored> [disabled]
    Capabilities: [60] Express (v2) Endpoint, MSI 00
        DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 25.000W
        DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
            RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
            MaxPayload 256 bytes, MaxReadReq 4096 bytes
        DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
        LnkCap: Port #0, Speed 32GT/s, Width x16, ASPM not supported
            ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
        LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 32GT/s (ok), Width x16 (ok)
            TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range ABC, TimeoutDis+ NROPrPrP- LTR-
             10BitTagComp+ 10BitTagReq+ OBFF Not Supported, ExtFmt- EETLPPrefix-
             EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
             FRS- TPHComp- ExtTPHComp-
             AtomicOpsCap: 32bit+ 64bit+ 128bitCAS+
        DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis- LTR- OBFF Disabled,
             AtomicOpsCtl: ReqEn+
        LnkCap2: Supported Link Speeds: 2.5-32GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
        LnkCtl2: Target Link Speed: 32GT/s, EnterCompliance- SpeedDis-
             Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
             Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
             EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
             Retimer- 2Retimers- CrosslinkRes: unsupported
    Capabilities: [48] Vital Product Data
        Product Name: Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter
        Read-only fields:
            [PN] Part number: 0RYMTY
            [EC] Engineering changes: A02
            [MN] Manufacture ID: 1028
            [SN] Serial number: IN0RYMTYJBNM43BRJ4KF
            [VA] Vendor specific: DSV1028VPDR.VER2.1
            [VB] Vendor specific: FFV28.39.10.02
            [VC] Vendor specific: NPY1
            [VD] Vendor specific: PMTD
            [VE] Vendor specific: NMVNvidia, Inc.
            [VH] Vendor specific: L1D0
            [VU] Vendor specific: IN0RYMTYJBNM43BRJ4KFMLNXS0D0F0 
            [RV] Reserved: checksum good, 0 byte(s) reserved
        End
    Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
        Vector table: BAR=0 offset=00002000
        PBA: BAR=0 offset=00003000
    Capabilities: [c0] Vendor Specific Information: Len=18 <?>
    Capabilities: [40] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0-,D1-,D2-,D3hot-,D3cold+)
        Status: D0 NoSoftRst+ PME-Enable+ DSel=0 DScale=0 PME-
    Capabilities: [100 v1] Advanced Error Reporting
        UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt: DLP+ SDES- TLP+ FCP+ CmpltTO+ CmpltAbrt+ UnxCmplt- RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
        CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
        CEMsk:  RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ AdvNonFatalErr+
        AERCap: First Error Pointer: 04, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
            MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
        HeaderLog: 00000000 00000000 00000000 00000000
    Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
        ARICap: MFVC- ACS-, Next Function: 0
        ARICtl: MFVC- ACS-, Function Group: 0
    Capabilities: [1c0 v1] Secondary PCI Express
        LnkCtl3: LnkEquIntrruptEn- PerformEqu-
        LaneErrStat: 0
    Capabilities: [320 v1] Lane Margining at the Receiver <?>
    Capabilities: [370 v1] Physical Layer 16.0 GT/s <?>
    Capabilities: [3b0 v1] Extended Capability ID 0x2a
    Capabilities: [420 v1] Data Link Feature <?>
    Kernel driver in use: mlx5_core
    Kernel modules: mlx5_core
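
As a sanity check on the PF, its SR-IOV capability and current VF count can be read from sysfs (the PCI address is the one from the lspci output above):

cat /sys/bus/pci/devices/0000:41:00.0/sriov_totalvfs
cat /sys/bus/pci/devices/0000:41:00.0/sriov_numvfs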

And I pulled the v1.3.0 and v1.4.0 Helm charts from oci://ghcr.io/k8snetworkplumbingwg/sriov-network-operator-chart, and the image tags differ (a quicker way to compare is sketched after the two lists):

  1. v1.3.0

    images:
      operator: ghcr.io/k8snetworkplumbingwg/sriov-network-operator:v1.3.0
      sriovConfigDaemon: ghcr.io/k8snetworkplumbingwg/sriov-network-operator-config-daemon:v1.3.0
      sriovCni: ghcr.io/k8snetworkplumbingwg/sriov-cni:v2.8.0
      ibSriovCni: ghcr.io/k8snetworkplumbingwg/ib-sriov-cni:v1.1.1
      ovsCni: ghcr.io/k8snetworkplumbingwg/ovs-cni-plugin:v0.34.0
      sriovDevicePlugin: ghcr.io/k8snetworkplumbingwg/sriov-network-device-plugin:v3.7.0
      resourcesInjector: ghcr.io/k8snetworkplumbingwg/network-resources-injector:v1.6.0
      webhook: ghcr.io/k8snetworkplumbingwg/sriov-network-operator-webhook:v1.3.0
  2. v1.4.0

    images:
      operator: ghcr.io/k8snetworkplumbingwg/sriov-network-operator:v1.4.0
      sriovConfigDaemon: ghcr.io/k8snetworkplumbingwg/sriov-network-operator-config-daemon:v1.4.0
      sriovCni: ghcr.io/k8snetworkplumbingwg/sriov-cni:v2.8.1
      ibSriovCni: ghcr.io/k8snetworkplumbingwg/ib-sriov-cni:v1.1.1
      ovsCni: ghcr.io/k8snetworkplumbingwg/ovs-cni-plugin:v0.34.2
      rdmaCni: ghcr.io/k8snetworkplumbingwg/rdma-cni:v1.2.0
      sriovDevicePlugin: ghcr.io/k8snetworkplumbingwg/sriov-network-device-plugin:v3.7.0
      resourcesInjector: ghcr.io/k8snetworkplumbingwg/network-resources-injector:v1.6.0
      webhook: ghcr.io/k8snetworkplumbingwg/sriov-network-operator-webhook:v1.4.0
      metricsExporter: ghcr.io/k8snetworkplumbingwg/sriov-network-metrics-exporter:v1.1.0
      metricsExporterKubeRbacProxy: gcr.io/kubebuilder/kube-rbac-proxy:v0.15.0
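
I got these by pulling both charts; a quicker way to compare would be something like the following, again assuming a Helm version with OCI support:

diff \
  <(helm show values oci://ghcr.io/k8snetworkplumbingwg/sriov-network-operator-chart --version 1.3.0) \
  <(helm show values oci://ghcr.io/k8snetworkplumbingwg/sriov-network-operator-chart --version 1.4.0)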

As you know, sriov-device-plugin Pods are created when a SriovNetworkNodePolicy is deployed. After that, my H100 nodes' state changed from sriovnetwork.openshift.io/state: Idle to sriovnetwork.openshift.io/state: Reboot_Required, and the nodes rebooted after some time.
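The state annotation can be watched with something like this (node name is a placeholder):

kubectl get node <node-name> -o jsonpath='{.metadata.annotations.sriovnetwork\.openshift\.io/state}'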

But in v1.4.0, it seems the VFs were created, yet eventually they were no longer visible and even the PF disappeared. Here are the logs from dmesg:

[  115.692158] pci 0000:41:00.1: [15b3:101e] type 00 class 0x020700
[  115.692321] pci 0000:41:00.1: enabling Extended Tags
[  115.694112] mlx5_core 0000:41:00.1: enabling device (0000 -> 0002)
[  115.694789] mlx5_core 0000:41:00.1: firmware version: 28.39.1002
[  115.867939] mlx5_core 0000:41:00.1: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
[  115.867943] mlx5_core 0000:41:00.1: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
[  115.892812] pci 0000:41:00.2: [15b3:101e] type 00 class 0x020700
[  115.892967] pci 0000:41:00.2: enabling Extended Tags
[  115.894706] mlx5_core 0000:41:00.2: enabling device (0000 -> 0002)
[  115.895344] mlx5_core 0000:41:00.2: firmware version: 28.39.1002
[  115.895423] mlx5_core 0000:41:00.1 ibp65s0v0: renamed from ib0
[  116.065557] mlx5_core 0000:41:00.2: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
[  116.065561] mlx5_core 0000:41:00.2: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
[  116.090478] pci 0000:41:00.3: [15b3:101e] type 00 class 0x020700
[  116.090634] pci 0000:41:00.3: enabling Extended Tags
[  116.093559] mlx5_core 0000:41:00.3: enabling device (0000 -> 0002)
[  116.093993] mlx5_core 0000:41:00.2 ibp65s0v1: renamed from ib0
[  116.094189] mlx5_core 0000:41:00.3: firmware version: 28.39.1002
[  116.293582] mlx5_core 0000:41:00.3: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
[  116.293587] mlx5_core 0000:41:00.3: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
[  116.318209] pci 0000:41:00.4: [15b3:101e] type 00 class 0x020700
[  116.318368] pci 0000:41:00.4: enabling Extended Tags
[  116.320079] mlx5_core 0000:41:00.4: enabling device (0000 -> 0002)
[  116.320712] mlx5_core 0000:41:00.4: firmware version: 28.39.1002
[  116.320871] mlx5_core 0000:41:00.3 ibp65s0v2: renamed from ib0
.....
[  446.036867] mlx5_core 0000:41:01.0 ibp65s0v7: renamed from ib0
[  446.464555] mlx5_core 0000:41:00.0: mlx5_wait_for_pages:898:(pid 6868): Skipping wait for vf pages stage
[  448.848149] mlx5_core 0000:41:00.0: driver left SR-IOV enabled after remove                                               <----------- weird
[  449.108562] mlx5_core 0000:41:00.2: poll_health:955:(pid 0): Fatal error 3 detected
[  449.108602] mlx5_core 0000:41:00.4: poll_health:955:(pid 0): Fatal error 3 detected
[  449.108620] mlx5_core 0000:41:00.2: mlx5_health_try_recover:375:(pid 1478): handling bad device here
[  449.108627] mlx5_core 0000:41:00.2: mlx5_handle_bad_state:326:(pid 1478): starting teardown
[  449.108629] mlx5_core 0000:41:00.2: mlx5_error_sw_reset:277:(pid 1478): start
[  449.108646] mlx5_core 0000:41:00.4: mlx5_health_try_recover:375:(pid 2283): handling bad device here
[  449.108660] mlx5_core 0000:41:00.4: mlx5_handle_bad_state:326:(pid 2283): starting teardown
[  449.108661] mlx5_core 0000:41:00.4: mlx5_error_sw_reset:277:(pid 2283): start
[  449.108672] mlx5_core 0000:41:00.2: mlx5_error_sw_reset:310:(pid 1478): end
[  449.108694] mlx5_core 0000:41:00.4: mlx5_error_sw_reset:310:(pid 2283): end
[  449.876577] mlx5_core 0000:41:00.5: poll_health:955:(pid 0): Fatal error 3 detected
[  449.876642] mlx5_core 0000:41:00.5: mlx5_health_try_recover:375:(pid 1000): handling bad device here
[  449.876649] mlx5_core 0000:41:00.5: mlx5_handle_bad_state:326:(pid 1000): starting teardown
[  449.876651] mlx5_core 0000:41:00.5: mlx5_error_sw_reset:277:(pid 1000): start
[  449.877266] mlx5_core 0000:41:00.5: mlx5_error_sw_reset:310:(pid 1000): end
[  450.381036] mlx5_core 0000:41:00.2: mlx5_health_try_recover:381:(pid 1478): starting health recovery flow

After that, when I executed mst status -v, the node couldn't even find the PF itself:

$ mst status -v
MST modules:
------------
    MST PCI module is not loaded
    MST PCI configuration module loaded
PCI devices:
------------
DEVICE_TYPE             MST                           PCI       RDMA            NET                       NUMA  
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf5      e5:00.0   mlx5_5          net-ibp229s0              1     
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf4      d3:00.0                                             1     <---- it goes empty
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf3      c1:00.0   mlx5_3          net-ibp193s0              1     
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf2      9d:00.0                                             1     <---- it goes empty
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf1      54:00.0   mlx5_1          net-ibp84s0               0     
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf0      41:00.0   mlx5_0          net-ibp65s0               0 
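
I'm not sure a plain PCI rescan can recover a device after a fatal error like this, but for reference this is what I'd try before a full reboot:

echo 1 | sudo tee /sys/bus/pci/rescan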

Do you know anything about this situation? Any pointers would be very helpful.

Thanks.

zeeke commented 1 month ago

Yes, please pack all this information into a new issue. It will help other users find the information more easily.

jslouisyou commented 1 month ago

Thanks @zeeke. I'll wrap this up and raise a new issue then.