open-mpi / hwloc

Hardware locality (hwloc)
https://www.open-mpi.org/projects/hwloc

NVIDIA PCI Gen4 link speed from NVML is wrong #653

Closed: aksbaih closed this issue 6 months ago

aksbaih commented 6 months ago

https://github.com/open-mpi/hwloc/blob/39fae7e3151fbd677f953f6990acd9eb2a0b9bfb/hwloc/topology-pci.c#L313

bgoglin commented 6 months ago

Hello. I've never heard of any device with two such capabilities. What does it mean and how are we supposed to know which one is good? Does Linux even support that?

bgoglin commented 6 months ago

I couldn't find anything in the Linux kernel PCI code that handles such cases. They seem to always take the first found capability.
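For context, finding that capability means walking the standard capability linked list in config space and stopping at the first entry whose ID is PCI_CAP_ID_EXP. A minimal sketch of that walk over a raw config-space dump (an illustration only, not hwloc's or the kernel's actual code):

#include <stdint.h>

#define PCI_CAPABILITY_LIST 0x34   /* offset of the pointer to the first capability */
#define PCI_CAP_ID_EXP      0x10   /* PCI Express capability ID */

/* Return the config-space offset of the first PCI Express capability,
 * or 0 if none is found; 'config' is a raw 256-byte config-space dump. */
static unsigned find_pcie_cap(const uint8_t *config)
{
  unsigned pos = config[PCI_CAPABILITY_LIST];
  unsigned iterations = 0;
  while (pos >= 0x40 && iterations++ < 48) {
    if (config[pos] == PCI_CAP_ID_EXP)
      return pos;                /* first match wins */
    pos = config[pos + 1];       /* follow the "next capability" pointer */
  }
  return 0;
}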

aksbaih commented 6 months ago

Here's an example: an NVIDIA A100.

da:00.0 3D controller [0302]: NVIDIA Corporation GA100 [A100 SXM4 40GB] [10de:20b0] (rev a1)
    Subsystem: NVIDIA Corporation Device [10de:134f]
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0, Cache Line Size: 64 bytes
    Interrupt: pin A routed to IRQ 87
    NUMA node: 5
    Region 0: Memory at c5000000 (32-bit, non-prefetchable) [size=16M]
    Region 1: Memory at 52000000000 (64-bit, prefetchable) [size=64G]
    Region 3: Memory at 53420000000 (64-bit, prefetchable) [size=32M]
    Capabilities: [60] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [68] Null
    Capabilities: [78] Express (v2) Endpoint, MSI 00
        DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 25.000W
        DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq-
            RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop- FLReset-
            MaxPayload 256 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
        LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM not supported
            ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
        LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
            ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 16GT/s (ok), Width x16 (ok)
            TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR-
             10BitTagComp+ 10BitTagReq+ OBFF Via message, ExtFmt- EETLPPrefix-
             EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
             FRS- TPHComp- ExtTPHComp-
             AtomicOpsCap: 32bit- 64bit- 128bitCAS-
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
             AtomicOpsCtl: ReqEn-
        LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
        LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
             Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
             Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
             EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
             Retimer- 2Retimers- CrosslinkRes: unsupported
    Capabilities: [c8] MSI-X: Enable+ Count=6 Masked-
        Vector table: BAR=0 offset=00b90000
        PBA: BAR=0 offset=00ba0000
    Capabilities: [100 v1] Virtual Channel
        Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
        Arb:    Fixed- WRR32- WRR64- WRR128-
        Ctrl:   ArbSelect=Fixed
        Status: InProgress-
        VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
            Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
            Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=01
            Status: NegoPending- InProgress-
    Capabilities: [258 v1] L1 PM Substates
        L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
              PortCommonModeRestoreTime=255us PortTPowerOnTime=10us
        L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
               T_CommonMode=0us LTR1.2_Threshold=0ns
        L1SubCtl2: T_PwrOn=10us
    Capabilities: [128 v1] Power Budgeting <?>
    Capabilities: [420 v2] Advanced Error Reporting
        UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
        UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO+ CmpltAbrt+ UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq+ ACSViol-
        CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
        CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
        AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
            MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
        HeaderLog: 00000000 00000000 00000000 00000000
    Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
    Capabilities: [900 v1] Secondary PCI Express
        LnkCtl3: LnkEquIntrruptEn- PerformEqu-
        LaneErrStat: 0
    Capabilities: [bb0 v1] Physical Resizable BAR
        BAR 0: current size: 16MB, supported: 16MB
        BAR 1: current size: 64GB, supported: 64MB 128MB 256MB 512MB 1GB 2GB 4GB 8GB 16GB 32GB 64GB
        BAR 3: current size: 32MB, supported: 32MB
    Capabilities: [bcc v1] Single Root I/O Virtualization (SR-IOV)
        IOVCap: Migration-, Interrupt Message Number: 000
        IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy-
        IOVSta: Migration-
        Initial VFs: 16, Total VFs: 16, Number of VFs: 0, Function Dependency Link: 00
        VF offset: 4, stride: 1, Device ID: 20b0
        Supported Page Size: 00000573, System Page Size: 00000001
        Region 0: Memory at c6000000 (32-bit, non-prefetchable)
        Region 1: Memory at 0000053000000000 (64-bit, prefetchable)
        Region 3: Memory at 0000053400000000 (64-bit, prefetchable)
        VF Migration: offset: 00000000, BIR: 0
    Capabilities: [c14 v1] Alternative Routing-ID Interpretation (ARI)
        ARICap: MFVC- ACS-, Next Function: 0
        ARICtl: MFVC- ACS-, Function Group: 0
    Capabilities: [c1c v1] Physical Layer 16.0 GT/s <?>
    Capabilities: [d00 v1] Lane Margining at the Receiver <?>
    Capabilities: [e00 v1] Data Link Feature <?>
    Kernel driver in use: nvidia
    Kernel modules: nouveau, nvidia_drm, nvidia

Notice the LnkSta2: your tool shows a link speed of 16 GB/s instead of 32 GB/s.

bgoglin commented 6 months ago

Ok, I see, LnkSta2 is not a second instance of PCI_CAP_ID_EXP; it comes from another capability (PCI_CAP_LNKCAP2) that reports the maximal speed supported by the device, not the current one. If your card supports PCIe Gen3 x16 and you put it in a Gen4 x8 port, this capability will report something higher than the actual maximum link speed on this platform. That's why we don't use this capability but rather the "current link speed" given in PCI_CAP_ID_EXP. Unfortunately that one isn't perfect either, because lots of modern GPUs decrease the link speed when idle, hence you may get something lower depending on what's running on the GPU when you call lspci or lstopo. Make sure the GPU status doesn't change between tests.
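(Side note for readers: converting a link status into a data rate involves the encoding overhead, 8b/10b for Gen1/2 and 128b/130b for Gen3 and later. A small sketch of that conversion, an illustration rather than hwloc's actual code:)

/* Approximate PCIe data rate in GB/s from the per-lane signal rate (GT/s),
 * the link width, and the generation; Gen1/2 use 8b/10b encoding,
 * Gen3 and later use 128b/130b. Sketch only. */
static double pcie_data_rate(double gt_per_lane, unsigned width, unsigned gen)
{
  double encoding = (gen <= 2) ? 8.0 / 10.0 : 128.0 / 130.0;
  return gt_per_lane * encoding * width / 8.0;  /* bits to bytes */
}
/* pcie_data_rate(16.0, 16, 4) is about 31.5 GB/s, i.e. the ~32 GB/s expected here. */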

In your case, lspci says x16 lanes at 16GT/s (PCIe Gen4), which is indeed about 32GB/s. I don't know why hwloc gets 16GB/s if the idleness of the device didn't change between lspci and lstopo. hwloc has two ways to get that info: either directly from PCI config space or through Linux. Try setting HWLOC_COMPONENTS=-pci or =pci in the environment to avoid the PCI config space and force reading the Linux info instead. Do you get a different result?

aksbaih commented 6 months ago
PCIBridge L#21 (busid=0000:50:00.0 id=1000:c010 class=0604(PCIBridge) link=31.51GB/s buses=0000:[51-51])
                              PCI L#6 (busid=0000:51:00.0 id=10de:20b0 class=0302(3D) link=15.75GB/s)
                                Co-Processor(CUDA) L#9 (Backend=CUDA GPUVendor="NVIDIA Corporation" GPUModel="NVIDIA A100-SXM4-40GB" CUDAGlobalMemorySize=41486144 CUDAL2CacheSize=40960 CUDAMultiProcessors=108 CUDACoresPerMP=64 CUDASharedMemorySizePerMP=48) "cuda2"
                                Co-Processor(OpenCL) L#10 (Backend=OpenCL OpenCLDeviceType=GPU GPUVendor="NVIDIA Corporation" GPUModel="NVIDIA A100-SXM4-40GB" OpenCLPlatformIndex=0 OpenCLPlatformName="NVIDIA CUDA" OpenCLPlatformDeviceIndex=2 OpenCLComputeUnits=108 OpenCLGlobalMemorySize=41486144) "opencl0d2"
                                GPU(NVML) L#11 (Backend=NVML GPUVendor="NVIDIA Corporation" GPUModel="NVIDIA A100-SXM4-40GB" NVIDIASerial=1320921008257 NVIDIAUUID=GPU-063c9b30-62c8-3e20-b0a5-370bc4e5745d) "nvml2"
                        PCIBridge L#22 (busid=0000:4e:10.0 id=1000:c010 class=0604(PCIBridge) link=31.51GB/s buses=0000:[52-55])
                          PCIBridge L#23 (busid=0000:52:00.0 id=1000:c010 class=0604(PCIBridge) link=31.51GB/s buses=0000:[53-55])
                            PCIBridge L#24 (busid=0000:53:00.0 id=1000:c010 class=0604(PCIBridge) link=31.51GB/s buses=0000:[54-54])
                              PCI L#7 (busid=0000:54:00.0 id=10de:20b0 class=0302(3D) link=15.75GB/s)
                                Co-Processor(CUDA) L#12 (Backend=CUDA GPUVendor="NVIDIA Corporation" GPUModel="NVIDIA A100-SXM4-40GB" CUDAGlobalMemorySize=41486144 CUDAL2CacheSize=40960 CUDAMultiProcessors=108 CUDACoresPerMP=64 CUDASharedMemorySizePerMP=48) "cuda3"
                                Co-Processor(OpenCL) L#13 (Backend=OpenCL OpenCLDeviceType=GPU GPUVendor="NVIDIA Corporation" GPUModel="NVIDIA A100-SXM4-40GB" OpenCLPlatformIndex=0 OpenCLPlatformName="NVIDIA CUDA" OpenCLPlatformDeviceIndex=3 OpenCLComputeUnits=108 OpenCLGlobalMemorySize=41486144) "opencl0d3"

It still reports 16GB/s with all variations of:

HWLOC_COMPONENTS=-pci
HWLOC_COMPONENTS==pci
HWLOC_COMPONENTS=linux,stop
HWLOC_COMPONENTS=pci,stop

Both with and without an IO-heavy workload on all the GPUs. Thanks!

bgoglin commented 6 months ago

Looks like the Linux kernel reports 16GB/s too (these values are just read from /sys/bus/pci/devices/<bdf>/current_link_* where "bdf" is something like 0000:51:00.0). I noticed that the lstopo output above shows 32GB/s in the PCI bridges above the GPUs and 16GB/s in each GPU, which is strange if they are not idle.
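(For reference, a minimal sketch of reading those two sysfs attributes for the device from this report, 0000:51:00.0; this is just an illustration, not hwloc's Linux backend code:)

#include <stdio.h>

/* Print the current link speed and width the kernel exposes for 0000:51:00.0. */
int main(void)
{
  const char *attrs[] = { "current_link_speed", "current_link_width" };
  char path[128], line[64];
  for (int i = 0; i < 2; i++) {
    FILE *f;
    snprintf(path, sizeof(path), "/sys/bus/pci/devices/0000:51:00.0/%s", attrs[i]);
    f = fopen(path, "r");
    if (f && fgets(line, sizeof(line), f))
      printf("%s: %s", attrs[i], line);
    if (f)
      fclose(f);
  }
  return 0;
}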

aksbaih commented 6 months ago
cat /sys/bus/pci/devices/0000:51:00.0/current_link_speed
16.0 GT/s PCIe
cat /sys/bus/pci/devices/0000:51:00.0/current_link_width
16

Same result whether idle or not. x16 at 16GT/s is 32GB/s.

bgoglin commented 6 months ago

Oh, I think I know: we also have some NVML code to detect that bandwidth, and that code was forgotten when PCIe Gen>3 support was added; only the Linux and PCI backends were factorized and updated. Try "export HWLOC_COMPONENTS=-nvml" to disable the buggy NVML code. And this patch should fix it (but I'll apply a different one that factorizes this with the Linux and PCI code):

--- a/hwloc/topology-nvml.c
+++ b/hwloc/topology-nvml.c
@@ -257,7 +257,7 @@ hwloc_nvml_discover(struct hwloc_backend *backend, struct hwloc_disc_status *dst
         * PCIe Gen2 = 5  GT/s signal-rate per lane with 8/10 encoding    = 0.5 GB/s data-rate per lane
         * PCIe Gen3 = 8  GT/s signal-rate per lane with 128/130 encoding = 1   GB/s data-rate per lane
         */
-       lanespeed = maxgen <= 2 ? 2.5 * maxgen * 0.8 : 8.0 * 128/130; /* Gbit/s per lane */
+       lanespeed = maxgen <= 2 ? 2.5 * maxgen * 0.8 : 8.0 * (1<<(maxgen-3)) * 128/130; /* Gbit/s per lane */
        if (lanespeed * maxwidth != 0.)
          /* we found the max link speed, replace the current link speed found by pci (or none) */
          parent->attr->pcidev.linkspeed = lanespeed * maxwidth / 8; /* GB/s */
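
(With this change, maxgen=4 yields a per-lane rate of 8.0 * 2 * 128/130 ≈ 15.75 Gbit/s, so a x16 Gen4 link comes out to ≈ 31.5 GB/s, matching the ~31.51 GB/s that lstopo reports for the bridges above, instead of the 15.75 GB/s the unpatched NVML code reported for each GPU.)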
aksbaih commented 6 months ago

HWLOC_COMPONENTS=-nvml fixes it. Thank you!

bgoglin commented 3 months ago

I am posting 2.11rc1 right now with the proper fix.