Hello. I've never heard of any device with two such capabilities. What does it mean, and how are we supposed to know which one is the right one? Does Linux even support that?
I couldn't find anything in the Linux kernel PCI code that handles such cases. They seem to always take the first found capability.
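For context, the legacy capability list is just a linked list inside PCI config space: offset 0x34 holds the first capability offset, and each entry stores its ID and the offset of the next one. Below is a minimal sketch (not hwloc's or the kernel's actual code) of walking that list from userspace through sysfs; the device path is only an example, and reading past the first 64 bytes of config space usually requires root:

#include <stdio.h>
#include <stdint.h>

#define PCI_STATUS          0x06
#define PCI_STATUS_CAP_LIST 0x10
#define PCI_CAPABILITY_LIST 0x34

/* Walk the legacy PCI capability list of a device whose config space was
 * read into buf[] (e.g. from /sys/bus/pci/devices/<bdf>/config).
 * Prints every capability ID found, duplicates included. */
static void walk_caps(const uint8_t *buf, size_t len)
{
    if (len < 64 || !(buf[PCI_STATUS] & PCI_STATUS_CAP_LIST))
        return;
    uint8_t pos = buf[PCI_CAPABILITY_LIST] & ~3;
    int guard = 48; /* avoid looping forever on a corrupted list */
    while (pos >= 0x40 && pos + 1 < len && guard--) {
        uint8_t id = buf[pos];        /* capability ID */
        uint8_t next = buf[pos + 1];  /* offset of next capability, 0 = end */
        if (id == 0xff)
            break;
        printf("cap 0x%02x at offset 0x%02x\n", id, pos);
        if (!next)
            break;
        pos = next & ~3;
    }
}

int main(int argc, char *argv[])
{
    /* hypothetical path; adjust the BDF for your device */
    const char *path = argc > 1 ? argv[1] : "/sys/bus/pci/devices/0000:51:00.0/config";
    uint8_t buf[4096] = { 0 };
    FILE *f = fopen(path, "rb");
    if (!f) { perror("fopen"); return 1; }
    size_t len = fread(buf, 1, sizeof(buf), f);
    fclose(f);
    walk_caps(buf, len);
    return 0;
}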
Here's an example: Nvidia A100
da:00.0 3D controller [0302]: NVIDIA Corporation GA100 [A100 SXM4 40GB] [10de:20b0] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:134f]
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 87
NUMA node: 5
Region 0: Memory at c5000000 (32-bit, non-prefetchable) [size=16M]
Region 1: Memory at 52000000000 (64-bit, prefetchable) [size=64G]
Region 3: Memory at 53420000000 (64-bit, prefetchable) [size=32M]
Capabilities: [60] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [68] Null
Capabilities: [78] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 25.000W
DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop- FLReset-
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM not supported
ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 16GT/s (ok), Width x16 (ok)
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR-
10BitTagComp+ 10BitTagReq+ OBFF Via message, ExtFmt- EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS- TPHComp- ExtTPHComp-
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
AtomicOpsCtl: ReqEn-
LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
Retimer- 2Retimers- CrosslinkRes: unsupported
Capabilities: [c8] MSI-X: Enable+ Count=6 Masked-
Vector table: BAR=0 offset=00b90000
PBA: BAR=0 offset=00ba0000
Capabilities: [100 v1] Virtual Channel
Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
Arb: Fixed- WRR32- WRR64- WRR128-
Ctrl: ArbSelect=Fixed
Status: InProgress-
VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=01
Status: NegoPending- InProgress-
Capabilities: [258 v1] L1 PM Substates
L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
PortCommonModeRestoreTime=255us PortTPowerOnTime=10us
L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
T_CommonMode=0us LTR1.2_Threshold=0ns
L1SubCtl2: T_PwrOn=10us
Capabilities: [128 v1] Power Budgeting <?>
Capabilities: [420 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO+ CmpltAbrt+ UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq+ ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Capabilities: [900 v1] Secondary PCI Express
LnkCtl3: LnkEquIntrruptEn- PerformEqu-
LaneErrStat: 0
Capabilities: [bb0 v1] Physical Resizable BAR
BAR 0: current size: 16MB, supported: 16MB
BAR 1: current size: 64GB, supported: 64MB 128MB 256MB 512MB 1GB 2GB 4GB 8GB 16GB 32GB 64GB
BAR 3: current size: 32MB, supported: 32MB
Capabilities: [bcc v1] Single Root I/O Virtualization (SR-IOV)
IOVCap: Migration-, Interrupt Message Number: 000
IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy-
IOVSta: Migration-
Initial VFs: 16, Total VFs: 16, Number of VFs: 0, Function Dependency Link: 00
VF offset: 4, stride: 1, Device ID: 20b0
Supported Page Size: 00000573, System Page Size: 00000001
Region 0: Memory at c6000000 (32-bit, non-prefetchable)
Region 1: Memory at 0000053000000000 (64-bit, prefetchable)
Region 3: Memory at 0000053400000000 (64-bit, prefetchable)
VF Migration: offset: 00000000, BIR: 0
Capabilities: [c14 v1] Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 0
ARICtl: MFVC- ACS-, Function Group: 0
Capabilities: [c1c v1] Physical Layer 16.0 GT/s <?>
Capabilities: [d00 v1] Lane Margining at the Receiver <?>
Capabilities: [e00 v1] Data Link Feature <?>
Kernel driver in use: nvidia
Kernel modules: nouveau, nvidia_drm, nvidia
Notice the LnkSta2. Your tool shows a link speed of 16 GB/s instead of 32 GB/s.
Ok, I see, LnkSta2 is not a second instance of PCI_CAP_ID_EXP; it comes from another capability (PCI_CAP_LNKCAP2) that reports the maximal speed supported by the device, not the current one. If your card supports PCIe Gen3 x16 and you put it in a Gen4 x8 slot, this capability will report something higher than the actual maximum link speed on that platform. That's why we don't use this capability but rather the "current link speed" given in PCI_CAP_ID_EXP. Unfortunately that one isn't perfect either, because many modern GPUs decrease the link speed when idle, so you may get something lower depending on what's running on the GPU when you call lspci or lstopo. Make sure the GPU status doesn't change between tests.
In your case, lspci says x16 lanes at 16GT/s (PCIe Gen4), which is indeed ~32GB/s. I don't know why hwloc gets 16GB/s if the idleness of the device didn't change between lspci and lstopo. hwloc has two ways to get that info: either directly from PCI config space or through Linux. Try setting HWLOC_COMPONENTS=-pci or =pci in the environment to avoid the PCI config space and force reading the Linux info instead. Do you get a different result?
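For the arithmetic: 16 GT/s is the Gen4 signaling rate per lane; with 128b/130b encoding that is about 15.75 Gbit/s of data per lane, so x16 gives roughly 31.5 GB/s, i.e. the "~32 GB/s" being discussed. A throwaway sketch of that computation (illustrative only, not hwloc's code):

#include <stdio.h>

/* Rough PCIe data rate per direction: signal rate (GT/s) x lanes x encoding
 * efficiency, converted from Gbit/s to GB/s. Encoding is 8b/10b up to Gen2
 * and 128b/130b from Gen3 on. */
static double pcie_gbps(double gt_per_s, int lanes)
{
    double eff = gt_per_s <= 5.0 ? 8.0 / 10.0 : 128.0 / 130.0;
    return gt_per_s * eff * lanes / 8.0; /* GB/s */
}

int main(void)
{
    printf("Gen3 x16: %.2f GB/s\n", pcie_gbps(8.0, 16));  /* ~15.75 */
    printf("Gen4 x16: %.2f GB/s\n", pcie_gbps(16.0, 16)); /* ~31.51 */
    return 0;
}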
PCIBridge L#21 (busid=0000:50:00.0 id=1000:c010 class=0604(PCIBridge) link=31.51GB/s buses=0000:[51-51])
PCI L#6 (busid=0000:51:00.0 id=10de:20b0 class=0302(3D) link=15.75GB/s)
Co-Processor(CUDA) L#9 (Backend=CUDA GPUVendor="NVIDIA Corporation" GPUModel="NVIDIA A100-SXM4-40GB" CUDAGlobalMemorySize=41486144 CUDAL2CacheSize=40960 CUDAMultiProcessors=108 CUDACoresPerMP=64 CUDASharedMemorySizePerMP=48) "cuda2"
Co-Processor(OpenCL) L#10 (Backend=OpenCL OpenCLDeviceType=GPU GPUVendor="NVIDIA Corporation" GPUModel="NVIDIA A100-SXM4-40GB" OpenCLPlatformIndex=0 OpenCLPlatformName="NVIDIA CUDA" OpenCLPlatformDeviceIndex=2 OpenCLComputeUnits=108 OpenCLGlobalMemorySize=41486144) "opencl0d2"
GPU(NVML) L#11 (Backend=NVML GPUVendor="NVIDIA Corporation" GPUModel="NVIDIA A100-SXM4-40GB" NVIDIASerial=1320921008257 NVIDIAUUID=GPU-063c9b30-62c8-3e20-b0a5-370bc4e5745d) "nvml2"
PCIBridge L#22 (busid=0000:4e:10.0 id=1000:c010 class=0604(PCIBridge) link=31.51GB/s buses=0000:[52-55])
PCIBridge L#23 (busid=0000:52:00.0 id=1000:c010 class=0604(PCIBridge) link=31.51GB/s buses=0000:[53-55])
PCIBridge L#24 (busid=0000:53:00.0 id=1000:c010 class=0604(PCIBridge) link=31.51GB/s buses=0000:[54-54])
PCI L#7 (busid=0000:54:00.0 id=10de:20b0 class=0302(3D) link=15.75GB/s)
Co-Processor(CUDA) L#12 (Backend=CUDA GPUVendor="NVIDIA Corporation" GPUModel="NVIDIA A100-SXM4-40GB" CUDAGlobalMemorySize=41486144 CUDAL2CacheSize=40960 CUDAMultiProcessors=108 CUDACoresPerMP=64 CUDASharedMemorySizePerMP=48) "cuda3"
Co-Processor(OpenCL) L#13 (Backend=OpenCL OpenCLDeviceType=GPU GPUVendor="NVIDIA Corporation" GPUModel="NVIDIA A100-SXM4-40GB" OpenCLPlatformIndex=0 OpenCLPlatformName="NVIDIA CUDA" OpenCLPlatformDeviceIndex=3 OpenCLComputeUnits=108 OpenCLGlobalMemorySize=41486144) "opencl0d3"
Still reports 16GB/s with all variations of:
HWLOC_COMPONENTS=-pci
HWLOC_COMPONENTS==pci
HWLOC_COMPONENTS=linux,stop
HWLOC_COMPONENTS=pci,stop
Both with and without an IO-heavy workload on all the GPUs. Thanks!
Looks like the Linux kernel reports 16GB/s too (these values are just read from /sys/bus/pci/devices/<bdf>/current_link_* where "bdf" is something like 0000:51:00.0).
I noticed that the lstopo output above shows 32GB/s in the PCI bridge above the GPUs and 16GB/s in each GPU, which is strange if they are not idle.
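A quick way to cross-check outside of hwloc is to read those two sysfs attributes and redo the computation yourself. Something along these lines (path hard-coded for this example; not hwloc's implementation):

#include <stdio.h>
#include <stdlib.h>

/* Read current_link_speed ("16.0 GT/s PCIe") and current_link_width ("16")
 * from sysfs and print the resulting data rate. */
int main(void)
{
    const char *dev = "/sys/bus/pci/devices/0000:51:00.0"; /* adjust BDF */
    char path[256], line[64];
    double speed = 0.0; int width = 0;
    FILE *f;

    snprintf(path, sizeof(path), "%s/current_link_speed", dev);
    if ((f = fopen(path, "r")) && fgets(line, sizeof(line), f))
        speed = strtod(line, NULL); /* GT/s */
    if (f) fclose(f);

    snprintf(path, sizeof(path), "%s/current_link_width", dev);
    if ((f = fopen(path, "r")) && fgets(line, sizeof(line), f))
        width = atoi(line);
    if (f) fclose(f);

    double eff = speed <= 5.0 ? 0.8 : 128.0 / 130.0; /* encoding efficiency */
    printf("%.1f GT/s x%d => %.2f GB/s\n", speed, width, speed * eff * width / 8.0);
    return 0;
}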
cat /sys/bus/pci/devices/0000:51:00.0/current_link_speed
16.0 GT/s PCIe
cat /sys/bus/pci/devices/0000:51:00.0/current_link_width
16
Same output whether idle or not. x16 at 16GT/s is 32GB/s.
Oh, I think I know: we also have some NVML code to detect that bandwidth, and that code was forgotten when PCIe Gen>3 support was added; only the Linux and PCI backends were factorized and updated. Try "export HWLOC_COMPONENTS=-nvml" to disable the buggy NVML code. This patch should also fix it (but I'll apply a different one to factorize this with the Linux and PCI code):
--- a/hwloc/topology-nvml.c
+++ b/hwloc/topology-nvml.c
@@ -257,7 +257,7 @@ hwloc_nvml_discover(struct hwloc_backend *backend, struct hwloc_disc_status *dst
* PCIe Gen2 = 5 GT/s signal-rate per lane with 8/10 encoding = 0.5 GB/s data-rate per lane
* PCIe Gen3 = 8 GT/s signal-rate per lane with 128/130 encoding = 1 GB/s data-rate per lane
*/
- lanespeed = maxgen <= 2 ? 2.5 * maxgen * 0.8 : 8.0 * 128/130; /* Gbit/s per lane */
+ lanespeed = maxgen <= 2 ? 2.5 * maxgen * 0.8 : 8.0 * (1<<(maxgen-3)) * 128/130; /* Gbit/s per lane */
if (lanespeed * maxwidth != 0.)
/* we found the max link speed, replace the current link speed found by pci (or none) */
parent->attr->pcidev.linkspeed = lanespeed * maxwidth / 8; /* GB/s */
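To make the effect of that one-liner concrete, here is the per-lane rate both expressions give for a Gen4 x16 device (maxgen=4), computed standalone; the old formula silently caps anything above Gen3 at the Gen3 rate:

#include <stdio.h>

int main(void)
{
    int maxgen = 4;    /* PCIe Gen4, as reported for an A100 */
    int maxwidth = 16;

    /* old formula: anything above Gen2 is treated as Gen3 (8 GT/s) */
    double old_lane = maxgen <= 2 ? 2.5 * maxgen * 0.8 : 8.0 * 128 / 130;
    /* patched formula: rate doubles for each generation above Gen3 */
    double new_lane = maxgen <= 2 ? 2.5 * maxgen * 0.8 : 8.0 * (1 << (maxgen - 3)) * 128 / 130;

    printf("old: %.2f Gb/s/lane -> %.2f GB/s\n", old_lane, old_lane * maxwidth / 8); /* 7.88 -> 15.75 */
    printf("new: %.2f Gb/s/lane -> %.2f GB/s\n", new_lane, new_lane * maxwidth / 8); /* 15.75 -> 31.51 */
    return 0;
}

The 15.75 GB/s printed for the old formula matches the value lstopo reported for each GPU above, and the 31.51 GB/s matches the bridges.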
HWLOC_COMPONENTS=-nvml
fixes it. Thank you!
I am posting 2.11rc1 right now with the proper fix.
https://github.com/open-mpi/hwloc/blob/39fae7e3151fbd677f953f6990acd9eb2a0b9bfb/hwloc/topology-pci.c#L313