What happens if you set `export NCCL_DEBUG=INFO ...` so you can get the information required to diagnose the problem?
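For reference, a minimal sketch of how that could be combined with the benchmark command used in this issue (the binary path and the optional `NCCL_DEBUG_SUBSYS` filter are assumptions, not requirements):

```sh
# Turn on verbose NCCL logging for the benchmark run.
export NCCL_DEBUG=INFO
# Optional: limit the output to the init and P2P subsystems.
export NCCL_DEBUG_SUBSYS=INIT,P2P
# Same nccl-tests invocation as in the "To Reproduce" section below.
NCCL_P2P_DISABLE=0 NCCL_P2P_LEVEL=SYS ./all_reduce_perf -b 1M -e 2g -g 2 -f 2
```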
Yes, I tried to obtain useful information, but I didn't find any abnormalities:

```
nThread 1 nGpus 2 minBytes 1048576 maxBytes 2147483648 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
Using devices
  Rank 0 Group 0 Pid 1280 on gpus278 device 0 [0x16] NVIDIA GeForce RTX 4090
  Rank 1 Group 0 Pid 1280 on gpus278 device 1 [0x36] NVIDIA GeForce RTX 4090
gpus278:1280:1280 [0] NCCL INFO Bootstrap : Using bond0:10.176.2.78<0>
gpus278:1280:1280 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
gpus278:1280:1280 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6)
gpus278:1280:1280 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
gpus278:1280:1280 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v6)
gpus278:1280:1280 [1] NCCL INFO cudaDriverVersion 12040
NCCL version 2.19.3+cuda12.3
gpus278:1280:1297 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
gpus278:1280:1297 [1] NCCL INFO P2P plugin IBext
gpus278:1280:1297 [1] NCCL INFO NET/IB : Using [0]mlx5_bond_0:1/RoCE [RO]; OOB bond0:10.176.2.78<0>
gpus278:1280:1297 [1] NCCL INFO Using non-device net plugin version 0
gpus278:1280:1296 [0] NCCL INFO Using non-device net plugin version 0
gpus278:1280:1297 [1] NCCL INFO Using network IBext
gpus278:1280:1296 [0] NCCL INFO Using network IBext
gpus278:1280:1297 [1] NCCL INFO comm 0x561f3b612c40 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 36000 commId 0x791e81b38cc809b4 - Init START
gpus278:1280:1296 [0] NCCL INFO comm 0x561f3b60e070 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 16000 commId 0x791e81b38cc809b4 - Init START
gpus278:1280:1297 [1] NCCL INFO NCCL_P2P_LEVEL set by environment to SYS
gpus278:1280:1297 [1] NCCL INFO Setting affinity for GPU 1 to ffffffff,00000000,ffffffff
gpus278:1280:1296 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,00000000,ffffffff
gpus278:1280:1297 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
gpus278:1280:1297 [1] NCCL INFO P2P Chunksize set to 131072
gpus278:1280:1296 [0] NCCL INFO Channel 00/02 : 0 1
gpus278:1280:1296 [0] NCCL INFO Channel 01/02 : 0 1
gpus278:1280:1296 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
gpus278:1280:1296 [0] NCCL INFO P2P Chunksize set to 131072
gpus278:1280:1297 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/direct pointer
gpus278:1280:1296 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/direct pointer
gpus278:1280:1297 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/direct pointer
gpus278:1280:1296 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/direct pointer
gpus278:1280:1296 [0] NCCL INFO Connected all rings
gpus278:1280:1297 [1] NCCL INFO Connected all rings
gpus278:1280:1297 [1] NCCL INFO Connected all trees
gpus278:1280:1297 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
gpus278:1280:1297 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
gpus278:1280:1296 [0] NCCL INFO Connected all trees
gpus278:1280:1296 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
gpus278:1280:1296 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
gpus278:1280:1297 [1] NCCL INFO comm 0x561f3b612c40 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 36000 commId 0x791e81b38cc809b4 - Init COMPLETE
gpus278:1280:1296 [0] NCCL INFO comm 0x561f3b60e070 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 16000 commId 0x791e81b38cc809b4 - Init COMPLETE

                                                       out-of-place                        in-place
        size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
         (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
     1048576        262144     float     sum      -1    71.08   14.75   14.75      0    71.50   14.66   14.66      0
     2097152        524288     float     sum      -1    119.1   17.61   17.61      0    118.7   17.67   17.67      0
     4194304       1048576     float     sum      -1    220.1   19.05   19.05      0    220.0   19.06   19.06      0
     8388608       2097152     float     sum      -1    422.3   19.87   19.87      0    421.9   19.88   19.88      0
    16777216       4194304     float     sum      -1    832.0   20.16   20.16      0    831.4   20.18   20.18      0
    33554432       8388608     float     sum      -1   1651.0   20.32   20.32      0   1650.6   20.33   20.33      0
    67108864      16777216     float     sum      -1   3289.3   20.40   20.40      0   3288.1   20.41   20.41      0
   134217728      33554432     float     sum      -1   6563.4   20.45   20.45      0   6563.1   20.45   20.45      0
   268435456      67108864     float     sum      -1    13112   20.47   20.47      0    13112   20.47   20.47      0
   536870912     134217728     float     sum      -1    26210   20.48   20.48      0    26207   20.49   20.49      0
  1073741824     268435456     float     sum      -1    52403   20.49   20.49      0    52399   20.49   20.49      0
  2147483648     536870912     float     sum      -1   104789   20.49   20.49      0   104783   20.49   20.49      0
gpus278:1280:1280 [1] NCCL INFO comm 0x561f3b60e070 rank 0 nranks 2 cudaDev 0 busId 16000 - Destroy COMPLETE
gpus278:1280:1280 [1] NCCL INFO comm 0x561f3b612c40 rank 1 nranks 2 cudaDev 1 busId 36000 - Destroy COMPLETE
Out of bounds values : 0 OK
Avg bus bandwidth    : 19.5483
```
The P2P bandwidth did increase from 16 GB/s to 20 GB/s, but the Intel PCIe 5.0 platform still falls well short of the 25 GB/s that the AMD PCIe 4.0 platform can achieve.
@ywxc1997 This seems like a platform issue rather than a driver bug. 🤔 Also, sometimes the NVIDIA driver drops the lane speed to save power, and the RTX 4090 is capped at PCIe 4.0 bandwidth even when installed in a PCIe 5.0 slot, so in principle it should run at the same speed on Intel and AMD systems. It could also be a BIOS setting managing the link speed, for example if the Intel board is set to auto-detect while the AMD board is forced to PCIe 4.0. 20 GB/s is pretty fast, though.
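One way to rule out link down-training is to compare the negotiated PCIe generation and width against the maximum. A rough sketch; the nvidia-smi query fields are standard, and the lspci address 16:00.0 is just the bus ID taken from the log above:

```sh
# Negotiated vs. maximum PCIe link generation and width, per GPU.
nvidia-smi --query-gpu=index,name,pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max --format=csv

# Cross-check one GPU with lspci: LnkSta is the trained speed/width, LnkCap the maximum.
sudo lspci -s 16:00.0 -vv | grep -E 'LnkCap:|LnkSta:'
```

A healthy Gen4 x16 link should report 16GT/s x16 in LnkSta; anything lower would already explain a cap well below 25 GB/s.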
I completely agree. I have tried many BIOS and PCIe configuration adjustments, but none of them helped. I suspect PCIe P2P performance is not a priority in these CPUs' design, and that systems built around a PCIe switch may show significant improvement. Thank you for your reply; I will close this issue.
NVIDIA Open GPU Kernel Modules Version
550.54.15
Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.
Operating System and Version
Ubuntu 22.04 LTS
Kernel Release
5.15.0-25-generic
Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.
Hardware: GPU
NVIDIA GeForce RTX 4090
Describe the bug
First, I have confirmed that I am able to use P2P.
When I tried P2P on AMD machines, both the PCIe 4.0 and PCIe 5.0 systems reached 25 GB/s between two GPUs in the all_reduce_perf and nvbandwidth tests. But on machines from two manufacturers, one with Intel(R) Xeon(R) Gold 6530 and one with Intel(R) Xeon(R) Gold 6430 CPUs, the bandwidth between GPUs only reaches 20 GB/s, whether with 8 cards or 2.
I don't know whether you have encountered this before; do you have a good solution?
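For context, a sketch of how the GPU-to-GPU path can be inspected on the Intel host (nvidia-smi and the nvbandwidth binary are assumed to be available; the nvbandwidth testcase name is from memory and may differ between versions):

```sh
# Show how the two GPUs are connected on this host
# (a NODE/SYS entry means traffic crosses the CPU root complex rather than a PCIe switch).
nvidia-smi topo -m

# Raw copy-engine P2P bandwidth between the GPUs with nvbandwidth
# (testcase name is an assumption; check `./nvbandwidth -l` for the list in your build).
./nvbandwidth -t device_to_device_memcpy_read_ce
```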
To Reproduce
NCCL_P2P_DISABLE=0 NCCL_P2P_LEVEL=SYS ./all_reduce_perf -b 1M -e 2g -g 2 -f 2
Bug Incidence
Always
nvidia-bug-report.log.gz
none
More Info
No response