P2P on Intel machines has not improved

ywxc1997 commented 4 days ago

NVIDIA Open GPU Kernel Modules Version

550.54.15

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

[ ] I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Ubuntu 22.04 LTS

Kernel Release

5.15.0-25-generic

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

[X] I am running on a stable kernel release.

Hardware: GPU

NVIDIA GeForce RTX 4090

Describe the bug

Firstly, I have confirmed that I am able to use P2P.

On AMD machines, both PCIe 4.0 and PCIe 5.0 platforms reach 25 GB/s between two GPUs in the all_reduce_perf and nvbandwidth tests. But on machines from two different manufacturers, using Intel(R) Xeon(R) Gold 6530 and Intel(R) Xeon(R) Gold 6430 CPUs, the bandwidth between GPUs only reaches 20 GB/s, whether with 8 cards or 2.

Have you encountered this before, and do you have a good solution?

To Reproduce

NCCL_P2P_DISABLE=0 NCCL_P2P_LEVEL=SYS ./all_reduce_perf -b 1M -e 2g -g 2 -f 2

Bug Incidence

Always

nvidia-bug-report.log.gz

none

More Info

No response

mylesgoose commented 4 days ago

What happens if you run `export NCCL_DEBUG=INFO` first? That way you can get the information required to solve the problem.
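For anyone scripting this, a minimal Python sketch of the same suggestion follows. It assumes the `all_reduce_perf` binary from the reproduction command sits in the current directory; the helper name `nccl_debug_env` is hypothetical, and the only environment variables used are the ones already named in this thread.

```python
import os
import subprocess

# Same command as in "To Reproduce"; the binary location is an assumption.
CMD = ["./all_reduce_perf", "-b", "1M", "-e", "2g", "-g", "2", "-f", "2"]


def nccl_debug_env(base=None):
    """Return a copy of the environment with NCCL debug logging enabled."""
    env = dict(os.environ if base is None else base)
    env["NCCL_DEBUG"] = "INFO"     # log NCCL init and transport selection
    env["NCCL_P2P_DISABLE"] = "0"  # keep P2P enabled, as in the report
    env["NCCL_P2P_LEVEL"] = "SYS"  # as in the report
    return env


if __name__ == "__main__":
    if os.path.exists(CMD[0]):
        result = subprocess.run(
            CMD, env=nccl_debug_env(), capture_output=True, text=True
        )
        print(result.stdout)
    else:
        print(f"{CMD[0]} not found; build nccl-tests first")
```

In the resulting log, lines containing `via P2P` show which transport each channel ended up using.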

ywxc1997 commented 3 days ago

Yes, I tried to obtain useful information, but I didn't find any abnormalities:

```
nThread 1 nGpus 2 minBytes 1048576 maxBytes 2147483648 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0

Using devices
Rank 0 Group 0 Pid 1280 on gpus278 device 0 [0x16] NVIDIA GeForce RTX 4090
Rank 1 Group 0 Pid 1280 on gpus278 device 1 [0x36] NVIDIA GeForce RTX 4090
gpus278:1280:1280 [0] NCCL INFO Bootstrap : Using bond0:10.176.2.78<0>
gpus278:1280:1280 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
gpus278:1280:1280 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6)
gpus278:1280:1280 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
gpus278:1280:1280 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v6)
gpus278:1280:1280 [1] NCCL INFO cudaDriverVersion 12040
NCCL version 2.19.3+cuda12.3
gpus278:1280:1297 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
gpus278:1280:1297 [1] NCCL INFO P2P plugin IBext
gpus278:1280:1297 [1] NCCL INFO NET/IB : Using [0]mlx5_bond_0:1/RoCE [RO]; OOB bond0:10.176.2.78<0>
gpus278:1280:1297 [1] NCCL INFO Using non-device net plugin version 0
gpus278:1280:1296 [0] NCCL INFO Using non-device net plugin version 0
gpus278:1280:1297 [1] NCCL INFO Using network IBext
gpus278:1280:1296 [0] NCCL INFO Using network IBext
gpus278:1280:1297 [1] NCCL INFO comm 0x561f3b612c40 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 36000 commId 0x791e81b38cc809b4 - Init START
gpus278:1280:1296 [0] NCCL INFO comm 0x561f3b60e070 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 16000 commId 0x791e81b38cc809b4 - Init START
gpus278:1280:1297 [1] NCCL INFO NCCL_P2P_LEVEL set by environment to SYS
gpus278:1280:1297 [1] NCCL INFO Setting affinity for GPU 1 to ffffffff,00000000,ffffffff
gpus278:1280:1296 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,00000000,ffffffff
gpus278:1280:1297 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
gpus278:1280:1297 [1] NCCL INFO P2P Chunksize set to 131072
gpus278:1280:1296 [0] NCCL INFO Channel 00/02 : 0 1
gpus278:1280:1296 [0] NCCL INFO Channel 01/02 : 0 1
gpus278:1280:1296 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
gpus278:1280:1296 [0] NCCL INFO P2P Chunksize set to 131072
gpus278:1280:1297 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/direct pointer
gpus278:1280:1296 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/direct pointer
gpus278:1280:1297 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/direct pointer
gpus278:1280:1296 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/direct pointer
gpus278:1280:1296 [0] NCCL INFO Connected all rings
gpus278:1280:1297 [1] NCCL INFO Connected all rings
gpus278:1280:1297 [1] NCCL INFO Connected all trees
gpus278:1280:1297 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
gpus278:1280:1297 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
gpus278:1280:1296 [0] NCCL INFO Connected all trees
gpus278:1280:1296 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
gpus278:1280:1296 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
gpus278:1280:1297 [1] NCCL INFO comm 0x561f3b612c40 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 36000 commId 0x791e81b38cc809b4 - Init COMPLETE
gpus278:1280:1296 [0] NCCL INFO comm 0x561f3b60e070 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 16000 commId 0x791e81b38cc809b4 - Init COMPLETE

                                                            out-of-place                      in-place
      size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
       (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
   1048576        262144     float     sum      -1    71.08   14.75   14.75      0    71.50   14.66   14.66      0
   2097152        524288     float     sum      -1    119.1   17.61   17.61      0    118.7   17.67   17.67      0
   4194304       1048576     float     sum      -1    220.1   19.05   19.05      0    220.0   19.06   19.06      0
   8388608       2097152     float     sum      -1    422.3   19.87   19.87      0    421.9   19.88   19.88      0
  16777216       4194304     float     sum      -1    832.0   20.16   20.16      0    831.4   20.18   20.18      0
  33554432       8388608     float     sum      -1   1651.0   20.32   20.32      0   1650.6   20.33   20.33      0
  67108864      16777216     float     sum      -1   3289.3   20.40   20.40      0   3288.1   20.41   20.41      0
 134217728      33554432     float     sum      -1   6563.4   20.45   20.45      0   6563.1   20.45   20.45      0
 268435456      67108864     float     sum      -1    13112   20.47   20.47      0    13112   20.47   20.47      0
 536870912     134217728     float     sum      -1    26210   20.48   20.48      0    26207   20.49   20.49      0
1073741824     268435456     float     sum      -1    52403   20.49   20.49      0    52399   20.49   20.49      0
2147483648     536870912     float     sum      -1   104789   20.49   20.49      0   104783   20.49   20.49      0
gpus278:1280:1280 [1] NCCL INFO comm 0x561f3b60e070 rank 0 nranks 2 cudaDev 0 busId 16000 - Destroy COMPLETE
gpus278:1280:1280 [1] NCCL INFO comm 0x561f3b612c40 rank 1 nranks 2 cudaDev 1 busId 36000 - Destroy COMPLETE
Out of bounds values : 0 OK
Avg bus bandwidth : 19.5483
```
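As a sanity check on how to read these numbers, here is a small Python sketch of how the `algbw` and `busbw` columns relate to size and time. It follows the nccl-tests convention that algorithm bandwidth is size divided by time, and that all-reduce bus bandwidth is algbw scaled by 2(n-1)/n, which equals 1 for two ranks. The function names are hypothetical; the sample rows are copied from the output above.

```python
def algbw_gbps(size_bytes, time_us):
    """Algorithm bandwidth in GB/s (1 GB = 1e9 bytes)."""
    # bytes per microsecond equals MB/s * 1e0, so divide by 1e3 for GB/s
    return size_bytes / time_us / 1e3


def allreduce_busbw_gbps(algbw, n_ranks):
    """Bus bandwidth for all-reduce: algbw * 2 * (n - 1) / n."""
    return algbw * 2 * (n_ranks - 1) / n_ranks


# (size in bytes, out-of-place time in microseconds) from the run above
ROWS = [(1048576, 71.08), (67108864, 3289.3), (2147483648, 104789)]

for size, time_us in ROWS:
    alg = algbw_gbps(size, time_us)
    bus = allreduce_busbw_gbps(alg, n_ranks=2)
    print(f"{size:>10} B  algbw {alg:.2f} GB/s  busbw {bus:.2f} GB/s")
```

With only two GPUs the two columns are identical, so the roughly 20.5 GB/s plateau at large sizes is the figure to compare against the 25 GB/s seen on the AMD machines.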

ywxc1997 commented 3 days ago

The P2P bandwidth has increased from 16 GB/s to 20 GB/s, but that is still far below the 25 GB/s that the AMD PCIe 4.0 machines achieve, even though the Intel machines are PCIe 5.0.

mylesgoose commented 3 days ago

@ywxc1997 This seems like a platform issue rather than a driver bug. 🤔 Also, NVIDIA GPUs can sometimes drop the link speed to save power. Agreed, the RTX card is capped at PCIe 4.0 bandwidth even if you put it into a PCIe 5.0 slot, yet it should run at the same speed on Intel and AMD systems. It could also be a BIOS setting managing the link speed, for example if the Intel system is set to auto-detect and the AMD system is fixed to PCIe 4.0. 20 GB/s is pretty fast, though.
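To put the PCIe 4.0 cap in numbers, here is a rough Python sketch of the theoretical per-direction link bandwidth. It assumes a full x16 link at 16 GT/s per lane with 128b/130b encoding; the actual negotiated width and speed on these machines are not stated in the thread, and the function name is hypothetical. Packet and protocol overhead is ignored, so real transfers land below this ceiling.

```python
def pcie_link_gbps(gt_per_s, lanes, encoding=128 / 130):
    """Theoretical one-direction PCIe bandwidth in GB/s, before protocol overhead."""
    # GT/s per lane * lanes * encoding efficiency gives Gbit/s; divide by 8 for GB/s
    return gt_per_s * lanes * encoding / 8


# Assumption: PCIe 4.0 (16 GT/s) at x16 width
ceiling = pcie_link_gbps(gt_per_s=16, lanes=16)
print(f"PCIe 4.0 x16 ceiling: {ceiling:.2f} GB/s")

# Measured figures reported in this thread
for label, measured in [("Intel", 20.0), ("AMD", 25.0)]:
    print(f"{label}: {measured:.0f} GB/s = {measured / ceiling:.0%} of ceiling")
```

Under these assumptions both platforms sit below the link ceiling, which is consistent with the difference coming from how each platform routes peer-to-peer traffic rather than from the GPU link itself.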

ywxc1997 commented 3 days ago

I completely agree with your viewpoint. I have tried many BIOS configurations and PCIe configuration adjustments, but none of them have worked. I don't think PCIe P2P is a focus for the CPU, and machine models with PCIe switches may show significant improvements. Thank you for your reply. I will close this issue.

tinygrad / open-gpu-kernel-modules