Closed luoguohao closed 2 years ago
Environment
Command
mpirun --allow-run-as-root -np 8 \
  --hostfile centos8-hostfile \
  --mca orte_base_help_aggregate 0 \
  --mca btl tcp,vader,self \
  --mca plm_rsh_args "-p 8022" \
  --mca btl_tcp_if_include eth0 \
  -bind-to none -oversubscribe \
  --map-by slot \
  -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
  -x NCCL_SOCKET_IFNAME=eth0 \
  -x NCCL_IB_DISABLE=1 \
  nccl-tests/build/all_reduce_perf -b 8 -e 1024M -f 5 -g 1 -o all -n 500 -w 10
Performance with FastSocket plugin enabled
# nThread 1 nGpus 1 minBytes 8 maxBytes 1073741824 step: 5(factor) warmup iters: 10 iters: 500 validation: 1
#
# Using devices
#   Rank  0 Pid    418 on ml-gpu-ser423 device  0 [0x02] Tesla P40
#   Rank  1 Pid    419 on ml-gpu-ser423 device  1 [0x03] Tesla P40
#   Rank  2 Pid    420 on ml-gpu-ser423 device  2 [0x83] Tesla P40
#   Rank  3 Pid    421 on ml-gpu-ser423 device  3 [0x84] Tesla P40
#   Rank  4 Pid    488 on ml-gpu-ser604 device  0 [0x02] Tesla P40
#   Rank  5 Pid    489 on ml-gpu-ser604 device  1 [0x03] Tesla P40
#   Rank  6 Pid    490 on ml-gpu-ser604 device  2 [0x83] Tesla P40
#   Rank  7 Pid    491 on ml-gpu-ser604 device  3 [0x84] Tesla P40
#
#                                                     out-of-place                      in-place
#       size         count      type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                       (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           8             2     float     avg    77.44    0.00    0.00  9e-10    77.03    0.00    0.00  9e-10
          40            10     float     avg    77.25    0.00    0.00  9e-10    77.08    0.00    0.00  9e-10
         200            50     float     avg    78.30    0.00    0.00  9e-10    78.19    0.00    0.00  9e-10
        1000           250     float     avg    87.68    0.01    0.02  3e-08    87.59    0.01    0.02  3e-08
        5000          1250     float     avg    108.3    0.05    0.08  3e-08    108.4    0.05    0.08  3e-08
       25000          6250     float     avg    261.9    0.10    0.17  3e-08    271.6    0.09    0.16  3e-08
      125000         31250     float     avg    411.9    0.30    0.53  3e-08    420.6    0.30    0.52  3e-08
      625000        156250     float     avg    999.8    0.63    1.09  3e-08    977.8    0.64    1.12  3e-08
     3125000        781250     float     avg   4749.9    0.66    1.15  3e-08   4835.7    0.65    1.13  3e-08
    15625000       3906250     float     avg    15131    1.03    1.81  3e-08    15210    1.03    1.80  3e-08
    78125000      19531250     float     avg    71686    1.09    1.91  3e-08    71619    1.09    1.91  3e-08
   390625000      97656250     float     avg   336844    1.16    2.03  3e-08   337039    1.16    2.03  3e-08
Performance with FastSocket plugin disabled
# nThread 1 nGpus 1 minBytes 8 maxBytes 1073741824 step: 5(factor) warmup iters: 10 iters: 500 validation: 1
#
# Using devices
#   Rank  0 Pid    418 on ml-gpu-ser423 device  0 [0x02] Tesla P40
#   Rank  1 Pid    419 on ml-gpu-ser423 device  1 [0x03] Tesla P40
#   Rank  2 Pid    420 on ml-gpu-ser423 device  2 [0x83] Tesla P40
#   Rank  3 Pid    421 on ml-gpu-ser423 device  3 [0x84] Tesla P40
#   Rank  4 Pid    488 on ml-gpu-ser604 device  0 [0x02] Tesla P40
#   Rank  5 Pid    489 on ml-gpu-ser604 device  1 [0x03] Tesla P40
#   Rank  6 Pid    490 on ml-gpu-ser604 device  2 [0x83] Tesla P40
#   Rank  7 Pid    491 on ml-gpu-ser604 device  3 [0x84] Tesla P40
#
#                                                     out-of-place                      in-place
#       size         count      type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                       (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           8             2     float     avg    149.6    0.00    0.00  9e-10    138.9    0.00    0.00  9e-10
          40            10     float     avg    138.0    0.00    0.00  9e-10    102.3    0.00    0.00  9e-10
         200            50     float     avg    96.76    0.00    0.00  9e-10    96.16    0.00    0.00  9e-10
        1000           250     float     avg    82.18    0.01    0.02  3e-08    82.30    0.01    0.02  3e-08
        5000          1250     float     avg    103.8    0.05    0.08  3e-08    102.4    0.05    0.09  3e-08
       25000          6250     float     avg    225.3    0.11    0.19  3e-08    225.4    0.11    0.19  3e-08
      125000         31250     float     avg    346.6    0.36    0.63  3e-08    345.5    0.36    0.63  3e-08
      625000        156250     float     avg    961.4    0.65    1.14  3e-08    968.0    0.65    1.13  3e-08
     3125000        781250     float     avg   4677.0    0.67    1.17  3e-08   4684.6    0.67    1.17  3e-08
    15625000       3906250     float     avg    13943    1.12    1.96  3e-08    13941    1.12    1.96  3e-08
    78125000      19531250     float     avg    68384    1.14    2.00  3e-08    68389    1.14    2.00  3e-08
   390625000      97656250     float     avg   333850    1.17    2.05  3e-08   333890    1.17    2.05  3e-08
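To make the two runs above easier to compare, here is a small sketch that computes the relative change in out-of-place time for a few message sizes. The times are copied from the two tables; the helper name `relative_change` and the choice of rows are illustrative assumptions, not part of nccl-tests or the FastSocket plugin.

```python
# Out-of-place times in microseconds, copied from the two runs above.
# Key: message size in bytes. Value: (FastSocket enabled, FastSocket disabled).
times_us = {
    8: (77.44, 149.6),
    625000: (999.8, 961.4),
    15625000: (15131, 13943),
    390625000: (336844, 333850),
}


def relative_change(enabled, disabled):
    """Percent change in time with the plugin enabled (negative means faster)."""
    return (enabled - disabled) / disabled * 100.0


for size, (enabled, disabled) in times_us.items():
    print(f"{size:>10} B: {relative_change(enabled, disabled):+.1f}%")
```

For the largest size the two runs differ by less than 1%. The large gap at 8 bytes may reflect run-to-run noise rather than the plugin itself; that is an assumption the thread does not confirm, and repeating each run would be needed to tell.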
Does anyone have any suggestions? Am I running the right performance tests?
The busbw in your test is about 2GB/s, which is already saturating the 10Gbps NIC bandwidth. I would recommend using 100GbE networks for more significant improvements.
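For reference, the busbw figure quoted here can be reproduced from the size and time columns. The sketch below assumes the all-reduce definitions documented by nccl-tests (algbw = size / time, busbw = algbw × 2(n−1)/n, with 1 GB = 10^9 bytes); the function name `allreduce_bandwidth` is hypothetical.

```python
def allreduce_bandwidth(size_bytes, time_us, n_ranks):
    """Return (algbw, busbw) in GB/s for one all-reduce measurement.

    Assumes the nccl-tests convention: algbw = size / time and
    busbw = algbw * 2 * (n - 1) / n, with 1 GB = 1e9 bytes.
    """
    algbw = size_bytes / (time_us * 1e-6) / 1e9
    busbw = algbw * 2 * (n_ranks - 1) / n_ranks
    return algbw, busbw


# Largest message from the run with the plugin disabled, 8 ranks in total.
algbw, busbw = allreduce_bandwidth(390625000, 333850, 8)
print(f"algbw={algbw:.2f} GB/s busbw={busbw:.2f} GB/s")
```

With 8 ranks the correction factor is 2 × 7 / 8 = 1.75, which is why the busbw column is consistently about 1.75 times the algbw column.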