google / nccl-fastsocket

NCCL Fast Socket is a transport layer plugin to improve NCCL collective communication performance on Google Cloud.

NCCL all_reduce performance test on 2 nodes with 10Gbps bandwidth shows no improvement after enabling the fastsocket plugin #2

Closed luoguohao closed 2 years ago

luoguohao commented 2 years ago

Environment

Command

mpirun --allow-run-as-root -np 8 \
       --hostfile centos8-hostfile \
       --mca orte_base_help_aggregate 0 \
       --mca btl tcp,vader,self \
       --mca plm_rsh_args "-p 8022" \
       --mca btl_tcp_if_include eth0 \
       -bind-to none -oversubscribe \
       --map-by slot \
       -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
       -x NCCL_SOCKET_IFNAME=eth0 \
       -x NCCL_IB_DISABLE=1 \
      nccl-tests/build/all_reduce_perf -b 8 -e 1024M -f 5 -g 1 -o all -n 500 -w 10

Performance with FastSocket plugin enabled

# nThread 1 nGpus 1 minBytes 8 maxBytes 1073741824 step: 5(factor) warmup iters: 10 iters: 500 validation: 1
#
# Using devices
#   Rank  0 Pid    418 on ml-gpu-ser423 device  0 [0x02] Tesla P40
#   Rank  1 Pid    419 on ml-gpu-ser423 device  1 [0x03] Tesla P40
#   Rank  2 Pid    420 on ml-gpu-ser423 device  2 [0x83] Tesla P40
#   Rank  3 Pid    421 on ml-gpu-ser423 device  3 [0x84] Tesla P40
#   Rank  4 Pid    488 on ml-gpu-ser604 device  0 [0x02] Tesla P40
#   Rank  5 Pid    489 on ml-gpu-ser604 device  1 [0x03] Tesla P40
#   Rank  6 Pid    490 on ml-gpu-ser604 device  2 [0x83] Tesla P40
#   Rank  7 Pid    491 on ml-gpu-ser604 device  3 [0x84] Tesla P40

#                                                       out-of-place                       in-place
#       size         count      type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                       (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           8             2     float     avg    77.44    0.00    0.00  9e-10    77.03    0.00    0.00  9e-10
          40            10     float     avg    77.25    0.00    0.00  9e-10    77.08    0.00    0.00  9e-10
         200            50     float     avg    78.30    0.00    0.00  9e-10    78.19    0.00    0.00  9e-10
        1000           250     float     avg    87.68    0.01    0.02  3e-08    87.59    0.01    0.02  3e-08
        5000          1250     float     avg    108.3    0.05    0.08  3e-08    108.4    0.05    0.08  3e-08
       25000          6250     float     avg    261.9    0.10    0.17  3e-08    271.6    0.09    0.16  3e-08
      125000         31250     float     avg    411.9    0.30    0.53  3e-08    420.6    0.30    0.52  3e-08
      625000        156250     float     avg    999.8    0.63    1.09  3e-08    977.8    0.64    1.12  3e-08
     3125000        781250     float     avg   4749.9    0.66    1.15  3e-08   4835.7    0.65    1.13  3e-08
    15625000       3906250     float     avg    15131    1.03    1.81  3e-08    15210    1.03    1.80  3e-08
    78125000      19531250     float     avg    71686    1.09    1.91  3e-08    71619    1.09    1.91  3e-08
   390625000      97656250     float     avg   336844    1.16    2.03  3e-08   337039    1.16    2.03  3e-08

Performance with FastSocket plugin disabled

# nThread 1 nGpus 1 minBytes 8 maxBytes 1073741824 step: 5(factor) warmup iters: 10 iters: 500 validation: 1
#
# Using devices
#   Rank  0 Pid    418 on ml-gpu-ser423 device  0 [0x02] Tesla P40
#   Rank  1 Pid    419 on ml-gpu-ser423 device  1 [0x03] Tesla P40
#   Rank  2 Pid    420 on ml-gpu-ser423 device  2 [0x83] Tesla P40
#   Rank  3 Pid    421 on ml-gpu-ser423 device  3 [0x84] Tesla P40
#   Rank  4 Pid    488 on ml-gpu-ser604 device  0 [0x02] Tesla P40
#   Rank  5 Pid    489 on ml-gpu-ser604 device  1 [0x03] Tesla P40
#   Rank  6 Pid    490 on ml-gpu-ser604 device  2 [0x83] Tesla P40
#   Rank  7 Pid    491 on ml-gpu-ser604 device  3 [0x84] Tesla P40

#                                                       out-of-place                       in-place
#       size         count      type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                       (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           8             2     float     avg    149.6    0.00    0.00  9e-10    138.9    0.00    0.00  9e-10
          40            10     float     avg    138.0    0.00    0.00  9e-10    102.3    0.00    0.00  9e-10
         200            50     float     avg    96.76    0.00    0.00  9e-10    96.16    0.00    0.00  9e-10
        1000           250     float     avg    82.18    0.01    0.02  3e-08    82.30    0.01    0.02  3e-08
        5000          1250     float     avg    103.8    0.05    0.08  3e-08    102.4    0.05    0.09  3e-08
       25000          6250     float     avg    225.3    0.11    0.19  3e-08    225.4    0.11    0.19  3e-08
      125000         31250     float     avg    346.6    0.36    0.63  3e-08    345.5    0.36    0.63  3e-08
      625000        156250     float     avg    961.4    0.65    1.14  3e-08    968.0    0.65    1.13  3e-08
     3125000        781250     float     avg   4677.0    0.67    1.17  3e-08   4684.6    0.67    1.17  3e-08
    15625000       3906250     float     avg    13943    1.12    1.96  3e-08    13941    1.12    1.96  3e-08
    78125000      19531250     float     avg    68384    1.14    2.00  3e-08    68389    1.14    2.00  3e-08
   390625000      97656250     float     avg   333850    1.17    2.05  3e-08   333890    1.17    2.05  3e-08

Does anyone have any suggestions? Am I running the right performance tests?

changlan commented 2 years ago

The busbw in your test is about 2GB/s, which is already saturating the 10Gbps NIC bandwidth. I would recommend using 100GbE networks for more significant improvements.
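
To make the bandwidth figures in this thread easier to check, here is a minimal Python sketch that recomputes algbw and busbw for the largest message size in the two tables above and converts algbw to Gbit/s for comparison with the 10 Gbps link. It assumes the all-reduce bus bandwidth factor 2(n-1)/n described in the nccl-tests performance notes and 1 GB = 1e9 bytes; the function names are illustrative and not part of NCCL or nccl-tests.

```python
def algbw_gbs(size_bytes: int, time_us: float) -> float:
    """Algorithm bandwidth in GB/s: message size divided by elapsed time."""
    return size_bytes / (time_us * 1e-6) / 1e9


def allreduce_busbw_gbs(algbw: float, n_ranks: int) -> float:
    """Bus bandwidth for all-reduce, assuming the 2*(n-1)/n correction factor."""
    return algbw * 2 * (n_ranks - 1) / n_ranks


def gbs_to_gbit(gb_per_s: float) -> float:
    """Convert GB/s to Gbit/s (8 bits per byte)."""
    return gb_per_s * 8


if __name__ == "__main__":
    size_bytes = 390_625_000  # largest message size in the tables above
    n_ranks = 8               # 2 nodes x 4 GPUs
    # Out-of-place times (us) copied from the last row of each table.
    runs = [("plugin enabled", 336844.0), ("plugin disabled", 333850.0)]
    for label, time_us in runs:
        a = algbw_gbs(size_bytes, time_us)
        b = allreduce_busbw_gbs(a, n_ranks)
        print(f"{label}: algbw={a:.2f} GB/s "
              f"({gbs_to_gbit(a):.1f} Gbit/s), busbw={b:.2f} GB/s")
```

Run on the last row of each table, this gives the reported 1.16 / 2.03 GB/s and 1.17 / 2.05 GB/s. Either way, algbw works out to a little over 9 Gbit/s, close to the 10 Gbps line rate, so there is little headroom left for a transport plugin to recover.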