microsoft / mscclpp

MSCCL++: A GPU-driven communication stack for scalable AI applications

[Perf] Failed to reproduce the performance result for Single-node AllReduce mentioned in README.md #362

Open FC-Li opened 1 week ago

FC-Li commented 1 week ago

Hi, I am trying to compare the allreduce performance of mscclpp and nccl. I used the following scripts to collect the performance metrics for each. For mscclpp:

log_dir="logs"
mkdir -p ${log_dir}

# sweep the allreduce kernel variants exposed via -k (0..7)
for((k=0;k<8;k++));
do
    maxBytes=4g
    # cap the sweep at 2 GB for kernel variant 2
    if ((k==2))
    then
        maxBytes=2g
    fi
    mpirun --bind-to numa -np 8 ./build/test/mscclpp-test/allreduce_test_perf \
        -b 3m \
        -e ${maxBytes} \
        -G 10 \
        -n 30 \
        -w 10 \
        -f 2 \
        -k ${k} \
        -o "${log_dir}/ar.txt"
done

For nccl, using nccl-tests from Nvidia:

# -f 2 doubles the size each step; -G 10 replays via CUDA graphs; -d int32 sets the data type; -R 0 disables local buffer registration
mpirun -np 8 ./build/all_reduce_perf \
    -b 3M \
    -e 4G \
    -G 10 \
    -f 2 \
    -d int32 \
    -R 0

What mscclpp got is:

[figure: mscclpp allreduce results]

As shown in the figure above, mscclpp's maximum algBw stays below 100 GB/s, while README.md reports around 140 GB/s.

What nccl got is:

[figure: nccl allreduce results]

The achieved maximum algBw roughly matches README.md.

My nccl version is libnccl.so.2.20.5, and all tests were carried out on a machine with 8x H800 GPUs.

As we can see, nccl's performance is much better than mscclpp's, especially at large message sizes. I suspect mscclpp is not running in its best configuration. It would help me reproduce the performance declared in README.md if you could share the test script and the environment configuration your tests were run with.

Binyang2014 commented 1 week ago

Could you try the python benchmark? https://microsoft.github.io/mscclpp/getting-started/quickstart.html#performance-benchmark. The mscclpp-test may not be suitable for your case.
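
For reference, running it looks roughly like this (the script path follows the mscclpp repo layout; this is only a sketch, see the quickstart link above for the authoritative command):

# run the mscclpp python allreduce benchmark across the 8 local GPUs
mpirun -np 8 python3 ./python/mscclpp_benchmark/allreduce_bench.py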

Also, it seems you enabled nvlink-sharp (NVLS) for nccl-test but ran mscclpp-test without it. Testing with the python benchmark will enable nvlink-sharp in both scenarios.
Besides, our result is for an A100 machine. The NVLink bandwidth is 600 GB/s on A100 but 400 GB/s on H800, so it makes sense that the A100 machine is faster than the H800.
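
If you want an apples-to-apples check against mscclpp-test in the meantime, NVLS can be turned off on the NCCL side via its environment variable (a sketch, assuming NCCL 2.17+ and Open MPI's -x syntax):

# disable NVLink SHARP (NVLS) in NCCL for the nccl-tests run
mpirun -np 8 -x NCCL_NVLS_ENABLE=0 ./build/all_reduce_perf -b 3M -e 4G -G 10 -f 2 -d int32 -R 0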

FC-Li commented 1 week ago

@Binyang2014 Thank you for your swift reply. I'll give the python benchmark a try and share the results with you later.

FC-Li commented 1 week ago

@Binyang2014 My kernel version is 4.18.0, older than 5.6, so I ran into problems compiling the mscclpp python benchmark with NVLS support. Updating the OS would take some time, so instead I turned NVLS off for nccl to see what would happen. The following is what I got; the result now makes sense and aligns with your hypothesis.

[figure: nccl allreduce results with NVLS disabled]

chhwang commented 5 days ago

Please make sure the nvidia_peermem driver is loaded on your machine. https://github.com/microsoft/mscclpp/blob/main/docs/quickstart.md#prerequisites
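
A quick way to verify (standard Linux module commands, shown as a sketch):

# check whether the nvidia_peermem kernel module is loaded
lsmod | grep nvidia_peermem
# load it if missing
sudo modprobe nvidia_peermem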

chhwang commented 5 days ago

Oh, I missed that you are using H800 GPUs. In that case, your numbers already make sense.