openucx / ucc

Unified Collective Communication Library
https://openucx.github.io/ucc/
BSD 3-Clause "New" or "Revised" License

TL/UCP: Allgather Bruck algorithm #898

Closed ikryukov closed 4 months ago

ikryukov commented 5 months ago

What

Implementation of the Bruck algorithm for the allgather collective.

Why?

This algorithm has O(log(N)) complexity in the number of communication steps and shows great performance on small (1-2 KB) messages (according to research: https://arxiv.org/pdf/2109.08751.pdf)
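To illustrate where the O(log(N)) step count comes from, here is a minimal single-process sketch of the Bruck allgather exchange pattern. It is not the code from this PR: the function name `bruck_allgather` and the list-based "buffers" are assumptions made for illustration only, and real network sends and receives are replaced by list copies. At each step with distance d, rank r receives up to d blocks from rank (r + d) mod N, so the amount of gathered data roughly doubles per step; a final local rotation puts blocks into rank order.

```python
def bruck_allgather(blocks):
    """Simulate a Bruck allgather over len(blocks) ranks.

    blocks[r] is the single block contributed by rank r.
    Returns (result, steps): result[r] is the gathered list seen by rank r,
    steps is the number of communication rounds that were needed.
    """
    n = len(blocks)
    # Each rank starts with only its own block, stored at local position 0.
    bufs = [[b] for b in blocks]
    steps = 0
    dist = 1
    while dist < n:
        # In the last round fewer than `dist` blocks may be missing.
        count = min(dist, n - dist)
        # Snapshot outgoing data so all ranks exchange "simultaneously".
        outgoing = [bufs[r][:count] for r in range(n)]
        for r in range(n):
            src = (r + dist) % n  # rank r receives from r + dist
            bufs[r].extend(outgoing[src])
        dist *= 2
        steps += 1
    # Rank r now holds blocks r, r+1, ..., r+n-1 (mod n); rotate into rank order.
    result = []
    for r in range(n):
        ordered = [None] * n
        for i, blk in enumerate(bufs[r]):
            ordered[(r + i) % n] = blk
        result.append(ordered)
    return result, steps
```

For example, `bruck_allgather(list(range(13)))` finishes in 4 rounds (distances 1, 2, 4, 8), and every simulated rank ends up with blocks 0 through 12 in order.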

swx-jenkins3 commented 5 months ago

Can one of the admins verify this patch?

ikryukov commented 5 months ago

Test command:
mpirun -x UCC_TL_UCP_TUNE=allgather:0-inf:@3 --mca coll ^hcoll --mca coll_ucc_enable 0 -x LD_LIBRARY_PATH=/home/ikryukov/work/ucc/install/lib:/home/ikryukov/work/ucx/install/lib:$LD_LIBRARY_PATH -x UCC_CLS=basic -x UCC_TLS=ucp -x UCC_CL_BASIC_TLS=ucp -x UCC_LOG_LEVEL=info -np 16 ./install/bin/ucc_test_mpi -c allgather -O 0 -v

Perf test:
mpirun -x UCC_TL_UCP_TUNE=allgather:0-inf:@3 --mca coll ^hcoll --mca coll_ucc_enable 0 -x LD_LIBRARY_PATH=/home/ikryukov/work/ucc/install/lib:/home/ikryukov/work/ucx/install/lib:$LD_LIBRARY_PATH -x UCC_CLS=basic -x UCC_TLS=ucp -x UCC_CL_BASIC_TLS=ucp -np 13 ./install/bin/ucc_perftest -c allgather -F -b 1 -e 4k

Sergei-Lebedev commented 4 months ago

ok to test

Sergei-Lebedev commented 4 months ago

bot:retest

Sergei-Lebedev commented 4 months ago

The CI failure seems to be relevant:

13:40:50  [ RUN      ] test_allgather_alg.alg/int8_Cuda_count_1_inplace_1_bruck
13:40:50  [swx-clx01:196  :0:196] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7fc5b7600000)
13:40:50  ==== backtrace (tid:    196) ====
13:40:50   0  /opt/nvidia/bin/ucx/build-release-mt/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7fc5fe76b564]
13:40:50   1  /opt/nvidia/bin/ucx/build-release-mt/lib/libucs.so.0(+0x3375f) [0x7fc5fe76b75f]
13:40:50   2  /opt/nvidia/bin/ucx/build-release-mt/lib/libucs.so.0(+0x33a46) [0x7fc5fe76ba46]
13:40:50   3  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7fc5fdfee520]
13:40:50   4  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x1a7e94) [0x7fc5fe153e94]
13:40:50   5  /opt/nvidia/src/ucc/build/src/.libs/ucc/libucc_mc_cpu.so(+0x137d) [0x7fc5fc1cc37d]
13:40:50   6  /opt/nvidia/src/ucc/build/src/.libs/ucc/libucc_tl_ucp.so(ucc_tl_ucp_allgather_bruck_progress+0x10fe) [0x7fc5eea08a5e]
13:40:50   7  /opt/nvidia/src/ucc/build/src/.libs/libucc.so.1(+0x12bfb) [0x7fc5fe710bfb]
13:40:50   8  /opt/nvidia/src/ucc/build/src/.libs/libucc.so.1(ucc_context_progress+0x3e) [0x7fc5fe70b3fe]
13:40:50   9  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x556f80) [0x56533c3ecf80]
13:40:50  10  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x8d46db) [0x56533c76a6db]
13:40:50  11  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x5549d1) [0x56533c3ea9d1]
13:40:50  12  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x547caa) [0x56533c3ddcaa]
13:40:50  13  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x548322) [0x56533c3de322]
13:40:50  14  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x5485ae) [0x56533c3de5ae]
13:40:50  15  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x549689) [0x56533c3df689]
13:40:50  16  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x549b88) [0x56533c3dfb88]
13:40:50  17  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x50fe65) [0x56533c3a5e65]
13:40:50  18  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7fc5fdfd5d90]
13:40:50  19  /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7fc5fdfd5e40]
13:40:50  20  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x5216d5) [0x56533c3b76d5]
13:40:50  =================================
13:40:50  make[1]: Leaving directory '/opt/nvidia/src/ucc/build/test/gtest'
13:40:50  make[1]: *** [Makefile:1960: test] Segmentation fault (core dumped)
13:40:50  make: *** [Makefile:995: gtest] Error 2

Sergei-Lebedev commented 4 months ago

bot:retest

Sergei-Lebedev commented 4 months ago

22:16:15  [ RUN      ] test_allgather_alg.alg/int8_Cuda_count_1_inplace_1_bruck
22:16:15  [swx-clx01:402  :0:402] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7f2716800002)
22:16:15  ==== backtrace (tid:    402) ====
22:16:15   0  /opt/nvidia/bin/ucx/build-release-mt/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7f275f5d3564]
22:16:15   1  /opt/nvidia/bin/ucx/build-release-mt/lib/libucs.so.0(+0x3375f) [0x7f275f5d375f]
22:16:15   2  /opt/nvidia/bin/ucx/build-release-mt/lib/libucs.so.0(+0x33a46) [0x7f275f5d3a46]
22:16:15   3  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7f275edee520]
22:16:15   4  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x1a7e94) [0x7f275ef53e94]
22:16:15   5  /opt/nvidia/src/ucc/build/src/.libs/ucc/libucc_tl_ucp.so(ucc_tl_ucp_allgather_bruck_progress+0x1133) [0x7f275cfe8a93]
22:16:15   6  /opt/nvidia/src/ucc/build/src/.libs/libucc.so.1(+0x12bfb) [0x7f275f578bfb]
22:16:15   7  /opt/nvidia/src/ucc/build/src/.libs/libucc.so.1(ucc_context_progress+0x3e) [0x7f275f5733fe]
22:16:15   8  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x556f80) [0x564795aa4f80]
22:16:15   9  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x8d46db) [0x564795e226db]
22:16:15  10  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x5549d1) [0x564795aa29d1]
22:16:15  11  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x547caa) [0x564795a95caa]
22:16:15  12  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x548322) [0x564795a96322]
22:16:15  13  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x5485ae) [0x564795a965ae]
22:16:15  14  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x549689) [0x564795a97689]
22:16:15  15  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x549b88) [0x564795a97b88]
22:16:15  16  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x50fe65) [0x564795a5de65]
22:16:15  17  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f275edd5d90]
22:16:15  18  /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7f275edd5e40]
22:16:15  19  /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x5216d5) [0x564795a6f6d5]

ikryukov commented 4 months ago

bot:retest

Sergei-Lebedev commented 4 months ago

bot:retest