Closed by ikryukov 4 months ago
Can one of the admins verify this patch?
Test command:
mpirun -x UCC_TL_UCP_TUNE=allgather:0-inf:@3 --mca coll ^hcoll --mca coll_ucc_enable 0 -x LD_LIBRARY_PATH=/home/ikryukov/work/ucc/install/lib:/home/ikryukov/work/ucx/install/lib:$LD_LIBRARY_PATH -x UCC_CLS=basic -x UCC_TLS=ucp -x UCC_CL_BASIC_TLS=ucp -x UCC_LOG_LEVEL=info -np 16 ./install/bin/ucc_test_mpi -c allgather -O 0 -v
Perf test:
mpirun -x UCC_TL_UCP_TUNE=allgather:0-inf:@3 --mca coll ^hcoll --mca coll_ucc_enable 0 -x LD_LIBRARY_PATH=/home/ikryukov/work/ucc/install/lib:/home/ikryukov/work/ucx/install/lib:$LD_LIBRARY_PATH -x UCC_CLS=basic -x UCC_TLS=ucp -x UCC_CL_BASIC_TLS=ucp -np 13 ./install/bin/ucc_perftest -c allgather -F -b 1 -e 4k
ok to test
bot:retest
The CI failure seems to be related to this change:
13:40:50 [ RUN ] test_allgather_alg.alg/int8_Cuda_count_1_inplace_1_bruck
13:40:50 [swx-clx01:196 :0:196] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7fc5b7600000)
13:40:50 ==== backtrace (tid: 196) ====
13:40:50 0 /opt/nvidia/bin/ucx/build-release-mt/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7fc5fe76b564]
13:40:50 1 /opt/nvidia/bin/ucx/build-release-mt/lib/libucs.so.0(+0x3375f) [0x7fc5fe76b75f]
13:40:50 2 /opt/nvidia/bin/ucx/build-release-mt/lib/libucs.so.0(+0x33a46) [0x7fc5fe76ba46]
13:40:50 3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7fc5fdfee520]
13:40:50 4 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x1a7e94) [0x7fc5fe153e94]
13:40:50 5 /opt/nvidia/src/ucc/build/src/.libs/ucc/libucc_mc_cpu.so(+0x137d) [0x7fc5fc1cc37d]
13:40:50 6 /opt/nvidia/src/ucc/build/src/.libs/ucc/libucc_tl_ucp.so(ucc_tl_ucp_allgather_bruck_progress+0x10fe) [0x7fc5eea08a5e]
13:40:50 7 /opt/nvidia/src/ucc/build/src/.libs/libucc.so.1(+0x12bfb) [0x7fc5fe710bfb]
13:40:50 8 /opt/nvidia/src/ucc/build/src/.libs/libucc.so.1(ucc_context_progress+0x3e) [0x7fc5fe70b3fe]
13:40:50 9 /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x556f80) [0x56533c3ecf80]
13:40:50 10 /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x8d46db) [0x56533c76a6db]
13:40:50 11 /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x5549d1) [0x56533c3ea9d1]
13:40:50 12 /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x547caa) [0x56533c3ddcaa]
13:40:50 13 /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x548322) [0x56533c3de322]
13:40:50 14 /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x5485ae) [0x56533c3de5ae]
13:40:50 15 /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x549689) [0x56533c3df689]
13:40:50 16 /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x549b88) [0x56533c3dfb88]
13:40:50 17 /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x50fe65) [0x56533c3a5e65]
13:40:50 18 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7fc5fdfd5d90]
13:40:50 19 /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7fc5fdfd5e40]
13:40:50 20 /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x5216d5) [0x56533c3b76d5]
13:40:50 =================================
13:40:50 make[1]: Leaving directory '/opt/nvidia/src/ucc/build/test/gtest'
13:40:50 make[1]: *** [Makefile:1960: test] Segmentation fault (core dumped)
13:40:50 make: *** [Makefile:995: gtest] Error 2
bot:retest
22:16:15 [ RUN ] test_allgather_alg.alg/int8_Cuda_count_1_inplace_1_bruck
22:16:15 [swx-clx01:402 :0:402] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7f2716800002)
22:16:15 ==== backtrace (tid: 402) ====
22:16:15 0 /opt/nvidia/bin/ucx/build-release-mt/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7f275f5d3564]
22:16:15 1 /opt/nvidia/bin/ucx/build-release-mt/lib/libucs.so.0(+0x3375f) [0x7f275f5d375f]
22:16:15 2 /opt/nvidia/bin/ucx/build-release-mt/lib/libucs.so.0(+0x33a46) [0x7f275f5d3a46]
22:16:15 3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7f275edee520]
22:16:15 4 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x1a7e94) [0x7f275ef53e94]
22:16:15 5 /opt/nvidia/src/ucc/build/src/.libs/ucc/libucc_tl_ucp.so(ucc_tl_ucp_allgather_bruck_progress+0x1133) [0x7f275cfe8a93]
22:16:15 6 /opt/nvidia/src/ucc/build/src/.libs/libucc.so.1(+0x12bfb) [0x7f275f578bfb]
22:16:15 7 /opt/nvidia/src/ucc/build/src/.libs/libucc.so.1(ucc_context_progress+0x3e) [0x7f275f5733fe]
22:16:15 8 /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x556f80) [0x564795aa4f80]
22:16:15 9 /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x8d46db) [0x564795e226db]
22:16:15 10 /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x5549d1) [0x564795aa29d1]
22:16:15 11 /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x547caa) [0x564795a95caa]
22:16:15 12 /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x548322) [0x564795a96322]
22:16:15 13 /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x5485ae) [0x564795a965ae]
22:16:15 14 /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x549689) [0x564795a97689]
22:16:15 15 /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x549b88) [0x564795a97b88]
22:16:15 16 /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x50fe65) [0x564795a5de65]
22:16:15 17 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f275edd5d90]
22:16:15 18 /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7f275edd5e40]
22:16:15 19 /opt/nvidia/src/ucc/build/test/gtest/gtest(+0x5216d5) [0x564795a6f6d5]
bot:retest
bot:retest
What
Implementation of the Bruck algorithm for the allgather collective.
Why?
This algorithm needs O(log(N)) communication steps for N ranks and performs well on small (1-2 KB) messages (according to this research: https://arxiv.org/pdf/2109.08751.pdf)
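
For readers unfamiliar with the pattern, here is a minimal sketch of the textbook Bruck allgather, simulated in a single Python process. It is not the code in this PR: the function name `bruck_allgather` and the list-based "ranks" are illustrative assumptions, and the real implementation in `tl/ucp` works on raw buffers with non-blocking sends and receives. The sketch only shows why the step count grows as ceil(log2(N)) and why a final local rotation is needed.

```python
def bruck_allgather(blocks):
    """Simulate a Bruck allgather (hypothetical helper, for illustration only).

    blocks[r] is the single block contributed by rank r.
    Returns (per-rank result lists, number of communication steps).
    """
    n = len(blocks)
    # Each rank starts with only its own block at the front of its buffer.
    bufs = [[b] for b in blocks]
    steps = 0
    dist = 1
    while dist < n:
        # In the last step fewer than `dist` blocks may still be missing.
        count = min(dist, n - dist)
        # All ranks exchange at once, so snapshot the outgoing data first.
        sent = [buf[:count] for buf in bufs]
        for r in range(n):
            # Rank r receives from rank (r + dist) mod n
            # (and, symmetrically, sends to rank (r - dist) mod n).
            src = (r + dist) % n
            bufs[r].extend(sent[src])
        dist *= 2
        steps += 1
    # Rank r now holds blocks r, r+1, ..., (r+n-1) mod n.
    # Rotate locally so that block j ends up at index j.
    result = [[buf[(j - r) % n] for j in range(n)] for r, buf in enumerate(bufs)]
    return result, steps


if __name__ == "__main__":
    # 13 ranks, matching the -np 13 perf run above.
    result, steps = bruck_allgather(list(range(13)))
    print(steps)  # 4 steps, i.e. ceil(log2(13))
```

The step count doubles the amount of gathered data each round, which is where the logarithmic bound comes from; the price is the extra local rotation at the end, which a real implementation may handle with a scratch buffer.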