After building this project, I tried to run a test and hit this bug.
When I run mpirun --allow-run-as-root -np 2 ./test/mscclpp-test/allreduce_test_perf -b 3m -e 48m -G 100 -n 100 -w 20 -f 2 -k 5, I get:
# minBytes 3145728 maxBytes 50331648 step: 2(factor) warmup iters: 20 iters: 100 validation: 1 graph: 100 kernel num: 5
#
# Using devices
# Rank 0 Pid 106967 on e53955024127 device 0 [0000:3A:00.0] NVIDIA RTX 6000 Ada Generation
# Rank 1 Pid 106968 on e53955024127 device 1 [0000:AD:00.0] NVIDIA RTX 6000 Ada Generation
[e53955024127:106967] *** Process received signal ***
[e53955024127:106967] Signal: Segmentation fault (11)
[e53955024127:106967] Signal code: Address not mapped (1)
[e53955024127:106967] Failing at address: 0x8
[e53955024127:106967] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7fbffe22e090]
#
# Initializing MSCCL++
[e53955024127:106967] [ 1] /usr/local/mscclpp/lib/libmscclpp.so.0(_ZN7mscclpp12TcpBootstrap14createUniqueIdEv+0xd)[0x7fbffea05fad]
[e53955024127:106967] [ 2] ./test/mscclpp-test/allreduce_test_perf(+0x2672d)[0x563c0facd72d]
[e53955024127:106967] [ 3] ./test/mscclpp-test/allreduce_test_perf(+0x2e488)[0x563c0fad5488]
[e53955024127:106967] [ 4] ./test/mscclpp-test/allreduce_test_perf(+0x19b96)[0x563c0fac0b96]
[e53955024127:106967] [ 5] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7fbffe20f083]
[e53955024127:106967] [ 6] ./test/mscclpp-test/allreduce_test_perf(+0x19e9e)[0x563c0fac0e9e]
[e53955024127:106967] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node e53955024127 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
I'm using the latest version of mscclpp.
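In case it helps with triage, here is a minimal sketch for reading the backtrace frames in the log above. The helper names (parse_frame, demangle_nested_name) are my own and are not part of mscclpp or Open MPI, and the demangler only handles the simple nested-name form that appears in this trace, not the full Itanium C++ ABI. Applied to frame [ 1], it gives mscclpp::TcpBootstrap::createUniqueId(). Together with the failing address 0x8, this may point to a null pointer being dereferenced inside that call, though I have not confirmed it.

```python
import re

# Matches Open MPI backtrace frames such as:
#   [host:pid] [ 1] /path/lib.so(symbol+0xd)[0x7f...]
# The symbol part may be empty, as in "(+0x43090)".
FRAME_RE = re.compile(
    r"\[\s*(?P<index>\d+)\]\s+"
    r"(?P<module>[^\s(]+)"
    r"\((?P<symbol>[^+)]*)\+(?P<offset>0x[0-9a-fA-F]+)\)"
    r"\[(?P<address>0x[0-9a-fA-F]+)\]"
)


def demangle_nested_name(mangled):
    """Demangle only the simple form _ZN<len><name>...Ev.

    Assumption: no templates, substitutions, or parameters.
    Returns None for anything outside that form.
    """
    if not mangled.startswith("_ZN"):
        return None
    pos = 3
    parts = []
    # Each component is a decimal length followed by that many characters.
    while pos < len(mangled) and mangled[pos].isdigit():
        end = pos
        while end < len(mangled) and mangled[end].isdigit():
            end += 1
        length = int(mangled[pos:end])
        parts.append(mangled[end:end + length])
        pos = end + length
    # "E" closes the nested name, "v" means an empty parameter list.
    if not parts or mangled[pos:] != "Ev":
        return None
    return "::".join(parts) + "()"


def parse_frame(line):
    """Split one backtrace line into its fields, or return None if it is not a frame."""
    match = FRAME_RE.search(line)
    if match is None:
        return None
    symbol = match.group("symbol")
    return {
        "index": int(match.group("index")),
        "module": match.group("module"),
        "symbol": symbol or None,
        "function": demangle_nested_name(symbol) if symbol else None,
        "offset": int(match.group("offset"), 16),
        "address": int(match.group("address"), 16),
    }


if __name__ == "__main__":
    line = (
        "[e53955024127:106967] [ 1] /usr/local/mscclpp/lib/libmscclpp.so.0"
        "(_ZN7mscclpp12TcpBootstrap14createUniqueIdEv+0xd)[0x7fbffea05fad]"
    )
    print(parse_frame(line))
```

Frames from the test binary itself carry no symbol name (only offsets such as +0x2672d), so resolving them would need a build with debug symbols.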