openucx / ucc

Unified Collective Communication Library
https://openucx.github.io/ucc/
BSD 3-Clause "New" or "Revised" License

segfault when nranks>=18 #981

Closed: angainor closed this issue 1 month ago

angainor commented 1 month ago

Hi, I am compiling OpenMPI 4.1.6 against UCC and UCX shipped with HPCX 2.17.1. Running any OSU collective benchmark (e.g. osu_alltoall) results in a segfault in UCC when I start more than 17 ranks. This is what I get:

[b1331:73064:0:73064] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))

# OSU MPI All-to-All Personalized Exchange Latency Test v7.4
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)

/cluster/home/marcink/src/openmpi-4.1.6/ompi/mca/coll/ucc/../../../../opal/threads/thread_usage.h: [ opal_thread_add_fetch_32() ]
      ...
      149     return old;                                                         \
      150 }
      151 
==>   152 OPAL_THREAD_DEFINE_ATOMIC_OP(int32_t, add, +, 32)
      153 OPAL_THREAD_DEFINE_ATOMIC_OP(size_t, add, +, size_t)
      154 OPAL_THREAD_DEFINE_ATOMIC_OP(int32_t, and, &, 32)
      155 OPAL_THREAD_DEFINE_ATOMIC_OP(int32_t, or, |, 32)

==== backtrace (tid:  73064) ====
 0 0x0000000000003ef7 opal_thread_add_fetch_32()  /cluster/home/marcink/src/openmpi-4.1.6/ompi/mca/coll/ucc/../../../../opal/threads/thread_usage.h:152
 1 0x0000000000003ef7 opal_obj_update()  /cluster/home/marcink/src/openmpi-4.1.6/ompi/mca/coll/ucc/../../../../opal/class/opal_object.h:534
 2 0x0000000000003ef7 mca_coll_ucc_module_destruct()  /cluster/home/marcink/src/openmpi-4.1.6/ompi/mca/coll/ucc/coll_ucc_module.c:72
 3 0x0000000000088eab opal_obj_run_destructors()  /cluster/home/marcink/src/openmpi-4.1.6/ompi/mca/coll/../../../opal/class/opal_object.h:483
 4 0x0000000000088eab mca_coll_base_comm_select()  /cluster/home/marcink/src/openmpi-4.1.6/ompi/mca/coll/base/coll_base_comm_select.c:229
 5 0x00000000000d1a6f ompi_mpi_init()  /cluster/home/marcink/src/openmpi-4.1.6/ompi/runtime/ompi_mpi_init.c:958
 6 0x00000000000779dd PMPI_Init()  /cluster/home/marcink/src/openmpi-4.1.6/ompi/mpi/c/profile/pinit.c:67
 7 0x0000000000406aa9 omb_mpi_init()  /cluster/home/marcink/src/osu-micro-benchmarks-7.4/c/mpi/pt2pt/standard/../../../util/osu_util_mpi.c:659
 8 0x00000000004025eb main()  /cluster/home/marcink/src/osu-micro-benchmarks-7.4/c/mpi/collective/blocking/osu_alltoall.c:53
 9 0x0000000000022545 __libc_start_main()  ???:0
10 0x00000000004031d3 _start()  ???:0
=================================
[b1331:73064] *** Process received signal ***
[b1331:73064] Signal: Segmentation fault (11)
[b1331:73064] Signal code:  (-6)
[b1331:73064] Failing at address: 0xc8fd00011d68
[b1331:73064] [ 0] /lib64/libpthread.so.0(+0xf630)[0x2b6717414630]
[b1331:73064] [ 1] /cluster/home/marcink/software/openmpi/4.1.6-ucc-2.17.1/lib/openmpi/mca_coll_ucc.so(+0x3ef7)[0x2b6722322ef7]
[b1331:73064] [ 2] /cluster/home/marcink/software/openmpi/4.1.6-ucc-2.17.1/lib/libmpi.so.40(mca_coll_base_comm_select+0x267b)[0x2b671729ceab]
[b1331:73064] [ 3] /cluster/home/marcink/software/openmpi/4.1.6-ucc-2.17.1/lib/libmpi.so.40(ompi_mpi_init+0xeff)[0x2b67172e5a6f]
[b1331:73064] [ 4] /cluster/home/marcink/software/openmpi/4.1.6-ucc-2.17.1/lib/libmpi.so.40(MPI_Init+0x5d)[0x2b671728b9dd]
[b1331:73064] [ 5] ./osu_alltoall[0x406aa9]
[b1331:73064] [ 6] ./osu_alltoall[0x4025eb]
[b1331:73064] [ 7] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2b6717b6a545]
[b1331:73064] [ 8] ./osu_alltoall[0x4031d3]
[b1331:73064] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 15 with PID 0 on node b1331 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

I also added verbosity to the ucc mca component and see the following messages when OpenMPI initializes (not sure if this is relevant):

[b1331.betzy.sigma2.no:73064] coll_ucc_module.c:404 - mca_coll_ucc_module_enable() creating ucc_team for comm 0x4265e0, comm_id 0, comm_size 18
[b1331.betzy.sigma2.no:73064] Error: coll_ucc_module.c:408 - mca_coll_ucc_module_enable() mca_coll_ucc_save_coll_handlers failed

Does anyone have any ideas about what the problem might be?

Thanks!
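For reference, the build was configured roughly along these lines (the HPC-X paths below are placeholders, not the exact ones used):

./configure --prefix=/cluster/home/marcink/software/openmpi/4.1.6-ucc-2.17.1 \
    --with-ucx=/path/to/hpcx-2.17.1/ucx \
    --with-ucc=/path/to/hpcx-2.17.1/ucc
make -j && make install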

Sergei-Lebedev commented 1 month ago

mca_coll_ucc_save_coll_handlers fails only if there is no other collective component to fall back to. Can you please share the command line you use to run this test?

angainor commented 1 month ago

This was tested by

mpirun -mca coll_hcoll_enable 0 -mca coll_ucc_enable 1 -mca coll_ucc_verbose 100 ./osu_alltoall

but apart from hcoll, the standard tuned collectives are still available, so I think there is a fallback.
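To confirm that a fallback is actually available, the installed coll components and their priorities can be listed with ompi_info (a generic sanity check):

ompi_info --param coll all --level 9 | grep priority

tuned and basic should show up there alongside ucc and hcoll.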

angainor commented 1 month ago

Also, the test works when I run on fewer than 18 ranks.

Sergei-Lebedev commented 1 month ago

Can you please rerun adding --mca coll_ucc_priority 100? It might be related to this issue: https://github.com/open-mpi/ompi/issues/9885
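For example, based on the command line above:

mpirun -mca coll_hcoll_enable 0 -mca coll_ucc_enable 1 -mca coll_ucc_priority 100 -mca coll_ucc_verbose 100 ./osu_alltoall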

angainor commented 1 month ago

Well, that seems to do the trick for me! Here are some sample tests for osu_allreduce on 512 ranks / 4 compute nodes / HDR100 / HPCX 2.17.1. UCC still seems quite a bit slower than HCOLL on this system. Is there anything I could tune here to get better results? Note that I set UCX_PROTO_ENABLE=n because otherwise HCOLL suffers (https://github.com/openucx/ucx/issues/9914), while this setting seems to have little impact on UCC.

HCOLL

mpirun -x UCX_PROTO_ENABLE=n -H b3115:128,b3142:128,b3331:128,b4132:128 -mca coll_ucc_priority 100 -mca coll_ucc_enable 0 -mca coll_hcoll_enable 1 ./osu_allreduce 

# OSU MPI Allreduce Latency Test v7.4
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)
1                       7.98
2                       7.40
4                       7.66
8                       7.70
16                      7.90
32                      8.93
64                      7.77
128                     8.10
256                    10.13
512                    10.02
1024                   11.73
2048                   14.36
4096                   19.15
8192                   28.63
16384                  47.27
32768                 498.13
65536                 152.94
131072                235.62
262144                443.80
524288                752.25
1048576              1519.43

UCC

mpirun -x UCX_PROTO_ENABLE=n -H b3115:128,b3142:128,b3331:128,b4132:128 -mca coll_ucc_priority 100 -mca coll_ucc_enable 1 -mca coll_hcoll_enable 0 ./osu_allreduce 

# OSU MPI Allreduce Latency Test v7.4
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)
1                      22.92
2                      20.61
4                      20.22
8                      20.19
16                     20.18
32                     20.80
64                     21.87
128                    25.90
256                    28.11
512                    36.36
1024                   56.27
2048                  104.75
4096                   48.22
8192                   58.38
16384                  79.96
32768                 155.20
65536                 265.00
131072                419.92
262144                771.05
524288               1541.54
1048576              3233.63

Sergei-Lebedev commented 1 month ago

There was an issue with the MPI_CHAR datatype and reductions in HPCX 2.17; I think the fix was included in HPCX 2.18 (https://github.com/openucx/ucc/pull/918). For other cases, @nsarka is working on allreduce performance improvements.

angainor commented 1 month ago

Thanks a lot! I'll run more perf tests.
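A sketch of how the next runs could show which CL/TL and algorithm UCC actually picks; the UCC environment variable names (UCC_LOG_LEVEL, UCC_COLL_TRACE) are from memory and should be double-checked against the UCC documentation:

mpirun -x UCX_PROTO_ENABLE=n -x UCC_LOG_LEVEL=info -x UCC_COLL_TRACE=info -H b3115:128,b3142:128,b3331:128,b4132:128 -mca coll_ucc_priority 100 -mca coll_ucc_enable 1 -mca coll_hcoll_enable 0 ./osu_allreduce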