open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

OMPI 5.0.x MTT Failure on AWS without EFA (no libfabric): ibm/collective/allreduce_in_place #11480

Closed: a-szegel closed this issue 1 year ago

a-szegel commented 1 year ago

Background information

I was able to reproduce the allreduce_in_place segfault for Open MPI 5.0.x without using EFA (no libfabric). The test passes on a single node, but starts segfaulting as soon as the run is both multi-node and uses shared memory within a node: 2 ranks on 1 node pass, 2 ranks on 2 nodes pass, but 3 ranks on 2 nodes fail. A minimal reproducer sketch follows.
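For reference, a reproducer of roughly this shape (not the actual ibm/collective/allreduce_in_place source from ompi-tests; the element count, datatype, and op are taken from the backtraces below, and the initialization/verification here is my own):

```c
/*
 * Minimal reproducer sketch: approximates ibm/collective/allreduce_in_place.
 * The element count (100000), datatype (MPI_INT), and op (MPI_SUM) are taken
 * from the backtraces below; the real test's initialization and checks may
 * differ.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define COUNT 100000

int main(int argc, char **argv)
{
    int rank, size, i, errs = 0;
    int *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    buf = malloc(COUNT * sizeof(int));
    for (i = 0; i < COUNT; ++i) {
        buf[i] = rank + i;
    }

    /* In-place allreduce: the send buffer is MPI_IN_PLACE, which Open MPI
     * defines as (void *) 1 -- the sbuf=0x1 seen in the backtraces. */
    MPI_Allreduce(MPI_IN_PLACE, buf, COUNT, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    /* Each rank contributed rank + i, so the sum is size*i + size*(size-1)/2. */
    for (i = 0; i < COUNT; ++i) {
        if (buf[i] != size * i + (size * (size - 1)) / 2) {
            ++errs;
        }
    }

    printf("rank %d: %d errors\n", rank, errs);
    free(buf);
    MPI_Finalize();
    return errs != 0;
}
```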

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v5.0.x, c5fe4aa9a623a86f2e1da9dfae3d2dbdffe0de40

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

git clone ompi
git checkout v5.0.x
git submodule update --init --recursive
./autogen.pl && ./configure --enable-debug --prefix=/home/ec2-user/ompi/install && make -j install

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

$ git submodule status
 7f6f8db13b42916b27b690b8a3f9e2757ec1417f 3rd-party/openpmix (v4.2.3-8-g7f6f8db1)
 c7b2c715f92495637c298249deb5493e86864ac8 3rd-party/prrte (v3.0.1rc1-36-gc7b2c715)
 237ceff1a8ed996d855d69f372be9aaea44919ea config/oac (237ceff1)

Please describe the system on which you are running


Details of the problem

$HOME/ompi/install/bin/mpirun -np 3 -N 2 -hostfile /home/ec2-user/PortaFiducia/hostfile /home/ec2-user/ompi-tests/ibm/collective/allreduce_in_place
Warning: Permanently added 'queue-c5n18xlarge-st-c5n18xlarge-4,10.0.2.137' (ECDSA) to the list of known hosts.
Warning: Permanently added 'queue-c5n18xlarge-st-c5n18xlarge-1,10.0.2.69' (ECDSA) to the list of known hosts.
Warning: Permanently added 'queue-c5n18xlarge-st-c5n18xlarge-3,10.0.2.128' (ECDSA) to the list of known hosts.
Warning: Permanently added 'queue-c5n18xlarge-st-c5n18xlarge-2,10.0.2.168' (ECDSA) to the list of known hosts.
[queue-c5n18xlarge-st-c5n18xlarge-4:12841] *** Process received signal ***
[queue-c5n18xlarge-st-c5n18xlarge-4:12841] Signal: Segmentation fault (11)
[queue-c5n18xlarge-st-c5n18xlarge-4:12841] Signal code: Address not mapped (1)
[queue-c5n18xlarge-st-c5n18xlarge-4:12841] Failing at address: 0x10001
[queue-c5n18xlarge-st-c5n18xlarge-1:13137] *** Process received signal ***
[queue-c5n18xlarge-st-c5n18xlarge-1:13137] Signal: Segmentation fault (11)
[queue-c5n18xlarge-st-c5n18xlarge-1:13137] Signal code: Address not mapped (1)
[queue-c5n18xlarge-st-c5n18xlarge-1:13137] Failing at address: 0x10001
[queue-c5n18xlarge-st-c5n18xlarge-4:12841] [ 0] /lib64/libpthread.so.0(+0x118e0)[0x7fb076ffa8e0]
[queue-c5n18xlarge-st-c5n18xlarge-4:12841] [ 1] [queue-c5n18xlarge-st-c5n18xlarge-1:13137] [ 0] /lib64/libc.so.6(+0x14dbcf)[0x7fb076d89bcf]
[queue-c5n18xlarge-st-c5n18xlarge-4:12841] [ 2] /lib64/libpthread.so.0(+0x118e0)[0x7f57b63098e0]
[queue-c5n18xlarge-st-c5n18xlarge-1:13137] [ 1] /home/ec2-user/ompi/install/lib/libopen-pal.so.80(+0x72600)[0x7fb07675f600]
[queue-c5n18xlarge-st-c5n18xlarge-4:12841] [ 3] /lib64/libc.so.6(+0x14dbcf)[0x7f57b6098bcf]
[queue-c5n18xlarge-st-c5n18xlarge-1:13137] [ 2] /home/ec2-user/ompi/install/lib/libopen-pal.so.80(+0x73081)[0x7fb076760081]
[queue-c5n18xlarge-st-c5n18xlarge-4:12841] [ 4] /home/ec2-user/ompi/install/lib/libopen-pal.so.80(+0xc16ca)[0x7f57b5abd6ca]
[queue-c5n18xlarge-st-c5n18xlarge-1:13137] [ 3] /home/ec2-user/ompi/install/lib/libopen-pal.so.80(opal_datatype_copy_content_same_ddt+0x109)[0x7fb076761c50]
[queue-c5n18xlarge-st-c5n18xlarge-4:12841] [ 5] /home/ec2-user/ompi/install/lib/libmpi.so.40(+0x2eb3ad)[0x7f57b68013ad]
[queue-c5n18xlarge-st-c5n18xlarge-1:13137] [ 4] /home/ec2-user/ompi/install/lib/libmpi.so.40(+0x1b60a1)[0x7fb0773bd0a1]
[queue-c5n18xlarge-st-c5n18xlarge-4:12841] [ 6] /home/ec2-user/ompi/install/lib/libmpi.so.40(mca_pml_ob1_send_request_schedule_once+0x2cf)[0x7f57b680466f]
[queue-c5n18xlarge-st-c5n18xlarge-1:13137] [ 5] /home/ec2-user/ompi/install/lib/libmpi.so.40(mca_coll_self_reduce_intra+0x47)[0x7fb0773bd131]
[queue-c5n18xlarge-st-c5n18xlarge-4:12841] [ 7] /home/ec2-user/ompi/install/lib/libmpi.so.40(+0x2e0715)[0x7f57b67f6715]
[queue-c5n18xlarge-st-c5n18xlarge-1:13137] [ 6] /home/ec2-user/ompi/install/lib/libmpi.so.40(+0x1f6267)[0x7fb0773fd267]
[queue-c5n18xlarge-st-c5n18xlarge-4:12841] [ 8] /home/ec2-user/ompi/install/lib/libmpi.so.40(+0x2e0776)[0x7f57b67f6776]
[queue-c5n18xlarge-st-c5n18xlarge-1:13137] [ 7] /home/ec2-user/ompi/install/lib/libmpi.so.40(mca_pml_ob1_recv_frag_callback_ack+0x33a)[0x7f57b67f8829]
[queue-c5n18xlarge-st-c5n18xlarge-1:13137] [ 8] /home/ec2-user/ompi/install/lib/libmpi.so.40(+0x1f41fc)[0x7fb0773fb1fc]
[queue-c5n18xlarge-st-c5n18xlarge-4:12841] [ 9] /home/ec2-user/ompi/install/lib/libopen-pal.so.80(mca_btl_sm_poll_handle_frag+0x19b)[0x7f57b5ac0eb1]
[queue-c5n18xlarge-st-c5n18xlarge-1:13137] [ 9] /home/ec2-user/ompi/install/lib/libmpi.so.40(mca_coll_han_allreduce_intra+0x1738)[0x7fb0773fca17]
[queue-c5n18xlarge-st-c5n18xlarge-4:12841] [10] /home/ec2-user/ompi/install/lib/libopen-pal.so.80(+0xc30f3)[0x7f57b5abf0f3]
[queue-c5n18xlarge-st-c5n18xlarge-1:13137] [10] /home/ec2-user/ompi/install/lib/libmpi.so.40(mca_coll_han_allreduce_intra_dynamic+0x3bc)[0x7fb077409c43]
[queue-c5n18xlarge-st-c5n18xlarge-4:12841] [11] /home/ec2-user/ompi/install/lib/libopen-pal.so.80(+0xc520f)[0x7f57b5ac120f]
[queue-c5n18xlarge-st-c5n18xlarge-1:13137] [11] /home/ec2-user/ompi/install/lib/libopen-pal.so.80(opal_progress+0x30)[0x7f57b5a1ed46]
[queue-c5n18xlarge-st-c5n18xlarge-1:13137] [12] /home/ec2-user/ompi/install/lib/libmpi.so.40(+0x2d9654)[0x7f57b67ef654]
[queue-c5n18xlarge-st-c5n18xlarge-1:13137] [13] /home/ec2-user/ompi/install/lib/libmpi.so.40(PMPI_Allreduce+0x403)[0x7fb0772ca6ec]
[queue-c5n18xlarge-st-c5n18xlarge-4:12841] [12] /home/ec2-user/ompi-tests/ibm/collective/allreduce_in_place[0x400e6d]
[queue-c5n18xlarge-st-c5n18xlarge-4:12841] [13] /lib64/libc.so.6(__libc_start_main+0xea)[0x7fb076c5d13a]
[queue-c5n18xlarge-st-c5n18xlarge-4:12841] [14] /home/ec2-user/ompi-tests/ibm/collective/allreduce_in_place[0x400cfa]
/home/ec2-user/ompi/install/lib/libmpi.so.40(mca_pml_ob1_send+0x5ea)[0x7f57b67f21cf]
[queue-c5n18xlarge-st-c5n18xlarge-1:13137] [14] [queue-c5n18xlarge-st-c5n18xlarge-4:12841] *** End of error message ***
/home/ec2-user/ompi/install/lib/libmpi.so.40(ompi_coll_base_reduce_intra_basic_linear+0x90)[0x7f57b66a1bbc]
[queue-c5n18xlarge-st-c5n18xlarge-1:13137] [15] /home/ec2-user/ompi/install/lib/libmpi.so.40(ompi_coll_tuned_reduce_intra_do_this+0xd4)[0x7f57b66c6b59]
[queue-c5n18xlarge-st-c5n18xlarge-1:13137] [16] /home/ec2-user/ompi/install/lib/libmpi.so.40(ompi_coll_tuned_reduce_intra_dec_fixed+0x46e)[0x7f57b66bead8]
[queue-c5n18xlarge-st-c5n18xlarge-1:13137] [17] /home/ec2-user/ompi/install/lib/libmpi.so.40(+0x1f6267)[0x7f57b670c267]
[queue-c5n18xlarge-st-c5n18xlarge-1:13137] [18] /home/ec2-user/ompi/install/lib/libmpi.so.40(+0x1f41fc)[0x7f57b670a1fc]
[queue-c5n18xlarge-st-c5n18xlarge-1:13137] [19] /home/ec2-user/ompi/install/lib/libmpi.so.40(mca_coll_han_allreduce_intra+0x1738)[0x7f57b670ba17]
[queue-c5n18xlarge-st-c5n18xlarge-1:13137] [20] /home/ec2-user/ompi/install/lib/libmpi.so.40(mca_coll_han_allreduce_intra_dynamic+0x3bc)[0x7f57b6718c43]
[queue-c5n18xlarge-st-c5n18xlarge-1:13137] [21] /home/ec2-user/ompi/install/lib/libmpi.so.40(PMPI_Allreduce+0x403)[0x7f57b65d96ec]
[queue-c5n18xlarge-st-c5n18xlarge-1:13137] [22] /home/ec2-user/ompi-tests/ibm/collective/allreduce_in_place[0x400e6d]
[queue-c5n18xlarge-st-c5n18xlarge-1:13137] [23] /lib64/libc.so.6(__libc_start_main+0xea)[0x7f57b5f6c13a]
[queue-c5n18xlarge-st-c5n18xlarge-1:13137] [24] /home/ec2-user/ompi-tests/ibm/collective/allreduce_in_place[0x400cfa]
[queue-c5n18xlarge-st-c5n18xlarge-1:13137] *** End of error message ***
[queue-c5n18xlarge-st-c5n18xlarge-1:00000] *** An error occurred in Socket closed
[queue-c5n18xlarge-st-c5n18xlarge-1:00000] *** reported by process [942473217,0]
[queue-c5n18xlarge-st-c5n18xlarge-1:00000] *** on a NULL communicator
[queue-c5n18xlarge-st-c5n18xlarge-1:00000] *** Unknown error
[queue-c5n18xlarge-st-c5n18xlarge-1:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[queue-c5n18xlarge-st-c5n18xlarge-1:00000] ***    and MPI will try to terminate your MPI job as well)
--------------------------------------------------------------------------
prterun noticed that process rank 2 with PID 0 on node queue-c5n18xlarge-st-c5n18xlarge-4 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Core 1:

(gdb) bt
#0  0x00007fce3df7ebcf in __memmove_avx_unaligned_erms () from /lib64/libc.so.6
#1  0x00007fce3d954600 in opal_datatype_accelerator_memcpy (dest=0x158eec0, src=0x10001, size=65536) at opal_datatype_copy.c:72
#2  0x00007fce3d955081 in non_overlap_accelerator_copy_content_same_ddt (datatype=0x603100 <ompi_mpi_int>, count=16384, destination_base=0x158eec0 "", 
    source_base=0x10001 <error: Cannot access memory at address 0x10001>) at opal_datatype_copy.h:151
#3  0x00007fce3d956c50 in opal_datatype_copy_content_same_ddt (datatype=0x603100 <ompi_mpi_int>, count=16384, destination_base=0x158eec0 "", 
    source_base=0x10001 <error: Cannot access memory at address 0x10001>) at opal_datatype_copy.c:160
#4  0x00007fce3e5b20a1 in ompi_datatype_copy_content_same_ddt (type=0x603100 <ompi_mpi_int>, count=16384, pDestBuf=0x158eec0 "", 
    pSrcBuf=0x10001 <error: Cannot access memory at address 0x10001>) at ../../../../ompi/datatype/ompi_datatype.h:288
#5  0x00007fce3e5b2131 in mca_coll_self_reduce_intra (sbuf=0x10001, rbuf=0x158eec0, count=16384, dtype=0x603100 <ompi_mpi_int>, op=0x603500 <ompi_mpi_op_sum>, root=0, 
    comm=0x15ee600, module=0x15fa270) at coll_self_reduce.c:44
#6  0x00007fce3e5f2267 in mca_coll_han_allreduce_t1_task (task_args=0x15e1550) at coll_han_allreduce.c:265
#7  0x00007fce3e5f01fc in issue_task (t=0x1600b70) at coll_han_trigger.h:55
#8  0x00007fce3e5f1a17 in mca_coll_han_allreduce_intra (sbuf=0x1, rbuf=0x157eec0, count=100000, dtype=0x603100 <ompi_mpi_int>, op=0x603500 <ompi_mpi_op_sum>, 
    comm=0x603300 <ompi_mpi_comm_world>, module=0x157d0b0) at coll_han_allreduce.c:164
#9  0x00007fce3e5fec43 in mca_coll_han_allreduce_intra_dynamic (sbuf=0x1, rbuf=0x157eec0, count=100000, dtype=0x603100 <ompi_mpi_int>, op=0x603500 <ompi_mpi_op_sum>, 
    comm=0x603300 <ompi_mpi_comm_world>, module=0x157d0b0) at coll_han_dynamic.c:704
#10 0x00007fce3e4bf6ec in PMPI_Allreduce (sendbuf=0x1, recvbuf=0x157eec0, count=100000, datatype=0x603100 <ompi_mpi_int>, op=0x603500 <ompi_mpi_op_sum>, 
    comm=0x603300 <ompi_mpi_comm_world>) at allreduce.c:123
#11 0x0000000000400e6d in main (argc=1, argv=0x7ffdad5df098) at allreduce_in_place.c:65

Core 2:

#0  0x00007f7fde60bbcf in __memmove_avx_unaligned_erms () from /lib64/libc.so.6
#1  0x00007f7fde0306ca in sm_prepare_src (btl=0x7f7fde296de0 <mca_btl_sm>, endpoint=0x11b01a0, convertor=0x11aece0, order=255 '\377', reserve=32, size=0x7ffe914692c8, flags=70)
    at btl_sm_module.c:489
#2  0x00007f7fded743ad in mca_bml_base_prepare_src (bml_btl=0x11b26a0, conv=0x11aece0, order=255 '\377', reserve=32, size=0x7ffe914692c8, flags=70, des=0x7ffe914692d0)
    at ../../../../ompi/mca/bml/bml.h:339
#3  0x00007f7fded7766f in mca_pml_ob1_send_request_schedule_once (sendreq=0x11aec00) at pml_ob1_sendreq.c:1179
#4  0x00007f7fded69715 in mca_pml_ob1_send_request_schedule_exclusive (sendreq=0x11aec00) at pml_ob1_sendreq.h:327
#5  0x00007f7fded69776 in mca_pml_ob1_send_request_schedule (sendreq=0x11aec00) at pml_ob1_sendreq.h:351
#6  0x00007f7fded6b829 in mca_pml_ob1_recv_frag_callback_ack (btl=0x7f7fde296de0 <mca_btl_sm>, descriptor=0x7ffe914693e0) at pml_ob1_recvfrag.c:772
#7  0x00007f7fde033eb1 in mca_btl_sm_poll_handle_frag (hdr=0x7f7fd8e75e00, endpoint=0x11b01a0) at btl_sm_component.c:452
#8  0x00007f7fde0320f3 in mca_btl_sm_check_fboxes () at ../../../../opal/mca/btl/sm/btl_sm_fbox.h:283
#9  0x00007f7fde03420f in mca_btl_sm_component_progress () at btl_sm_component.c:553
#10 0x00007f7fddf91d46 in opal_progress () at runtime/opal_progress.c:224
#11 0x00007f7fded62654 in ompi_request_wait_completion (req=0x11aec00) at ../../../../ompi/request/request.h:492
#12 0x00007f7fded651cf in mca_pml_ob1_send (buf=0x10001, count=16384, datatype=0x603100 <ompi_mpi_int>, dst=0, tag=-21, sendmode=MCA_PML_BASE_SEND_STANDARD, comm=0x121e080)
    at pml_ob1_isend.c:327
#13 0x00007f7fdec14bbc in ompi_coll_base_reduce_intra_basic_linear (sbuf=0x10001, rbuf=0x11cb720, count=16384, dtype=0x603100 <ompi_mpi_int>, op=0x603500 <ompi_mpi_op_sum>, 
    root=0, comm=0x121e080, module=0x122d650) at base/coll_base_reduce.c:663
#14 0x00007f7fdec39b59 in ompi_coll_tuned_reduce_intra_do_this (sbuf=0x10001, rbuf=0x11cb720, count=16384, dtype=0x603100 <ompi_mpi_int>, op=0x603500 <ompi_mpi_op_sum>, root=0, 
    comm=0x121e080, module=0x122d650, algorithm=1, faninout=0, segsize=0, max_requests=0) at coll_tuned_reduce_decision.c:161
#15 0x00007f7fdec31ad8 in ompi_coll_tuned_reduce_intra_dec_fixed (sendbuf=0x10001, recvbuf=0x11cb720, count=16384, datatype=0x603100 <ompi_mpi_int>, 
    op=0x603500 <ompi_mpi_op_sum>, root=0, comm=0x121e080, module=0x122d650) at coll_tuned_decision_fixed.c:811
#16 0x00007f7fdec7f267 in mca_coll_han_allreduce_t1_task (task_args=0x121e380) at coll_han_allreduce.c:265
#17 0x00007f7fdec7d1fc in issue_task (t=0x1232ea0) at coll_han_trigger.h:55
#18 0x00007f7fdec7ea17 in mca_coll_han_allreduce_intra (sbuf=0x1, rbuf=0x11bb720, count=100000, dtype=0x603100 <ompi_mpi_int>, op=0x603500 <ompi_mpi_op_sum>, 
    comm=0x603300 <ompi_mpi_comm_world>, module=0x11b96e0) at coll_han_allreduce.c:164
#19 0x00007f7fdec8bc43 in mca_coll_han_allreduce_intra_dynamic (sbuf=0x1, rbuf=0x11bb720, count=100000, dtype=0x603100 <ompi_mpi_int>, op=0x603500 <ompi_mpi_op_sum>, 
    comm=0x603300 <ompi_mpi_comm_world>, module=0x11b96e0) at coll_han_dynamic.c:704
#20 0x00007f7fdeb4c6ec in PMPI_Allreduce (sendbuf=0x1, recvbuf=0x11bb720, count=100000, datatype=0x603100 <ompi_mpi_int>, op=0x603500 <ompi_mpi_op_sum>, 
    comm=0x603300 <ompi_mpi_comm_world>) at allreduce.c:123
#21 0x0000000000400e6d in main (argc=1, argv=0x7ffe91469d08) at allreduce_in_place.c:65
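
For what it's worth, the sbuf=0x1 in both backtraces is Open MPI's MPI_IN_PLACE sentinel (mpi.h defines it as (void *) 1), and the faulting address 0x10001 is that sentinel plus the 65536-byte chunk visible in the memcpy frame. That suggests the in-place sentinel is being forwarded into the node-level reduce instead of being resolved first. A minimal sketch of the usual resolution, with a hypothetical function name that does not correspond to the coll/han source:

```c
/*
 * Illustrative only: not the coll/han source, and the function name is
 * hypothetical. This is the usual way a collective component resolves
 * MPI_IN_PLACE before handing buffers to a lower-level reduce.
 */
#include <mpi.h>

static void example_resolve_in_place(const void **sbuf_p, void *rbuf)
{
    /* Open MPI defines MPI_IN_PLACE as (void *) 1. If the sentinel is
     * forwarded unmodified, segment offsets yield addresses such as
     * 0x1 + 65536 = 0x10001, matching "Failing at address" above. The
     * usual handling substitutes the receive buffer, which already holds
     * the local contribution. */
    if (MPI_IN_PLACE == *sbuf_p) {
        *sbuf_p = rbuf;
    }
}
```

The key point is that the substitution has to happen before any pointer arithmetic is applied to the send buffer.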
a-szegel commented 1 year ago

I got re-prioritized, so I can't work on this right now, but I will pick it back up when I can.

wzamazon commented 1 year ago

duplicate of https://github.com/open-mpi/ompi/issues/11473