ofiwg / libfabric

Open Fabric Interfaces
http://libfabric.org/

Open MPI broken with rxm #7396

Closed aingerson closed 2 years ago

aingerson commented 2 years ago

PR #6940 broke rxm with Open MPI for some collectives, causing consistent hangs in some of the OSU benchmarks and intermittent hangs in others.

aingerson commented 2 years ago

@shefty Tracking this here

shefty commented 2 years ago

Can you provide more details, like which version of ompi, what tests, command line args, etc.?

wzamazon commented 2 years ago

I think @aingerson meant this https://github.com/open-mpi/ompi/issues/9876

aingerson commented 2 years ago

I've tried OMPI 4.0.1, 4.0.7, 4.1.0, and 4.1.2 with OFI 1.12, 1.13, 1.14, and upstream. Changing the OMPI version makes no difference. OFI 1.13 is OK and 1.14 breaks, and I narrowed it down specifically to the CM rework.

OMPI was configured like this: ./configure --prefix=<> --enable-mpi-fortran=no --disable-oshmem --with-libfabric=<>

libfabric was configured like this: ./configure --enable-debug=yes --enable-usnic=no --enable-psm=no --enable-psm2=no --enable-psm3=no --enable-verbs=yes --prefix=<>

I'm running the OSU benchmarks from here. Build as follows:

export CC=<ompi install>/bin/mpicc
export CXX=<ompi install>/bin/mpicxx
export CFLAGS="-I<osu src path>/util"
export LD_LIBRARY_PATH=<ompi install>/lib
./configure --prefix=<> && make -j4 && make install

Some of the hangs are intermittent, but the following test hangs for me every time. It is run as follows: mpirun -np 4 --host n15:2,n16:2 --mca mtl_ofi_provider_include "verbs;ofi_rxm" --mca btl_ofi_provider_include "verbs;ofi_rxm" --mca orte_base_help_aggregate 0 --mca mtl ofi --mca pml cm ./osu_allgather

@shefty you asked if this happens with only 2 ranks and I remember testing it and confirming that it does, but now I can't reproduce it. Not sure if that means it's a race condition that is far less frequent with 2 ranks and I was just lucky the first time, or if I tested it incorrectly... 4 is 100% consistent for me, though.

shefty commented 2 years ago

Does this impact tcp as well?

aingerson commented 2 years ago

With tcp I see similar behavior, though on osu_allgather I sometimes get a hang and sometimes a segfault:

# OSU MPI Allgather Latency Test v5.5
# Size       Avg Latency(us)
[n16:174409] *** Process received signal ***
[n16:174409] Signal: Segmentation fault (11)
[n16:174409] Signal code: Address not mapped (1)
[n16:174409] Failing at address: 0x20
[n16:174409] [ 0] /home/aingerson/workspace/install/libfabric/head/lib/libfabric.so.1(+0x101492)[0x7f670b751492]
[n16:174409] [ 1] /usr/lib64/libpthread.so.0(+0xf680)[0x7f671cccb680]
[n16:174409] [ 2] /home/aingerson/workspace/install/libfabric/head/lib/libfabric.so.1(+0xcf36e)[0x7f670b71f36e]
[n16:174409] [ 3] /home/aingerson/workspace/install/libfabric/head/lib/libfabric.so.1(+0xcfa88)[0x7f670b71fa88]
[n16:174409] [ 4] /home/aingerson/workspace/install/libfabric/head/lib/libfabric.so.1(+0xcfc38)[0x7f670b71fc38]
[n16:174409] [ 5] /home/aingerson/workspace/install/libfabric/head/lib/libfabric.so.1(+0x4cebe)[0x7f670b69cebe]
[n16:174409] [ 6] /home/aingerson/workspace/install/libfabric/head/lib/libfabric.so.1(+0x4c02c)[0x7f670b69c02c]
[n16:174409] [ 7] /home/aingerson/workspace/install/libfabric/head/lib/libfabric.so.1(+0x4c388)[0x7f670b69c388]
[n16:174409] [ 8] /home/aingerson/ompi/lib/openmpi/mca_mtl_ofi.so(ompi_mtl_ofi_progress_no_inline+0xa9)[0x7f670a3d1f29]
[n16:174409] [ 9] /home/aingerson/ompi/lib/libopen-pal.so.40(opal_progress+0x2c)[0x7f671c35da8c]
[n16:174409] [10] /home/aingerson/ompi/lib/libmpi.so.40(ompi_request_default_wait+0x125)[0x7f671d747915]
[n16:174409] [11] /home/aingerson/ompi/lib/libmpi.so.40(ompi_coll_base_sendrecv_actual+0xd1)[0x7f671d7981c1]
[n16:174409] [12] /home/aingerson/ompi/lib/libmpi.so.40(ompi_coll_base_allgather_intra_recursivedoubling+0x12c)[0x7f671d79670c]
[n16:174409] [13] /home/aingerson/ompi/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allgather_intra_dec_fixed+0x85)[0x7f670a1bf8d5]
[n16:174409] [14] /home/aingerson/ompi/lib/libmpi.so.40(MPI_Allgather+0x13f)[0x7f671d75a13f]
[n16:174409] [15] /home/aingerson/ompi/osu/libexec/osu-micro-benchmarks/mpi/collective/osu_allgather[0x401d07]
[n16:174409] [16] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f671c9113d5]
[n16:174409] [17] /home/aingerson/ompi/osu/libexec/osu-micro-benchmarks/mpi/collective/osu_allgather[0x4014d9]
[n16:174409] *** End of error message ***
shefty commented 2 years ago

I'm trying to reproduce this, but I'm currently stuck at the 'cannot build openmpi' stage.

https://github.com/open-mpi/ompi/issues/9920

shefty commented 2 years ago

@aingerson - can you provide a list of the exact commands that you're running to hit these errors?

aingerson commented 2 years ago
export FI_PROVIDER="verbs;ofi_rxm"
export PATH=<ompi_bin>:$PATH
export LD_LIBRARY_PATH=<libfabric_lib>:<ompi_lib>
mpirun -np 4 --host n15:2,n16:2 --mca mtl_ofi_provider_include "verbs;ofi_rxm" --mca btl_ofi_provider_include "verbs;ofi_rxm" --mca orte_base_help_aggregate 0 --mca mtl ofi --mca pml cm ./osu_allgather
shefty commented 2 years ago

Is this the same for tcp as well?

aingerson commented 2 years ago

yes

shefty commented 2 years ago

Backtrace for the reported segfault:

#2 0x00007ff9ab92af3b in rxm_handle_comp_error (rxm_ep=0x1fd0a10) at prov/rxm/src/rxm_cq.c:1724
#3 0x00007ff9ab92b655 in rxm_ep_do_progress (util_ep=0x1fd0a10) at prov/rxm/src/rxm_cq.c:1887
#4 0x00007ff9ab92b7f8 in rxm_ep_progress (util_ep=0x1fd0a10) at prov/rxm/src/rxm_cq.c:1921
#5 0x00007ff9ab8a8e41 in ofi_cq_progress (cq=0x1fcffe0) at prov/util/src/util_cq.c:567
#6 0x00007ff9ab8a7faf in ofi_cq_readfrom (cq_fid=0x1fcffe0, buf=0x22bd390, count=100, src_addr=0x0) at prov/util/src/util_cq.c:230
#7 0x00007ff9ab8a830b in ofi_cq_read (cq_fid=0x1fcffe0, buf=0x22bd390, count=100) at prov/util/src/util_cq.c:283
#8 0x00007ff9aa5e6f29 in ompi_mtl_ofi_progress_no_inline () from /home/aingerson/ompi/lib/openmpi/mca_mtl_ofi.so
#9 0x00007ff9bc670a8c in opal_progress () from /home/aingerson/ompi/lib/libopen-pal.so.40
#10 0x00007ff9bda5a915 in ompi_request_default_wait () from /home/aingerson/ompi/lib/libmpi.so.40
#11 0x00007ff9bdaab1c1 in ompi_coll_base_sendrecv_actual () from /home/aingerson/ompi/lib/libmpi.so.40
#12 0x00007ff9bdaa970c in ompi_coll_base_allgather_intra_recursivedoubling () from /home/aingerson/ompi/lib/libmpi.so.40
#13 0x00007ff9aa3d48d5 in ompi_coll_tuned_allgather_intra_dec_fixed () from /home/aingerson/ompi/lib/openmpi/mca_coll_tuned.so
#14 0x00007ff9bda6d13f in PMPI_Allgather () from /home/aingerson/ompi/lib/libmpi.so.40
#15 0x0000000000401d07 in main ()
shefty commented 2 years ago

additional backtrace data (provided by Alexia)

#0 tcpx_cq_report_error (cq=0x24fd090, xfer_entry=0x27f8f30, err=107) at prov/tcp/src/tcpx_cq.c:213
#1 0x00007f35caf4cee6 in tcpx_progress_tx (ep=0x286cf40) at prov/tcp/src/tcpx_progress.c:119
#2 0x00007f35caf4efe8 in tcpx_tx_queue_insert (ep=0x286cf40, tx_entry=0x27f8f30) at prov/tcp/src/tcpx_progress.c:819
#3 0x00007f35caf461da in tcpx_queue_send (ep=0x286cf40, tx_entry=0x27f8f30) at prov/tcp/src/tcpx_msg.c:264
#4 0x00007f35caf469f4 in tcpx_tinject (fid_ep=0x286cf40, buf=0x7f35c64b9002, len=1, dest_addr=0, tag=4294967286)
at prov/tcp/src/tcpx_msg.c:541
#5 0x00007f35caefe0f1 in fi_tinject (ep=0x286cf40, buf=0x7f35c64b9002, len=1, dest_addr=0, tag=4294967286)
at ./include/rdma/fi_tagged.h:134
#6 0x00007f35caf026d3 in rxm_msg_tinject (msg_ep=0x286cf40, buf=0x7f35c64b9002, len=1, cq_data=true, data=2, tag=4294967286)
at prov/rxm/src/rxm_ep.c:1312
#7 0x00007f35caf02804 in rxm_ep_inject_send (rxm_ep=0x24fda00, rxm_conn=0x27ef860, buf=0x7f35c64b9002, len=1)
at prov/rxm/src/rxm_ep.c:1329
#8 0x00007f35caf05169 in rxm_ep_tinjectdata (ep_fid=0x24fda00, buf=0x7f35c64b9002, len=1, data=2, dest_addr=1, tag=4294967286)
at prov/rxm/src/rxm_ep.c:2222

There is an issue where the tcp provider writes a CQ error entry for inject messages if the inject operation fails. This results in a NULL context being reported back to rxm. AFAICT, tcp has always had this issue. I suspect this is just a side effect of whatever is causing the connections to be destroyed.
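To make the failure mode concrete, here is a minimal sketch (not the actual rxm code) of how an error completion whose op_context is NULL, which is what a failed inject produces, can turn into the dereference seen in the trace above. handle_comp_error() and pending_tx are hypothetical stand-ins; fi_cq_readerr() and struct fi_cq_err_entry are the real libfabric API.

#include <sys/types.h>
#include <rdma/fi_eq.h>      /* struct fi_cq_err_entry, fi_cq_readerr() */

struct pending_tx;           /* hypothetical request type the handler expects */

static void handle_comp_error(struct fid_cq *cq)
{
    struct fi_cq_err_entry err_entry = {0};

    if (fi_cq_readerr(cq, &err_entry, 0) < 1)
        return;

    /* A failed inject has no application context, so op_context comes
     * back NULL.  Treating it as a request and reading a member would
     * fault at a small offset, consistent with the "Failing at address:
     * 0x20" line in the log above. */
    struct pending_tx *req = err_entry.op_context;
    (void) req;              /* real code would dereference req here */
}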

shefty commented 2 years ago

PR #7408 targets a fix for the segfault. The hang is still under debug.

shefty commented 2 years ago

Btw, the segfault occurs because an inject message failed to be transferred. An error entry was written to the CQ for this. Fixing the segfault means that the code is more likely to hang, as the target will never receive the message and MPI has no way of knowing that the message was lost. Of concern is why the underlying connection failed.
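For reference, the call in frames #4/#5 of the tcp backtrace above is fi_tinject(), which takes no context argument and generates no successful completion; once it returns, there is nothing a later transfer failure can be reported against. Below is a hypothetical wrapper that mirrors the parameters seen in the trace (len=1, tag=0xfffffff6); only fi_tinject() itself is real API.

#include <sys/types.h>
#include <rdma/fi_tagged.h>

/* Hypothetical helper; parameters match the trace.  No context is passed
 * and no successful completion is generated, so the application has no
 * way to learn that the message was lost if the connection later fails. */
static ssize_t inject_one_byte(struct fid_ep *ep, fi_addr_t dest)
{
    static const char payload = 1;

    return fi_tinject(ep, &payload, sizeof(payload), dest, 0xfffffff6);
}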

shefty commented 2 years ago

Debug analysis:

rank 1 sends an inject message to rank 3. That creates a connection from rank 1 to 3. The connection is established, and the inject message is transferred.

Meanwhile, rank 3 wants to connect to rank 1. It starts to connect but sees the incoming connection request from rank 1. Rank 1's connection is given priority, so rank 3 aborts its own connection attempt and accepts the request from rank 1.

Returning to rank 1, it now sees a connection request from rank 3. It assumes that its existing connection must be stale, so it closes it and instead accepts the request from rank 3. The inject message sent above now fails, but MPI doesn't care if it fails, so the completion gets dropped.

There is now no active connection between 1 and 3, and messages sent over the old connection may or may not have been received.

Somehow, rank 1 sees the connection accepted event first, followed by the connect request. So while rank 3 detects this as a simultaneous connection, rank 1 instead sees the two events as serialized.

shefty commented 2 years ago

The connection problem comes from this section of code in rxm_process_connreq():

    case RXM_CM_CONNECTED:
        FI_INFO(&rxm_prov, FI_LOG_EP_CTRL,
            "old connection accepting/done, replacing %p\n", conn);
        rxm_close_conn(conn);
        break;

This code was added to handle a case in DAOS where a client would crash, then try to reconnect to the server. The server needed to detect that the old connection was invalid and allow the new connection to proceed, so it closes the existing connection and accepts the new one.

However, this breaks the simultaneous connection case. Connection events may be reported out of order. This shows up more easily with the tcp provider, but verbs has a similar issue. The peer detects simultaneous connections, but locally the connections show up as serial. Locally, we discard the good connection, then try to accept the new request. But the peer has actually done the opposite: it maintains the good connection and discards its own request. There's a race where data can be transferred from the peer over the good connection before it's torn down. Tearing down the connection causes the transfer to fail, which generates a CQ error, which led to the segfault. The transfer is an inject, so there's no way to report an error. Fixing the segfault still hangs MPI.

Modifying the above code fixes the simultaneous connection problem, but will break DAOS again. I don't have a solution for this problem yet.
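For illustration only, one common way connection managers resolve the simultaneous-connect race is a deterministic tie-break that both peers can compute identically, for example comparing the two endpoint addresses. The sketch below is hypothetical and is not the fix that eventually landed; as noted above, a tie-break by itself does not distinguish the DAOS stale-client case.

#include <string.h>

enum connreq_action { KEEP_EXISTING_CONN, REPLACE_WITH_NEW_CONN };

/* Hypothetical sketch: both peers run the same comparison on the same
 * address pair, so exactly one side keeps the connection it initiated
 * and the other accepts the peer's request.  If the existing connection
 * is actually dead (the DAOS crash/reconnect case), keeping it would be
 * wrong, which is the unresolved conflict described above. */
static enum connreq_action
resolve_simultaneous_connreq(const void *local_addr, const void *peer_addr,
                             size_t addrlen)
{
    return memcmp(local_addr, peer_addr, addrlen) < 0 ?
           KEEP_EXISTING_CONN : REPLACE_WITH_NEW_CONN;
}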

The better news is that even hacking the above code to avoid the segfault still doesn't allow OMPI to work. MPI still hangs. I believe the issue there is that an unexpected message arrives, which is never matched with the correct local buffer. The reason for the mismatch is that the source address is not set correctly (shows as FI_ADDR_UNSPEC). Why that happens I don't know, but I suspect it's also related to connection setup racing.

shefty commented 2 years ago

Random prints added to the code:

[1,3]<stdout>:matching rx - addr: 0xffffffffffffffff tag: 0xfffffff6 
[1,3]<stdout>:unexpected - addr: 0xffffffffffffffff tag: 0xfffffff6 
[1,3]<stdout>:matching rx - addr: 0x0 tag: 0xfffffff6 
[1,3]<stdout>:searching unexpected queue - want addr: 0x3 tag: 0xfffffff6 
[1,3]<stdout>:searching unexpected queue - unexp addr: 0xffffffffffffffff tag: 0xfffffff6

This is showing that a message is received over a connection for an address that has not yet been inserted into the AV. Line 2 shows that the connection is associated with entry UNSPEC. The message is queued to the unexpected queue. Later, when an address for the connection is entered into the AV, it fails to find the queued message.

This should occur with both the tcp and verbs providers. I'm guessing Intel MPI doesn't use the same tag matching algorithm that OMPI uses, so it does not see this issue. But it's possible other apps could hit this same issue if source address matching is used.
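To see why the queued message can never be delivered, here is a hedged sketch of a typical tagged-receive match predicate (not the actual rxm/util code). With the values from the prints above, a receive directed at source address 0x3 never matches an unexpected entry that was recorded with FI_ADDR_UNSPEC.

#include <stdbool.h>
#include <stdint.h>
#include <rdma/fabric.h>   /* fi_addr_t, FI_ADDR_UNSPEC */

/* Hedged sketch of the matching asymmetry: a receive posted with
 * FI_ADDR_UNSPEC matches any sender, but a directed receive only matches
 * an entry whose recorded source equals the requested address. */
static bool rx_matches(fi_addr_t rx_addr, uint64_t rx_tag, uint64_t rx_ignore,
                       fi_addr_t msg_addr, uint64_t msg_tag)
{
    if ((rx_tag | rx_ignore) != (msg_tag | rx_ignore))
        return false;
    return rx_addr == FI_ADDR_UNSPEC || rx_addr == msg_addr;
}

/* With the values from the debug prints:
 *   rx_matches(0x3, 0xfffffff6, 0, FI_ADDR_UNSPEC, 0xfffffff6) -> false
 * so the message queued before the AV insert is never delivered and the
 * rank hangs. */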

shefty commented 2 years ago

Please retest with the latest main. Fixes were added to handle a race with simultaneous connection requests, which was the underlying cause of the 3 identified problems.

shefty commented 2 years ago

Fixes available in main.

aingerson commented 2 years ago

Reopening this issue because the problems are persisting. MPICH also shows the same behavior.