pmodels / mpich

Official MPICH Repository
http://www.mpich.org

Aurora: Segfaults when message arrives via shm #7203

Open abagusetty opened 1 week ago

abagusetty commented 1 week ago

Thanks to @raffenet and @hzhou for figuring this out. Posting the reproducer that @raffenet created. This needs an Aurora label. Adding info from internal Slack:

I think we may have a recursive any_source cancel problem. When a message arrives via shm, we attempt to cancel the netmod partner request, but if that cancel fails we then try to cancel the shm partner? kaboom.
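To make the suspected cycle concrete, here is a toy model (all names invented; this is not MPICH code, just the shape of the suspected control flow): an any_source receive is represented by two linked partner requests, a cancel that fails because the target already matched recurses into canceling the other side, and breaking the back-link first (compile with -DFIX) stops the bounce.

#include <stdio.h>
#include <stdbool.h>

/* One request per path (shm / netmod); each holds a link to the
 * other, mirroring the any_source partner-request pairing. */
struct toy_req {
    const char *name;
    bool matched;             /* a message already matched this path */
    struct toy_req *partner;  /* the other path's request */
};

static void try_cancel_partner(struct toy_req *r, int depth);

/* Cancel fails on a request that already matched; on failure that
 * side turns around and tries to cancel *its* partner. */
static void cancel(struct toy_req *r, int depth)
{
    printf("%*scancel(%s)%s\n", 2 * depth, "", r->name,
           r->matched ? " -> already matched, cancel fails" : "");
    if (r->matched)
        try_cancel_partner(r, depth + 1);
}

static void try_cancel_partner(struct toy_req *r, int depth)
{
    struct toy_req *p = r->partner;
    if (!p)
        return;               /* link already broken: no recursion */
#ifdef FIX
    p->partner = NULL;        /* break the back-link before canceling */
#endif
    cancel(p, depth);
}

int main(void)
{
    struct toy_req shm = { "shm", true, NULL };
    struct toy_req nm = { "netmod", true, &shm };
    shm.partner = &nm;

    /* shm matched and cancels its netmod partner; without -DFIX this
     * recurses between the two requests until the stack blows (in
     * MPICH the analogue is touching an already-freed request) */
    try_cancel_partner(&shm, 0);
    return 0;
}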

Also reproducible with upstream MPICH. Backtrace from an app built at this commit: https://github.com/pmodels/mpich/commit/204f8cd396837dc7be1e693484ca1a56ef9d90b4

#0  0x00001519e8def6e3 in MPIDIG_mpi_cancel_recv () from /opt/aurora/24.180.1/spack/unified/0.8.0/install/linux-sles15-x86_64/oneapi-2024.07.30.002/mpich-develop-git.204f8cd-nxqaw43/lib/libmpi.so.12
#1  0x00001519e8def53d in MPIDI_OFI_recv_event () from /opt/aurora/24.180.1/spack/unified/0.8.0/install/linux-sles15-x86_64/oneapi-2024.07.30.002/mpich-develop-git.204f8cd-nxqaw43/lib/libmpi.so.12
#2  0x00001519e8ded4fa in MPIDI_OFI_progress_uninlined () from /opt/aurora/24.180.1/spack/unified/0.8.0/install/linux-sles15-x86_64/oneapi-2024.07.30.002/mpich-develop-git.204f8cd-nxqaw43/lib/libmpi.so.12
#3  0x00001519e8dbcf4e in MPIDI_NM_mpi_cancel_recv () from /opt/aurora/24.180.1/spack/unified/0.8.0/install/linux-sles15-x86_64/oneapi-2024.07.30.002/mpich-develop-git.204f8cd-nxqaw43/lib/libmpi.so.12
#4  0x00001519e8dba0c9 in MPIDIG_send_target_msg_cb () from /opt/aurora/24.180.1/spack/unified/0.8.0/install/linux-sles15-x86_64/oneapi-2024.07.30.002/mpich-develop-git.204f8cd-nxqaw43/lib/libmpi.so.12
#5  0x00001519e8d43bf4 in MPIDI_SHM_progress () from /opt/aurora/24.180.1/spack/unified/0.8.0/install/linux-sles15-x86_64/oneapi-2024.07.30.002/mpich-develop-git.204f8cd-nxqaw43/lib/libmpi.so.12
#6  0x00001519e8d4366a in MPIDI_progress_test () from /opt/aurora/24.180.1/spack/unified/0.8.0/install/linux-sles15-x86_64/oneapi-2024.07.30.002/mpich-develop-git.204f8cd-nxqaw43/lib/libmpi.so.12
#7  0x00001519e8d40a3e in MPIR_Wait () from /opt/aurora/24.180.1/spack/unified/0.8.0/install/linux-sles15-x86_64/oneapi-2024.07.30.002/mpich-develop-git.204f8cd-nxqaw43/lib/libmpi.so.12
#8  0x00001519e8b59282 in PMPI_Recv () from /opt/aurora/24.180.1/spack/unified/0.8.0/install/linux-sles15-x86_64/oneapi-2024.07.30.002/mpich-develop-git.204f8cd-nxqaw43/lib/libmpi.so.12
#9  0x00001519fac944f9 in _progress_server ()
    at /lus/flare/projects/Aurora_deployment/apps_rfm/NWChemEx/exachemdev_mpipr_11-01-2024/TAMM/build_2024.07.30.002-agama996.26-gitmpich/GlobalArrays_External-prefix/src/GlobalArrays_External/comex/src-mpi-pr/comex.c:3429
#10 0x00001519fac81a38 in _comex_init (comm=comm@entry=1140850688)

Reproducer:

#include <stdlib.h>
#include <stdio.h>
#include <mpi.h>

#define COUNT 4

int main(void) {
  MPI_Init(NULL, NULL);

  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  /* sends and receives balance with exactly 3 ranks: two senders
     match rank 0's two any-source receives per iteration */
  int x = 1000000;
  while (x-- > 0) {
    int buf[COUNT] = { 0 };
    if (rank == 0) {
      MPI_Status status1, status2;
      MPI_Recv(buf, COUNT, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &status1);
      MPI_Recv(buf, COUNT, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &status2);
    } else {
      MPI_Send(buf, COUNT, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }
  }

  MPI_Finalize();

  return 0;
}
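For reference, one plausible way to run the reproducer (assuming an MPICH ch4 build with the Hydra launcher; exact flags are site-specific): the sends and receives only balance with exactly 3 ranks, and exercising the shm/netmod interaction needs rank 0 to receive one message over shared memory and one over the network, so the ranks should span two nodes, e.g.

mpiexec -n 3 -ppn 2 ./reproducer

With all three ranks on a single node only the shm path is exercised and the crash may not trigger.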
hzhou commented 1 week ago

Thanks! @abagusetty

hzhou commented 1 week ago

@abagusetty How did you get the backtrace? Do you have the location of the segfault?

abagusetty commented 1 week ago

@hzhou The backtrace was generated from a core dump on Aurora that segfaulted only at large node counts. I can run the app with a debug version of MPICH to get a better backtrace.

hzhou commented 1 week ago

@abagusetty Yeah, that would be helpful. I am curious which line it segfaults on.

raffenet commented 1 week ago

Here's a full backtrace from a debug build of main. The request being canceled in frame 0 is the one that was matched in frame 12.

(gdb) bt
#0  0x0000147172408960 in MPIDIG_mpi_cancel_recv (rreq=0x147172e83e90 <MPIR_Request_direct+496>)
    at ./src/mpid/ch4/src/mpidig_recv.h:377
#1  0x00001471724097d5 in MPIDI_POSIX_mpi_cancel_recv (rreq=0x147172e83e90 <MPIR_Request_direct+496>)
    at ./src/mpid/ch4/shm/src/../posix/posix_recv.h:80
#2  0x000014717240885b in MPIDI_SHM_mpi_cancel_recv (rreq=0x147172e83e90 <MPIR_Request_direct+496>)
    at ./src/mpid/ch4/shm/src/shm_p2p.h:94
#3  0x0000147172407ae6 in MPIDI_anysrc_try_cancel_partner (rreq=0x147172e84650 <MPIR_Request_direct+2480>,
    is_cancelled=0x7ffc4845098c) at ./src/mpid/ch4/src/mpidig_request.h:130
#4  0x0000147172407453 in MPIDI_OFI_recv_event (vci=0, wc=0x7ffc48450a80,
    rreq=0x147172e84650 <MPIR_Request_direct+2480>, event_id=2)
    at ./src/mpid/ch4/netmod/include/../ofi/ofi_events.h:163
#5  0x000014717240719c in MPIDI_OFI_dispatch_optimized (vci=0, wc=0x7ffc48450a80,
    req=0x147172e84650 <MPIR_Request_direct+2480>) at ./src/mpid/ch4/netmod/include/../ofi/ofi_events.h:205
#6  0x0000147172403a9b in MPIDI_OFI_handle_cq_entries (vci=0, wc=0x7ffc48450a50, num=2)
    at ./src/mpid/ch4/netmod/include/../ofi/ofi_progress.h:61
#7  0x0000147172403273 in MPIDI_NM_progress (vci=0, made_progress=0x7ffc48450c08)
    at ./src/mpid/ch4/netmod/include/../ofi/ofi_progress.h:105
#8  0x0000147172403047 in MPIDI_OFI_progress_uninlined (vci=0) at src/mpid/ch4/netmod/ofi/ofi_progress.c:13
#9  0x0000147172344321 in MPIDI_NM_mpi_cancel_recv (rreq=0x147172e84650 <MPIR_Request_direct+2480>,
    is_blocking=true) at ./src/mpid/ch4/netmod/include/../ofi/ofi_recv.h:460
#10 0x0000147172343bd0 in MPIDI_anysrc_try_cancel_partner (rreq=0x147172e83e90 <MPIR_Request_direct+496>,
    is_cancelled=0x7ffc484510e8) at ./src/mpid/ch4/src/mpidig_request.h:108
#11 0x0000147172336be2 in match_posted_rreq (rank=1, tag=0, context_id=0, vci=0, is_local=true,
    req=0x7ffc48451158) at src/mpid/ch4/src/mpidig_pt2pt_callbacks.c:225
#12 0x00001471723365f2 in MPIDIG_send_target_msg_cb (am_hdr=0x147157f168f0, data=0x147157f16920,
    in_data_sz=16, attr=1, req=0x0) at src/mpid/ch4/src/mpidig_pt2pt_callbacks.c:384
#13 0x00001471721e8219 in MPIDI_POSIX_progress_recv (vci=0, made_progress=0x7ffc48451460)
    at ./src/mpid/ch4/shm/src/../posix/posix_progress.h:60
#14 0x00001471721e7eca in MPIDI_POSIX_progress (vci=0, made_progress=0x7ffc48451460)
    at ./src/mpid/ch4/shm/src/../posix/posix_progress.h:147
#15 0x00001471721e7a68 in MPIDI_SHM_progress (vci=0, made_progress=0x7ffc48451460)
    at ./src/mpid/ch4/shm/src/shm_progress.h:18
#16 0x00001471721e6fbc in MPIDI_progress_test (state=0x7ffc48451568)
    at ./src/mpid/ch4/src/ch4_progress.h:142
#17 0x00001471721deafa in MPID_Progress_test (state=0x7ffc48451568)
    at ./src/mpid/ch4/src/ch4_progress.h:241
#18 0x00001471721e0525 in MPID_Progress_wait (state=0x7ffc48451568)
    at ./src/mpid/ch4/src/ch4_progress.h:296
#19 0x00001471721e0446 in MPIR_Wait_state (request_ptr=0x147172e83e90 <MPIR_Request_direct+496>,
    status=0x7ffc4845175c, state=0x7ffc48451568) at src/mpi/request/request_impl.c:707
#20 0x00001471721e09ae in MPID_Wait (request_ptr=0x147172e83e90 <MPIR_Request_direct+496>,
    status=0x7ffc4845175c) at ./src/mpid/ch4/src/ch4_wait.h:100
#21 0x00001471721e0868 in MPIR_Wait (request_ptr=0x147172e83e90 <MPIR_Request_direct+496>,
    status=0x7ffc4845175c) at src/mpi/request/request_impl.c:750
#22 0x0000147171bbc58a in internal_Recv (buf=0x7ffc48451770, count=4, datatype=1275069445, source=-2,
    tag=0, comm=1140850688, status=0x7ffc4845175c) at src/binding/c/pt2pt/recv.c:117
#23 0x0000147171bbb953 in PMPI_Recv (buf=0x7ffc48451770, count=4, datatype=1275069445, source=-2, tag=0,
    comm=1140850688, status=0x7ffc4845175c) at src/binding/c/pt2pt/recv.c:169
#24 0x0000000000401d11 in main () at foo.c:20
hzhou commented 1 week ago

I believe what is happening is: when shmem matches, it tries to cancel the netmod partner, but the netmod can't cancel a request that has already matched, so it instead turns around and cancels the shmem part.

hzhou commented 1 week ago

@raffenet Can you confirm that line 377 is

 if (!MPIR_Request_is_complete(rreq) &&
        !MPIR_STATUS_GET_CANCEL_BIT(rreq->status) && !MPIDIG_REQUEST_IN_PROGRESS(rreq))

? If so, I suspect it segfaults in MPIDIG_REQUEST_IN_PROGRESS(rreq), because MPIDIG_REQUEST(rreq, req) has already been freed, maybe in MPIDIG_send_target_msg_cb.
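For illustration, the suspected failure in miniature (invented names; not the actual MPICH macros): a status macro that chases a per-request pointer faults once that sub-object has been freed and the pointer nulled.

#include <stddef.h>

/* Toy illustration only -- invented names, not MPICH code. */
struct am_state { unsigned status_bits; };
struct toy_rreq { struct am_state *req; };  /* freed/NULLed by a callback */

#define TOY_REQUEST_IN_PROGRESS(r) ((r)->req->status_bits & 0x1u)

int main(void) {
    struct toy_rreq rreq = { .req = NULL };      /* mimics the freed sub-object */
    return (int) TOY_REQUEST_IN_PROGRESS(&rreq); /* NULL deref: segfaults here */
}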

hzhou commented 1 week ago

@raffenet If you try removing that branch altogether -- so it leaks -- will the test run?

EDIT: I guess we need the shmem cancel to work. How about just setting the condition to true?

raffenet commented 1 week ago

Yes, MPIDIG_REQUEST(rreq, req) is NULL according to the backtrace. I'll try removing the IN_PROGRESS check.

hzhou commented 1 week ago

I guess it is somewhat a recursive situation. In https://github.com/pmodels/mpich/blob/6dc849ee3248fcb288bab1be8f0c4c8f1f30ad19/src/mpid/ch4/netmod/ofi/ofi_recv.h#L456-L467, maybe we should reset anysrc_partner before we call progress.

raffenet commented 1 week ago

I think we have to do it inside MPIDI_anysrc_try_cancel_partner. Once we have the partner request, we can unset its link back to the original request and then call cancel on it.

hzhou commented 1 week ago

Give it a try? :)

raffenet commented 1 week ago

I will. Lost my session, but this is my thought

diff --git a/src/mpid/ch4/src/mpidig_request.h b/src/mpid/ch4/src/mpidig_request.h
index 8c2d374e..8e0f16fb 100644
--- a/src/mpid/ch4/src/mpidig_request.h
+++ b/src/mpid/ch4/src/mpidig_request.h
@@ -105,6 +105,8 @@ MPL_STATIC_INLINE_PREFIX int MPIDI_anysrc_try_cancel_partner(MPIR_Request * rreq
                  * ref count here to prevent free since here we will check
                  * the request status */
                 MPIR_Request_add_ref(anysrc_partner);
+                /* unset the partner request's partner to prevent recursive cancelation */
+                anysrc_partner->dev.anysrc_partner = NULL;
                 mpi_errno = MPIDI_NM_mpi_cancel_recv(anysrc_partner, true);     /* blocking */
                 MPIR_ERR_CHECK(mpi_errno);
                 if (!MPIR_STATUS_GET_CANCEL_BIT(anysrc_partner->status)) {

This just causes a deadlock at the first anysrc partner cancel operation 😦. I'll try doing it in the netmod layer before calling it a night.

abagusetty commented 44 minutes ago

@raffenet I just ran the app on 4k nodes of Aurora using your PR (built today) and hit a segfault with a slightly different backtrace than the one posted above:

#0  0x0000149604324ca3 in MPIDIG_mpi_cancel_recv () from /lus/flare/projects/Aurora_deployment/apps_rfm/NWChemEx/software/install_main_11-21-2024/lib/libmpi.so.0
#1  0x0000149604326f0b in MPIDI_OFI_handle_cq_entries () from /lus/flare/projects/Aurora_deployment/apps_rfm/NWChemEx/software/install_main_11-21-2024/lib/libmpi.so.0
#2  0x00001496043268bc in MPIDI_NM_progress () from /lus/flare/projects/Aurora_deployment/apps_rfm/NWChemEx/software/install_main_11-21-2024/lib/libmpi.so.0
#3  0x0000149604325656 in MPIDI_progress_test () from /lus/flare/projects/Aurora_deployment/apps_rfm/NWChemEx/software/install_main_11-21-2024/lib/libmpi.so.0
#4  0x000014960432293e in MPIR_Wait () from /lus/flare/projects/Aurora_deployment/apps_rfm/NWChemEx/software/install_main_11-21-2024/lib/libmpi.so.0
#5  0x0000149604145fe2 in PMPI_Recv () from /lus/flare/projects/Aurora_deployment/apps_rfm/NWChemEx/software/install_main_11-21-2024/lib/libmpi.so.0
#6  0x0000149617a5f4e9 in _progress_server ()
    at /lus/flare/projects/Aurora_deployment/apps_rfm/NWChemEx/exachemdev_mpipr_11-22-2024/TAMM/build_2024.07.30.002-agama996.26-gitmpich/GlobalArrays_External-prefix/src/GlobalArrays_External/comex/src-mpi-pr/comex.c:3429
#7  0x0000149617a4ca28 in _comex_init (comm=comm@entry=1140850688)

The failing call on the app side is still the same, pointing to any_source usage.

abagusetty commented 44 minutes ago

Not sure if I was testing the PR prematurely.

raffenet commented 4 minutes ago

@abagusetty thanks for trying it out. I will see if I can reproduce and update the PR if I find the issue.