abagusetty opened 1 week ago
Thanks! @abagusetty
@abagusetty How did you get the backtrace? Do you have the location of the segfault?
@hzhou The backtrace was generated from a core dump on Aurora that segfaulted only at large node counts. I could run the app with a debug version of MPICH and get a better backtrace.
@abagusetty Yeah, that will be helpful. I am curious which line segfaults.
Here's a full backtrace from a debug build of main. The request being canceled in frame #0 is the one that was matched in frame #12.
(gdb) bt
#0 0x0000147172408960 in MPIDIG_mpi_cancel_recv (rreq=0x147172e83e90 <MPIR_Request_direct+496>)
at ./src/mpid/ch4/src/mpidig_recv.h:377
#1 0x00001471724097d5 in MPIDI_POSIX_mpi_cancel_recv (rreq=0x147172e83e90 <MPIR_Request_direct+496>)
at ./src/mpid/ch4/shm/src/../posix/posix_recv.h:80
#2 0x000014717240885b in MPIDI_SHM_mpi_cancel_recv (rreq=0x147172e83e90 <MPIR_Request_direct+496>)
at ./src/mpid/ch4/shm/src/shm_p2p.h:94
#3 0x0000147172407ae6 in MPIDI_anysrc_try_cancel_partner (rreq=0x147172e84650 <MPIR_Request_direct+2480>,
is_cancelled=0x7ffc4845098c) at ./src/mpid/ch4/src/mpidig_request.h:130
#4 0x0000147172407453 in MPIDI_OFI_recv_event (vci=0, wc=0x7ffc48450a80,
rreq=0x147172e84650 <MPIR_Request_direct+2480>, event_id=2)
at ./src/mpid/ch4/netmod/include/../ofi/ofi_events.h:163
#5 0x000014717240719c in MPIDI_OFI_dispatch_optimized (vci=0, wc=0x7ffc48450a80,
req=0x147172e84650 <MPIR_Request_direct+2480>) at ./src/mpid/ch4/netmod/include/../ofi/ofi_events.h:205
#6 0x0000147172403a9b in MPIDI_OFI_handle_cq_entries (vci=0, wc=0x7ffc48450a50, num=2)
at ./src/mpid/ch4/netmod/include/../ofi/ofi_progress.h:61
#7 0x0000147172403273 in MPIDI_NM_progress (vci=0, made_progress=0x7ffc48450c08)
at ./src/mpid/ch4/netmod/include/../ofi/ofi_progress.h:105
#8 0x0000147172403047 in MPIDI_OFI_progress_uninlined (vci=0) at src/mpid/ch4/netmod/ofi/ofi_progress.c:13
#9 0x0000147172344321 in MPIDI_NM_mpi_cancel_recv (rreq=0x147172e84650 <MPIR_Request_direct+2480>,
is_blocking=true) at ./src/mpid/ch4/netmod/include/../ofi/ofi_recv.h:460
#10 0x0000147172343bd0 in MPIDI_anysrc_try_cancel_partner (rreq=0x147172e83e90 <MPIR_Request_direct+496>,
is_cancelled=0x7ffc484510e8) at ./src/mpid/ch4/src/mpidig_request.h:108
#11 0x0000147172336be2 in match_posted_rreq (rank=1, tag=0, context_id=0, vci=0, is_local=true,
req=0x7ffc48451158) at src/mpid/ch4/src/mpidig_pt2pt_callbacks.c:225
#12 0x00001471723365f2 in MPIDIG_send_target_msg_cb (am_hdr=0x147157f168f0, data=0x147157f16920,
in_data_sz=16, attr=1, req=0x0) at src/mpid/ch4/src/mpidig_pt2pt_callbacks.c:384
#13 0x00001471721e8219 in MPIDI_POSIX_progress_recv (vci=0, made_progress=0x7ffc48451460)
at ./src/mpid/ch4/shm/src/../posix/posix_progress.h:60
#14 0x00001471721e7eca in MPIDI_POSIX_progress (vci=0, made_progress=0x7ffc48451460)
at ./src/mpid/ch4/shm/src/../posix/posix_progress.h:147
#15 0x00001471721e7a68 in MPIDI_SHM_progress (vci=0, made_progress=0x7ffc48451460)
at ./src/mpid/ch4/shm/src/shm_progress.h:18
#16 0x00001471721e6fbc in MPIDI_progress_test (state=0x7ffc48451568)
at ./src/mpid/ch4/src/ch4_progress.h:142
#17 0x00001471721deafa in MPID_Progress_test (state=0x7ffc48451568)
at ./src/mpid/ch4/src/ch4_progress.h:241
#18 0x00001471721e0525 in MPID_Progress_wait (state=0x7ffc48451568)
at ./src/mpid/ch4/src/ch4_progress.h:296
#19 0x00001471721e0446 in MPIR_Wait_state (request_ptr=0x147172e83e90 <MPIR_Request_direct+496>,
status=0x7ffc4845175c, state=0x7ffc48451568) at src/mpi/request/request_impl.c:707
#20 0x00001471721e09ae in MPID_Wait (request_ptr=0x147172e83e90 <MPIR_Request_direct+496>,
status=0x7ffc4845175c) at ./src/mpid/ch4/src/ch4_wait.h:100
#21 0x00001471721e0868 in MPIR_Wait (request_ptr=0x147172e83e90 <MPIR_Request_direct+496>,
status=0x7ffc4845175c) at src/mpi/request/request_impl.c:750
#22 0x0000147171bbc58a in internal_Recv (buf=0x7ffc48451770, count=4, datatype=1275069445, source=-2,
tag=0, comm=1140850688, status=0x7ffc4845175c) at src/binding/c/pt2pt/recv.c:117
#23 0x0000147171bbb953 in PMPI_Recv (buf=0x7ffc48451770, count=4, datatype=1275069445, source=-2, tag=0,
comm=1140850688, status=0x7ffc4845175c) at src/binding/c/pt2pt/recv.c:169
#24 0x0000000000401d11 in main () at foo.c:20
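Frame #24 is the application-side call: an MPI_ANY_SOURCE receive of 4 ints with tag 0 on MPI_COMM_WORLD. A minimal sketch of that pattern (illustrative only, not the actual foo.c), run with ranks spread over at least two nodes so both the shmem and OFI paths can match the wildcard:

#include <mpi.h>

/* Illustrative sketch only: rank 0 posts wildcard receives that can be
 * matched either by a same-node sender via shmem or by a remote sender
 * via OFI, which is the race the backtrace above hits. */
int main(int argc, char **argv)
{
    int rank, size, buf[4] = {0};
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        for (int i = 1; i < size; i++) {
            /* matches frame #24: count=4, MPI_INT, source=MPI_ANY_SOURCE, tag=0 */
            MPI_Recv(buf, 4, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    } else {
        buf[0] = rank;
        MPI_Send(buf, 4, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}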
I believe what is happening is: when shmem matches, it tries to cancel the netmod partner, but the netmod can't cancel a request that has already matched, so it instead tries to cancel the shmem part. You can see MPIDI_anysrc_try_cancel_partner appearing twice in the backtrace (frames #10 and #3).
@raffenet Can you confirm that line 377 is the following?
if (!MPIR_Request_is_complete(rreq) &&
    !MPIR_STATUS_GET_CANCEL_BIT(rreq->status) && !MPIDIG_REQUEST_IN_PROGRESS(rreq))
If so, I suspect it segfaults in MPIDIG_REQUEST_IN_PROGRESS(rreq), because MPIDIG_REQUEST(rreq, req) has already been freed, maybe in MPIDIG_send_target_msg_cb.
@raffenet If you try removing that branch altogether -- so it leaks -- will the test run?
EDIT: I guess we need the shmem cancel to work. How about just setting the condition to true?
Yes, MPIDIG_REQUEST(rreq, req) is NULL according to the backtrace. I'll try removing the IN_PROGRESS check.
I guess it is somewhat of a recursive situation. In https://github.com/pmodels/mpich/blob/6dc849ee3248fcb288bab1be8f0c4c8f1f30ad19/src/mpid/ch4/netmod/ofi/ofi_recv.h#L456-L467, maybe we should reset anysrc_partner before we call progress.
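Roughly what I have in mind -- an untested sketch, not a verified diff against ofi_recv.h (the anysrc_partner field name follows the existing ch4 request struct):

/* Untested sketch: sever the anysrc pairing before the blocking progress
 * loop in MPIDI_NM_mpi_cancel_recv, so a match delivered during progress
 * cannot re-enter cancel on the partner request. */
if (rreq->dev.anysrc_partner) {
    /* the (shmem) partner no longer points back at this netmod request */
    rreq->dev.anysrc_partner->dev.anysrc_partner = NULL;
    /* and this request no longer points at the partner */
    rreq->dev.anysrc_partner = NULL;
}
/* ... the existing cancel and progress loop follow ... */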
I think we have to do it inside MPIDI_anysrc_try_cancel_partner. Once we have the partner request, we can unset its link back to the original request and then call cancel on it.
Give it a try? :)
I will. Lost my session, but this is my thought:
diff --git a/src/mpid/ch4/src/mpidig_request.h b/src/mpid/ch4/src/mpidig_request.h
index 8c2d374e..8e0f16fb 100644
--- a/src/mpid/ch4/src/mpidig_request.h
+++ b/src/mpid/ch4/src/mpidig_request.h
@@ -105,6 +105,8 @@ MPL_STATIC_INLINE_PREFIX int MPIDI_anysrc_try_cancel_partner(MPIR_Request * rreq
* ref count here to prevent free since here we will check
* the request status */
MPIR_Request_add_ref(anysrc_partner);
+ /* unset the partner request's partner to prevent recursive cancellation */
+ anysrc_partner->dev.anysrc_partner = NULL;
mpi_errno = MPIDI_NM_mpi_cancel_recv(anysrc_partner, true); /* blocking */
MPIR_ERR_CHECK(mpi_errno);
if (!MPIR_STATUS_GET_CANCEL_BIT(anysrc_partner->status)) {
This just causes a deadlock at the first anysrc partner cancel operation 😦. I'll try doing it in the netmod layer before calling it a night.
@raffenet I just ran the app on 4k nodes of Aurora using your PR (built today) and hit a segfault with a slightly different backtrace than the one posted above:
#0 0x0000149604324ca3 in MPIDIG_mpi_cancel_recv () from /lus/flare/projects/Aurora_deployment/apps_rfm/NWChemEx/software/install_main_11-21-2024/lib/libmpi.so.0
#1 0x0000149604326f0b in MPIDI_OFI_handle_cq_entries () from /lus/flare/projects/Aurora_deployment/apps_rfm/NWChemEx/software/install_main_11-21-2024/lib/libmpi.so.0
#2 0x00001496043268bc in MPIDI_NM_progress () from /lus/flare/projects/Aurora_deployment/apps_rfm/NWChemEx/software/install_main_11-21-2024/lib/libmpi.so.0
#3 0x0000149604325656 in MPIDI_progress_test () from /lus/flare/projects/Aurora_deployment/apps_rfm/NWChemEx/software/install_main_11-21-2024/lib/libmpi.so.0
#4 0x000014960432293e in MPIR_Wait () from /lus/flare/projects/Aurora_deployment/apps_rfm/NWChemEx/software/install_main_11-21-2024/lib/libmpi.so.0
#5 0x0000149604145fe2 in PMPI_Recv () from /lus/flare/projects/Aurora_deployment/apps_rfm/NWChemEx/software/install_main_11-21-2024/lib/libmpi.so.0
#6 0x0000149617a5f4e9 in _progress_server ()
at /lus/flare/projects/Aurora_deployment/apps_rfm/NWChemEx/exachemdev_mpipr_11-22-2024/TAMM/build_2024.07.30.002-agama996.26-gitmpich/GlobalArrays_External-prefix/src/GlobalArrays_External/comex/src-mpi-pr/comex.c:3429
#7 0x0000149617a4ca28 in _comex_init (comm=comm@entry=1140850688)
The API call the app complains about is still the same, pointing to any_source usage.
Not sure if I was prematurely testing the PR.
@abagusetty thanks for trying it out. I will see if I can reproduce and update the PR if I find the issue.
Thanks to @raffenet and @hzhou for figuring it out. The reproducer was created by @raffenet. Needs an Aurora label. Adding info from internal Slack:
Also reproducible with upstream commits. Backtrace from an app running with the commit: https://github.com/pmodels/mpich/commit/204f8cd396837dc7be1e693484ca1a56ef9d90b4
Reproducer: