open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Message queues debugging not working #618

Closed alejandro-all closed 9 years ago

alejandro-all commented 9 years ago

We (Allinea) are experiencing a problem when debugging the message queues:

Summary: Message queue debugging is broken on recent Open MPI versions.

Affected Open MPI versions: 1.8.3, 1.8.4 and 1.8.5 (at least). The debug message queue library does not return any pending messages, even though some processes are blocked in MPI_Recv. This does not happen with previous versions of Open MPI.

Which message queue we want to iterate over is selected by the enum 'mqs_op_class'; its values are mqs_pending_sends, mqs_pending_receives, and mqs_unexpected_messages.

When setting up the corresponding queue iterator (with "mqs_setup_operation_iterator"), the return value is zero (== mqs_ok), but when we try to read the list (with "mqs_next_operation"), the return value is always non-zero, so no pending messages are found.

This happens on several platforms, and the same symptoms can be seen in both TotalView and Allinea DDT.
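
For reference, here is a minimal sketch of how a debugger-side tool drives these two entry points. The mqs_* names follow the classic TotalView/Etnus message-queue interface that the Open MPI plugin implements; the header name and the origin of the mqs_process handle are assumptions of this sketch, not part of the report:

/* Minimal sketch: walk the pending-receive queue of one process.
 * The mqs_process handle is assumed to have been prepared by the
 * debugger (via mqs_setup_process) before this is called. */
#include "mpi_interface.h"   /* classic Etnus/TotalView mqs header */

static int count_pending_receives(mqs_process *process)
{
    mqs_pending_operation op;
    int n = 0;

    /* Select which queue to walk: pending sends, pending receives,
     * or unexpected messages (the mqs_op_class values named above). */
    if (mqs_setup_operation_iterator(process, mqs_pending_receives) != mqs_ok) {
        return -1;
    }

    /* On the affected 1.8.x releases this loop exits immediately,
     * so n stays 0 even while ranks are blocked in MPI_Recv. */
    while (mqs_next_operation(process, &op) == mqs_ok) {
        n++;
    }

    return n;
}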

rhc54 commented 9 years ago

@jsquyres Could you please take a look?

ggouaillardet commented 9 years ago

It looks like all of v1.8 is affected. This is likely due to the different implementation of ompi_free_list_t that was not reflected in the debugger plugin. I will try to dig a bit more.

ggouaillardet commented 9 years ago

@alejandro-all with Open MPI v1.8 and above, the debugger does not "see" the messages that are being sent by MPI_Send or received by MPI_Recv. If you replace

MPI_Send(...)
MPI_Recv(...)

with

{ MPI_Request req; MPI_Isend(...,&req); MPI_Wait(&req,...); }
{ MPI_Request req; MPI_Irecv(...,&req); MPI_Wait(&req,...); }

then the debugger can find the pending messages.
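
For completeness, here is a minimal self-contained program using that workaround (standard point-to-point arguments; run with two ranks). It is only an illustration of the non-blocking replacement described above:

/* rank 0 sends one int to rank 1 using MPI_Isend/MPI_Wait instead of
 * MPI_Send, so the request goes through the free list and stays visible
 * to the debugger plugin while it is pending. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 42;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (0 == rank) {
        MPI_Isend(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    } else if (1 == rank) {
        MPI_Irecv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}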

The root cause is that the debugger plugin is not (yet) aware of an optimization in mca_pml_ob1_send/mca_pml_ob1_recv, in which the request is now allocated on the stack instead of being retrieved from an ompi_free_list_t.

A possible solution is to add two new symbols

mca_pml_base_send_request_t * mca_pml_base_pending_send_request = NULL;
mca_pml_base_recv_request_t * mca_pml_base_pending_recv_request = NULL;

and make sure the iterator checks whether there is something there, in addition to walking the mca_pml_base_{send,recv}_requests lists.

/* and use an array of such pointers to support MPI_THREAD_MULTIPLE */

Another option is to add an MCA parameter so that MCA_PML_OB1_RECV_REQUEST_ALLOC is always used.

A variant is to use alloca() when MPIR_being_debugged is false, and MCA_PML_OB1_RECV_REQUEST_ALLOC otherwise.

So far, MPIR_being_debugged is always false, and I could not even find a callback (mqs_store_data?) to have it set automatically by mqs_setup_process.

@jsquyres is there a way to have the debugger plugin automatically set MPIR_being_debugged? Or should I expect the debugger itself (e.g. DDT, TotalView, ...) to do that automatically? /* and even assuming there is one, that does not handle messages sent/received before the task was attached by the debugger */

In the meantime, attached is a proof of concept (to be applied to ompi v1.8). When starting an MPI program under the debugger, it is necessary to manually set MPIR_being_debugged to a non-zero value (unless, of course, the debugger is modified to do that automatically).

@alejandro-all could you please give it a try and report any remaining issues?

diff --git a/ompi/mca/pml/ob1/pml_ob1_irecv.c b/ompi/mca/pml/ob1/pml_ob1_irecv.c
index 88174dd..71f1ec0 100644
--- a/ompi/mca/pml/ob1/pml_ob1_irecv.c
+++ b/ompi/mca/pml/ob1/pml_ob1_irecv.c
@@ -15,6 +15,8 @@
  * Copyright (c) 2010-2012 Oracle and/or its affiliates.  All rights reserved.
  * Copyright (c) 2011      Sandia National Laboratories. All rights reserved.
  * Copyright (c) 2014 Cisco Systems, Inc.  All rights reserved.
+ * Copyright (c) 2015      Research Organization for Information Science
+ *                         and Technology (RIST). All rights reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -28,6 +30,7 @@
 #include "pml_ob1_recvfrag.h"
 #include "ompi/peruse/peruse-internal.h"
 #include "ompi/message/message.h"
+#include "ompi/debuggers/debuggers.h"
 #if HAVE_ALLOCA_H
 #include <alloca.h>
 #endif  /* HAVE_ALLOCA_H */
@@ -92,11 +95,17 @@ int mca_pml_ob1_recv(void *addr,
                      struct ompi_communicator_t *comm,
                      ompi_status_public_t * status)
 {
-    mca_pml_ob1_recv_request_t *recvreq =
-        alloca(mca_pml_base_recv_requests.fl_frag_size);
+    mca_pml_ob1_recv_request_t *recvreq;
     int rc;

-    OBJ_CONSTRUCT(recvreq, mca_pml_ob1_recv_request_t);
+    if (MPIR_being_debugged) {
+        MCA_PML_OB1_RECV_REQUEST_ALLOC(recvreq);
+        if (NULL == recvreq)
+            return OMPI_ERR_TEMP_OUT_OF_RESOURCE;
+    } else {
+        recvreq = alloca(mca_pml_base_recv_requests.fl_frag_size);
+        OBJ_CONSTRUCT(recvreq, mca_pml_ob1_recv_request_t);
+    }

     MCA_PML_OB1_RECV_REQUEST_INIT(recvreq, addr, count, datatype,
                                   src, tag, comm, false);
@@ -113,8 +122,12 @@ int mca_pml_ob1_recv(void *addr,
     }

     rc = recvreq->req_recv.req_base.req_ompi.req_status.MPI_ERROR;
-    MCA_PML_BASE_RECV_REQUEST_FINI(&recvreq->req_recv);
-    OBJ_DESTRUCT(recvreq);
+    if (MPIR_being_debugged) {
+        MCA_PML_OB1_RECV_REQUEST_RETURN(recvreq);
+    } else {
+        MCA_PML_BASE_RECV_REQUEST_FINI(&recvreq->req_recv);
+        OBJ_DESTRUCT(recvreq);
+    }

     return rc;
 }
diff --git a/ompi/mca/pml/ob1/pml_ob1_isend.c b/ompi/mca/pml/ob1/pml_ob1_isend.c
index 66734b6..723f79b 100644
--- a/ompi/mca/pml/ob1/pml_ob1_isend.c
+++ b/ompi/mca/pml/ob1/pml_ob1_isend.c
@@ -28,6 +28,7 @@
 #include "pml_ob1_sendreq.h"
 #include "pml_ob1_recvreq.h"
 #include "ompi/peruse/peruse-internal.h"
+#include "ompi/debuggers/debuggers.h"
 #if HAVE_ALLOCA_H
 #include <alloca.h>
 #endif  /* HAVE_ALLOCA_H */
@@ -189,8 +190,7 @@ int mca_pml_ob1_send(void *buf,
     ompi_proc_t *dst_proc = ompi_comm_peer_lookup (comm, dst);
     mca_bml_base_endpoint_t* endpoint = (mca_bml_base_endpoint_t*)
                                         dst_proc->proc_endpoints[OMPI_PROC_ENDPOINT_TAG_BML];
-    mca_pml_ob1_send_request_t *sendreq =
-        alloca(mca_pml_base_send_requests.fl_frag_size);
+    mca_pml_ob1_send_request_t *sendreq; 
     int16_t seqn;
     int rc;

@@ -222,7 +222,14 @@ int mca_pml_ob1_send(void *buf,
         }
     }

-    OBJ_CONSTRUCT(sendreq, mca_pml_ob1_send_request_t);
+    if (MPIR_being_debugged) {
+        MCA_PML_OB1_SEND_REQUEST_ALLOC(comm, dst, sendreq);
+        if (NULL == sendreq)
+            return OMPI_ERR_TEMP_OUT_OF_RESOURCE;
+    } else {
+        sendreq = alloca(mca_pml_base_send_requests.fl_frag_size);
+        OBJ_CONSTRUCT(sendreq, mca_pml_ob1_send_request_t);
+    }
     sendreq->req_send.req_base.req_proc = dst_proc;
     sendreq->src_des = NULL;

@@ -242,9 +249,15 @@ int mca_pml_ob1_send(void *buf,
         ompi_request_wait_completion(&sendreq->req_send.req_base.req_ompi);

         rc = sendreq->req_send.req_base.req_ompi.req_status.MPI_ERROR;
-        MCA_PML_BASE_SEND_REQUEST_FINI(&sendreq->req_send);
+        if (MPIR_being_debugged) {
+            MCA_PML_OB1_SEND_REQUEST_RETURN(sendreq);
+        } else {
+            MCA_PML_BASE_RECV_REQUEST_FINI(&sendreq->req_send);
+        }
+    }
+    if (!MPIR_being_debugged) {
+        OBJ_DESTRUCT(sendreq);
     }
-    OBJ_DESTRUCT(sendreq);

     return rc;
 }

bosilca commented 9 years ago

@ggouaillardet there is a comment in orterun.c:2421 about the integration of MPIR_being_debugged and the different parallel debuggers available.

hppritcha commented 9 years ago

I'm not keen on having changes in the behavior of ompi - especially in the critical path for messages - when using a debugger.

jsquyres commented 9 years ago

I think @ggouaillardet's patch is in the right spirit, but not quite correct (I have a license for DDT, so I can test the message queue functionality). He's right that the optimization to alloca() is not putting the pending requests in a place that the debugger can see.

However, I cannot find any variable that is set by the debugger in the MPI processes indicating that a debugger has attached (MPIR_being_debugged and friends are set in mpirun, not the MPI processes). According to http://www.mpi-forum.org/docs/mpir-specification-10-11-2010.pdf page 13, the only two variables set in the MPI processes are MPIR_debug_gate and MPIR_debug_dll, neither of which is useful here (MPIR_debug_gate is only used in MPI_INIT; it is not used if the debugger attaches after MPI_INIT has completed).

I talked to @rhc54 briefly this afternoon, and he swears that there should be a volatile variable set in the MPI process indicating whether a debugger is currently attached or not. But I can't find any such variable (I even tried creating MPIR_debugger_attached, and it was never set to 1 by DDT).

However, I think there may be a workaround: in mqs_setup_process() in ompi_msgq_dll.c, we can set an OMPI-specific global volatile variable to 1 in the MPI process -- i.e., the equivalent of ompi_debugger_attached=1. And then mqs_destroy_process_info() can reset this value back to 0 -- i.e., the equivalent of ompi_debugger_attached=0.

I'll see if I can make this work.

If it works, it would allow us to use something similar to @ggouaillardet's patch. However, there are two problems with the patch as it is:

  1. Shouldn't the MCA_PML_BASE_RECV_REQUEST_FINI in the final hunk be MCA_PML_BASE_SEND_REQUEST_FINI? I think that's a typo in the patch.
  2. The debugger could be attached when the request is created and then detached when the request is freed (or vice versa). In Gilles' patch, the request would then be OB1_SEND_REQUEST_ALLOCed but OB1_SEND_REQUEST_FINIed (instead of RETURNed), or vice versa. I think that when the request is created, it needs to save a flag indicating whether it was malloc'ed or alloca'ed so that it knows how to free itself when it is done (see the sketch below).
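
To make the hazard in item 2 concrete, here is a rough standalone sketch of the "remember how the request was allocated" idea. All of the names (request_t, debugger_attached, REQUEST_CREATE, ...) are illustrative stand-ins, not the actual ob1 symbols:

#include <stdbool.h>
#include <stdlib.h>
#include <alloca.h>   /* assumes a platform that provides alloca.h */

typedef struct request {
    bool from_heap;   /* recorded once at allocation time, never re-derived */
    int  payload;
} request_t;

static bool debugger_attached;   /* stand-in for MPIR_being_debugged */

/* Expanded in the function that owns the blocking operation, so the
 * alloca()ed storage lives until that function returns.
 * Error handling is omitted for brevity. */
#define REQUEST_CREATE(reqp)                              \
    do {                                                  \
        if (debugger_attached) {                          \
            (reqp) = malloc(sizeof(request_t));           \
            (reqp)->from_heap = true;                     \
        } else {                                          \
            (reqp) = alloca(sizeof(request_t));           \
            (reqp)->from_heap = false;                    \
        }                                                 \
    } while (0)

static void request_release(request_t *req)
{
    /* Correct even if debugger_attached changed since REQUEST_CREATE. */
    if (req->from_heap) {
        free(req);
    }
    /* alloca()ed storage is reclaimed when the owning frame returns. */
}
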
jsquyres commented 9 years ago

...well, it turns out that that was a lovely theory, but there's no mqs functionality to set a value in a process (you can only read values). :-\

So I'm not quite sure what the right answer is here. We know what the problem is: that the pending blocking requests are not being put on the global lists (because it's useful/faster to not do so). Should we have an MCA param to disable this behavior (i.e., effectively s/MPIR_being_debugged/ompi_want_debugger_msgq/, or somesuch)?

hppritcha commented 9 years ago

I think the env. variable approach is okay. What I would not want - and it sounds like it's not possible anyway - would be for ompi behavior to change under the covers when run with a debugger attached.



ggouaillardet commented 9 years ago

@bosilca already pointed to orterun.c:2421; it seems MPIR_being_debugged is not required to be set by the debugger (TotalView does it, DDT does not).

Anyway, MPIR_being_debugged is only good for tasks started under the debugger (e.g. if I attach to an MPI app to debug a deadlock, some send/recv requests might be missing from the debugger queue).

I will revamp my patch, investigate the typo, and add a new MCA variable.

rhc54 commented 9 years ago

Well, the MCA var is fine - for when you are launched via mpirun and know you are being debugged. However, it is useless for the case where you are trying to attach a debugger to a running job. So as long as we somehow explain to users that attaching to a running job might not work, this is okay.

ggouaillardet commented 9 years ago

I'd rather put it this way: before running an app, if the user thinks he/she might want to attach a debugger to it, then use the MCA param to make sure no send/recv will be missing. If not, then some send/recv requests might be missing.

Another option is to maintain a "list" for these requests to make sure they will never be missing (rough sketch below). Without MPI_THREAD_MULTIPLE, the "list" is just a pointer; with MPI_THREAD_MULTIPLE, the "list" can be a doubly linked list or an array of pointers (indexed by thread id). Of course, there is some overhead with this approach, and it conflicts with the alloca optimization...
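
A rough standalone sketch of that "extra list" idea in its simplest form, with one published slot per thread so no locking is needed; all names here are hypothetical, not symbols from the Open MPI tree:

#include <stddef.h>

#define MAX_DEBUG_THREADS 128

/* Global, so a message-queue DLL could locate it by symbol name and read
 * one slot per thread; NULL means "no blocking request in flight". */
void *pending_blocking_send_request[MAX_DEBUG_THREADS];

/* Called right after the stack-allocated request is constructed. */
static void publish_pending_send(int thread_id, void *stack_request)
{
    pending_blocking_send_request[thread_id] = stack_request;
}

/* Called right before the blocking send returns. */
static void unpublish_pending_send(int thread_id)
{
    pending_blocking_send_request[thread_id] = NULL;
}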

rhc54 commented 9 years ago

I'll bring this up on the call tomorrow - this is a very significant change in behavior that is going to cause considerable upset. Most users attach debuggers when their app runs into a problem - requiring them to kill the app and restart it under the debugger means the problem may not reproduce.

We should find a better solution, even if it means dumping that optimization.

hppritcha commented 9 years ago

I agree with Ralph's last comment. Excluding the vader btl, what does this optimization buy us anyway?


ggouaillardet commented 9 years ago

@hppritcha this optimization is in pml/ob1, so it is BTL-agnostic.

Here is the related git log (the first part is relevant here):

@hjelmn feel free to comment ...

commit 2b57f4227ebfcfd5f86fdb6ac37c6bb81cecf521
Author: Nathan Hjelm <hjelmn@lanl.gov>
Date:   Tue Jan 21 15:16:21 2014 +0000

    ob1: optimize blocking send and receive paths

    Per RFC. There are two optimizations in this commit:

     - Allocate requests for blocking sends and receives on the stack. This
       bypasses the request free list and saves two atomics on the critical path.
       This change improves the small message ping-pong by 50-200ns on both AMD
       and Intel CPUs.

     - For small messages try to use the btl sendi function before intializing a
       send request. If the sendi fails or the btl does not have a sendi function
       silently fallback on the standard send path.

    cmr=v1.7.5:reviewer=brbarret

    This commit was SVN r30343.
jsquyres commented 9 years ago

@ggouaillardet We talked about this on the call today. Here's the suggested resolution:

  1. For v1.8.x / v1.10.x: @bosilca is going to code up a solution that is similar to what you said previously: we'll freelist-allocate a request that can be used for all blocking operations (there will only be one at a time, since there is no decent THREAD_MULTIPLE support in v1.8/v1.10). We can put in a check so that if the request is already active, we fall back to the slow path (i.e., freelist-allocate a new request); a rough sketch of this pattern appears after this list. He should have this ready by tonight / tomorrow morning. I'll test it out with DDT.
  2. For master / v2.x: We might apply the above-mentioned fix as a stopgap, but the better solution is to revamp how we check the queues. We can do this by making all requests have Fortran identifiers. Then the debugger message queue DLL doesn't have to traverse freelists -- it only has to traverse the Fortran MPI_Request ID array. There is some concern with this approach, however -- it may be expensive to allocate a Fortran ID. Need to look into this.
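
As context for item 1, here is a rough standalone sketch of the "one cached request, fall back to the slow path when it is busy" pattern. The names are illustrative, not the actual ob1 symbols, and the real free-list slow path is stood in for by calloc/free:

#include <stdbool.h>
#include <stdlib.h>

typedef struct {
    bool active;
    /* real request state would live here */
} blocking_request_t;

/* One pre-allocated request, reachable by the debugger through a symbol.
 * Under THREAD_MULTIPLE the check-and-set below would need to be atomic. */
static blocking_request_t cached_blocking_request;

static blocking_request_t *get_blocking_request(void)
{
    if (!cached_blocking_request.active) {
        cached_blocking_request.active = true;   /* fast path: reuse the cache */
        return &cached_blocking_request;
    }
    /* Already in use (concurrent blocking call): take the slow path. */
    return calloc(1, sizeof(blocking_request_t));
}

static void put_blocking_request(blocking_request_t *req)
{
    if (req == &cached_blocking_request) {
        req->active = false;                     /* back to the cache */
    } else {
        free(req);                               /* came from the slow path */
    }
}
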
bosilca commented 9 years ago

Branch topic/message_queues in @bosilca/ompi. It should work in the single-threaded case; the thread-multiple fix will come shortly. However, due to the existence of the immediate send, there are cases where a request is not even allocated on the send path, so this case is not covered at all. One should note that immediate sends are supposed to be extremely quick when they succeed, so this might not be a big issue.

rhc54 commented 9 years ago

@bosilca will you be bringing this across soon? I'm being asked about when 1.8.6 is going to release, and would like to do rc2 this weekend if possible.

bosilca commented 9 years ago

I'm waiting for someone (hint: @jsquyres) to validate the existing patch. I have some trouble with tv8 in my environment, so I would definitely appreciate some help with the testing.

alejandro-all commented 9 years ago

If you point me to the patched source code, I can test this with DDT.

rhc54 commented 9 years ago

It was in the earlier note: https://github.com/bosilca/ompi/tree/topic/message_queues

jsquyres commented 9 years ago

Sorry for the delay. I'm compiling / testing now...

jsquyres commented 9 years ago

The functionality seems to work with single-threaded MPI apps (excluding unexpected messages, of course... we should file a ticket for that...).

I left some minor comments on the commit itself, and asked about THREAD_MULTIPLE support: don't we need to handle that case (perhaps sub-optimally) on v1.8/v1.10, too?

bosilca commented 9 years ago

Support for THREAD_MULTIPLE has been added, as well as OPAL_UNLIKELY.


jsquyres commented 9 years ago

Looks good; thanks. Will you file a PR for v1.8 and v1.10?

bosilca commented 9 years ago

Feel free to do the PR, I'll be off the grid 'til Sunday.


rhc54 commented 9 years ago

@bosilca @jsquyres I have it queued up as a PR to master. I will PR to the release branches once that is merged.

jsquyres commented 9 years ago

Fixed by #641 on master. Ralph is filing separate PRs for v1.8 and v1.10.

alejandro-all commented 9 years ago

1.8.6 tested with Allinea DDT and the issue is fixed. Thank you, guys!