@jsquyres Could you please take a look?
It looks like nothing on v1.8 works. This is likely due to the different implementation of ompi_free_list_t that was not reflected in the debugger plugin. I will try to dig a bit more.
@alejandro-all With Open MPI v1.8 and above, the debugger does not "see" the messages that are being sent by MPI_Send or received by MPI_Recv. If you replace
MPI_Send(...)
MPI_Recv(...)
with
{ MPI_Request req; MPI_Isend(..., &req); MPI_Wait(&req, ...); }
{ MPI_Request req; MPI_Irecv(..., &req); MPI_Wait(&req, ...); }
then the debugger can find the pending messages.
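For reference, a minimal, self-contained version of that workaround (the helper names debug_visible_send/debug_visible_recv are illustrative only, not part of Open MPI):

#include <mpi.h>

/* Replace a blocking send/recv with the nonblocking equivalent plus MPI_Wait,
 * so the request goes through the free lists and the message-queue plugin can
 * see it while the call is pending. */
static void debug_visible_send(const void *buf, int count, MPI_Datatype type,
                               int dest, int tag, MPI_Comm comm)
{
    MPI_Request req;
    MPI_Isend(buf, count, type, dest, tag, comm, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}

static void debug_visible_recv(void *buf, int count, MPI_Datatype type,
                               int src, int tag, MPI_Comm comm, MPI_Status *status)
{
    MPI_Request req;
    MPI_Irecv(buf, count, type, src, tag, comm, &req);
    MPI_Wait(&req, status);
}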
The root cause is that the debugger plugin is not (yet) aware of an optimisation in mca_pml_ob1_send/mca_pml_ob1_recv: the request is now allocated on the stack instead of being retrieved from an ompi_free_list_t.
A possible solution is to add two new symbols
mca_pml_base_send_request_t *mca_pml_base_pending_send_request = NULL;
mca_pml_base_recv_request_t *mca_pml_base_pending_recv_request = NULL;
and make sure the iterator checks whether there is something there on top of the mca_pml_base_{send,recv}_requests lists (and use an array instead of a single pointer to support MPI_THREAD_MULTIPLE); see the toy sketch below.
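A purely illustrative toy of that idea (this is not OMPI code; the globals model the proposed mca_pml_base_pending_{send,recv}_request symbols, and a real MPI_THREAD_MULTIPLE build would need one slot per thread):

#include <stddef.h>
#include <stdio.h>

/* Stand-in for a blocking request descriptor. */
typedef struct pending_request {
    int peer;
    int tag;
} pending_request_t;

/* Well-known globals the debugger plugin could read in addition to the
 * mca_pml_base_{send,recv}_requests free lists. */
volatile pending_request_t *pending_send_request = NULL;
volatile pending_request_t *pending_recv_request = NULL;

static void blocking_send(int peer, int tag)
{
    pending_request_t req = { peer, tag };  /* allocated on the stack, as in ob1 */
    pending_send_request = &req;            /* publish while the send is pending */
    /* ... progress the communication until it completes ... */
    pending_send_request = NULL;            /* unpublish before the frame goes away */
}

int main(void)
{
    blocking_send(1, 42);
    printf("done\n");
    return 0;
}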
Another option is to add another MCA param so that MCA_PML_OB1_RECV_REQUEST_ALLOC is always used.
A variant is to use alloca when MPIR_being_debugged is false and MCA_PML_OB1_RECV_REQUEST_ALLOC otherwise.
So far, MPIR_being_debugged is always false, and I could not even find a callback (mqs_store_data?) to have it set automatically by mqs_setup_process.
@jsquyres Is there a way to have the debugger plugin automatically set MPIR_being_debugged? Or should I expect the debugger itself (e.g. DDT, TotalView, ...) to do that automatically? /* and assuming there is one, that does not handle messages sent/received before the task was attached by the debugger */
In the meantime, attached is a proof of concept (to be applied to OMPI v1.8). When starting an MPI program under the debugger, it is necessary to manually set MPIR_being_debugged to a non-zero value (unless, of course, the debugger is modified to do that automatically).
@alejandro-all Could you please give it a try and report any remaining issues?
diff --git a/ompi/mca/pml/ob1/pml_ob1_irecv.c b/ompi/mca/pml/ob1/pml_ob1_irecv.c
index 88174dd..71f1ec0 100644
--- a/ompi/mca/pml/ob1/pml_ob1_irecv.c
+++ b/ompi/mca/pml/ob1/pml_ob1_irecv.c
@@ -15,6 +15,8 @@
* Copyright (c) 2010-2012 Oracle and/or its affiliates. All rights reserved.
* Copyright (c) 2011 Sandia National Laboratories. All rights reserved.
* Copyright (c) 2014 Cisco Systems, Inc. All rights reserved.
+ * Copyright (c) 2015 Research Organization for Information Science
+ * and Technology (RIST). All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
@@ -28,6 +30,7 @@
#include "pml_ob1_recvfrag.h"
#include "ompi/peruse/peruse-internal.h"
#include "ompi/message/message.h"
+#include "ompi/debuggers/debuggers.h"
#if HAVE_ALLOCA_H
#include <alloca.h>
#endif /* HAVE_ALLOCA_H */
@@ -92,11 +95,17 @@ int mca_pml_ob1_recv(void *addr,
struct ompi_communicator_t *comm,
ompi_status_public_t * status)
{
- mca_pml_ob1_recv_request_t *recvreq =
- alloca(mca_pml_base_recv_requests.fl_frag_size);
+ mca_pml_ob1_recv_request_t *recvreq;
int rc;
- OBJ_CONSTRUCT(recvreq, mca_pml_ob1_recv_request_t);
+ if (MPIR_being_debugged) {
+ MCA_PML_OB1_RECV_REQUEST_ALLOC(recvreq);
+ if (NULL == recvreq)
+ return OMPI_ERR_TEMP_OUT_OF_RESOURCE;
+ } else {
+ recvreq = alloca(mca_pml_base_recv_requests.fl_frag_size);
+ OBJ_CONSTRUCT(recvreq, mca_pml_ob1_recv_request_t);
+ }
MCA_PML_OB1_RECV_REQUEST_INIT(recvreq, addr, count, datatype,
src, tag, comm, false);
@@ -113,8 +122,12 @@ int mca_pml_ob1_recv(void *addr,
}
rc = recvreq->req_recv.req_base.req_ompi.req_status.MPI_ERROR;
- MCA_PML_BASE_RECV_REQUEST_FINI(&recvreq->req_recv);
- OBJ_DESTRUCT(recvreq);
+ if (MPIR_being_debugged) {
+ MCA_PML_OB1_RECV_REQUEST_RETURN(recvreq);
+ } else {
+ MCA_PML_BASE_RECV_REQUEST_FINI(&recvreq->req_recv);
+ OBJ_DESTRUCT(recvreq);
+ }
return rc;
}
diff --git a/ompi/mca/pml/ob1/pml_ob1_isend.c b/ompi/mca/pml/ob1/pml_ob1_isend.c
index 66734b6..723f79b 100644
--- a/ompi/mca/pml/ob1/pml_ob1_isend.c
+++ b/ompi/mca/pml/ob1/pml_ob1_isend.c
@@ -28,6 +28,7 @@
#include "pml_ob1_sendreq.h"
#include "pml_ob1_recvreq.h"
#include "ompi/peruse/peruse-internal.h"
+#include "ompi/debuggers/debuggers.h"
#if HAVE_ALLOCA_H
#include <alloca.h>
#endif /* HAVE_ALLOCA_H */
@@ -189,8 +190,7 @@ int mca_pml_ob1_send(void *buf,
ompi_proc_t *dst_proc = ompi_comm_peer_lookup (comm, dst);
mca_bml_base_endpoint_t* endpoint = (mca_bml_base_endpoint_t*)
dst_proc->proc_endpoints[OMPI_PROC_ENDPOINT_TAG_BML];
- mca_pml_ob1_send_request_t *sendreq =
- alloca(mca_pml_base_send_requests.fl_frag_size);
+ mca_pml_ob1_send_request_t *sendreq;
int16_t seqn;
int rc;
@@ -222,7 +222,14 @@ int mca_pml_ob1_send(void *buf,
}
}
- OBJ_CONSTRUCT(sendreq, mca_pml_ob1_send_request_t);
+ if (MPIR_being_debugged) {
+ MCA_PML_OB1_SEND_REQUEST_ALLOC(comm, dst, sendreq);
+ if (NULL == sendreq)
+ return OMPI_ERR_TEMP_OUT_OF_RESOURCE;
+ } else {
+ sendreq = alloca(mca_pml_base_send_requests.fl_frag_size);
+ OBJ_CONSTRUCT(sendreq, mca_pml_ob1_send_request_t);
+ }
sendreq->req_send.req_base.req_proc = dst_proc;
sendreq->src_des = NULL;
@@ -242,9 +249,15 @@ int mca_pml_ob1_send(void *buf,
ompi_request_wait_completion(&sendreq->req_send.req_base.req_ompi);
rc = sendreq->req_send.req_base.req_ompi.req_status.MPI_ERROR;
- MCA_PML_BASE_SEND_REQUEST_FINI(&sendreq->req_send);
+ if (MPIR_being_debugged) {
+ MCA_PML_OB1_SEND_REQUEST_RETURN(sendreq);
+ } else {
+ MCA_PML_BASE_RECV_REQUEST_FINI(&sendreq->req_send);
+ }
+ }
+ if (!MPIR_being_debugged) {
+ OBJ_DESTRUCT(sendreq);
}
- OBJ_DESTRUCT(sendreq);
return rc;
}
@ggouaillardet there is a comment in orterun.c:2421 about the integration of MPIR_being_debugged and the different parallel debuggers available.
I'm not keen on having changes in behavior of ompi - esp. in the critical path for messages - when using a debugger.
I think @ggouaillardet's patch is in the right spirit, but not quite correct (I have a license for DDT, so I can test the message queue functionality). He's right that the optimization to alloca() is not putting the pending requests in a place that the debugger can see.
However, I cannot find any variable that is set by the debugger in the MPI processes indicating that a debugger has attached (MPIR_being_debugged and friends are set in mpirun, not in the MPI processes). According to http://www.mpi-forum.org/docs/mpir-specification-10-11-2010.pdf page 13, the only two variables set in the MPI processes are MPIR_debug_gate and MPIR_debug_dll, neither of which is useful here (MPIR_debug_gate is only used in MPI_INIT; it is not used if the debugger attaches after MPI_INIT has completed).
I talked to @rhc54 briefly this afternoon, and he swears that there should be a volatile variable set in the MPI process indicating whether a debugger is currently attached or not. But I can't find any such variable (I even tried creating MPIR_debugger_attached, and it was never set to 1 by DDT).
However, I think there may be a workaround: in mqs_setup_process() in ompi_msgq_dll.c, we can set an OMPI-specific global volatile variable to 1 in the MPI process -- i.e., the equivalent of ompi_debugger_attached=1. And then mqs_destroy_process_info() can reset this value back to 0 -- i.e., the equivalent of ompi_debugger_attached=0.
I'll see if I can make this work.
If it works, it would allow us to use something similar to @ggouaillardet's patch. However, there are two problems with the patch as it is:
1. Shouldn't the MCA_PML_BASE_RECV_REQUEST_FINI in the final hunk be MCA_PML_BASE_SEND_REQUEST_FINI? I think that's a typo in the patch.
2. A request could be OB1_SEND_REQUEST_ALLOCed but then OB1_SEND_REQUEST_FINIed (instead of RETURNed), or vice versa. I think that when the request is created, it needs to save a flag indicating whether it was malloc'ed or alloca'ed, so that it knows how to free itself when it is done (see the toy sketch after this comment).
...well, it turns out that that was a lovely theory, but there's no mqs functionality to set a value in a process (you can only read values). :-\
So I'm not quite sure what the right answer is here. We know what the problem is: the pending blocking requests are not being put on the global lists (because it's useful/faster not to do so). Should we have an MCA param to disable this behavior (i.e., effectively s/MPIR_being_debugged/ompi_want_debugger_msgq/, or somesuch)?
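A toy illustration of that "save a flag" idea (hypothetical names, not the actual ob1 structures): the request records how it was allocated, and cleanup dispatches on that record rather than on the current value of the global, so a mid-run change cannot pair the wrong alloc/free:

#include <alloca.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct request {
    bool from_free_list;            /* recorded once, at allocation time */
} request_t;

/* stand-in for MPIR_being_debugged or the proposed MCA parameter */
static volatile int debugger_wants_msgq = 0;

static void request_release(request_t *req)
{
    /* dispatch on how the request was obtained, not on the global flag */
    if (req->from_free_list) {
        free(req);                  /* stand-in for *_REQUEST_RETURN */
    }                               /* stack-allocated requests need no free */
}

static int blocking_send(void)
{
    request_t *req;
    if (debugger_wants_msgq) {
        req = malloc(sizeof(*req)); /* stand-in for MCA_PML_OB1_SEND_REQUEST_ALLOC */
        if (NULL == req) return -1;
        req->from_free_list = true;
    } else {
        req = alloca(sizeof(*req)); /* stand-in for the on-stack fast path */
        req->from_free_list = false;
    }
    /* ... start the request and wait for completion ... */
    request_release(req);
    return 0;
}

int main(void)
{
    blocking_send();                /* fast path */
    debugger_wants_msgq = 1;
    blocking_send();                /* debugger-visible path */
    puts("done");
    return 0;
}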
I think the env. variable approach is okay. What I would not want, and it sounds like it's not possible anyway, would be for OMPI behavior to change under the covers when run with a debugger attached.
Sent from my smart phone, so please excuse the typos.
Howard
@bosilca already pointed to orterun.c:2421; it seems MPIR_being_debugged is not required to be set by the debugger (TotalView does it, DDT does not).
Anyway, MPIR_being_debugged is only good for tasks started under the debugger (e.g., if I attach to an MPI app to debug a deadlock, some send/recv might be missing from the debugger queue).
I will revamp my patch, investigate the typo, and add a new MCA variable.
Well, the MCA var is fine - for when you are launched via mpirun and know you are being debugged. However, it is useless for the case where you are trying to attach a debugger to a running job. So as long as we somehow explain to users that attaching to a running job might not work, this is okay.
I'd rather put it this way: before running an app, if the user thinks he/she might want to attach a debugger to it, then use the MCA param to make sure no send/recv will be missing. If not, then some send/recv might be missing.
Another option is to maintain a "list" of these requests to make sure they will never be missing: without MPI_THREAD_MULTIPLE, the "list" is just a pointer; with MPI_THREAD_MULTIPLE, the "list" can be a doubly linked list or an array of pointers (indexed by thread id). Of course, there is some overhead with this approach, and it conflicts with the alloca optimization... A toy sketch of the idea follows.
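This is a toy sketch only (hypothetical names, not OMPI code): each blocking call links its on-stack request into a global doubly linked list before waiting and unlinks it afterwards, so a debugger that attaches at any point can walk the list. The mutex models what MPI_THREAD_MULTIPLE would require and is exactly the overhead mentioned above.

#include <pthread.h>
#include <stdio.h>

typedef struct dbg_request {
    struct dbg_request *prev, *next;
    int peer, tag;
} dbg_request_t;

static dbg_request_t *dbg_pending_head = NULL;   /* the debugger would read this */
static pthread_mutex_t dbg_lock = PTHREAD_MUTEX_INITIALIZER;

static void dbg_push(dbg_request_t *req)
{
    pthread_mutex_lock(&dbg_lock);
    req->prev = NULL;
    req->next = dbg_pending_head;
    if (dbg_pending_head) dbg_pending_head->prev = req;
    dbg_pending_head = req;
    pthread_mutex_unlock(&dbg_lock);
}

static void dbg_pop(dbg_request_t *req)
{
    pthread_mutex_lock(&dbg_lock);
    if (req->prev) req->prev->next = req->next;
    else           dbg_pending_head = req->next;
    if (req->next) req->next->prev = req->prev;
    pthread_mutex_unlock(&dbg_lock);
}

static void blocking_send(int peer, int tag)
{
    dbg_request_t req = { NULL, NULL, peer, tag };  /* stays on the stack */
    dbg_push(&req);
    /* ... wait for completion ... */
    dbg_pop(&req);
}

int main(void)
{
    blocking_send(1, 7);
    puts("done");
    return 0;
}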
I'll bring this up at the call tomorrow - this is a very significant change in behavior that is going to cause some considerable upset. Most users attach debuggers when their app runs into a problem - requiring that they kill the app and restart it under the debugger means that it may not reproduce the problem.
We should find a better solution, even if it means dumping that optimization.
I agree with Ralph's last comment. Excluding the vader btl, what does this optimization buy us anyway?
@hppritcha This optimization is in pml/ob1, so it is BTL-agnostic.
Here is the related git log (the first part is relevant here).
@hjelmn feel free to comment ...
commit 2b57f4227ebfcfd5f86fdb6ac37c6bb81cecf521
Author: Nathan Hjelm <hjelmn@lanl.gov>
Date: Tue Jan 21 15:16:21 2014 +0000
ob1: optimize blocking send and receive paths
Per RFC. There are two optimizations in this commit:
- Allocate requests for blocking sends and receives on the stack. This
bypasses the request free list and saves two atomics on the critical path.
This change improves the small message ping-pong by 50-200ns on both AMD
and Intel CPUs.
- For small messages try to use the btl sendi function before intializing a
send request. If the sendi fails or the btl does not have a sendi function
silently fallback on the standard send path.
cmr=v1.7.5:reviewer=brbarret
This commit was SVN r30343.
@ggouaillardet We talked about this on the call today. Here's the suggested resolution:
Branch topic/message_queues in @bosilca/ompi. It should work in the single-threaded case; the thread-multiple fix will come shortly. However, due to the existence of the immediate send, there are cases where a request is not even allocated in the send path, so that case is not covered at all. One should note that immediate sends are supposed to be extremely quick when they succeed, so this might not be a big issue.
@bosilca Will you be bringing this across soon? I'm being asked about when 1.8.6 is going to be released, and I would like to do rc2 this weekend if possible.
I'm waiting for someone (hint: @jsquyres) to validate the existing patch. I have some trouble with tv8 in my environment; I would definitely appreciate some help with the testing.
If you point me to the patched source code I can test this on DDT.
It was in the earlier note: https://github.com/bosilca/ompi/tree/topic/message_queues
Sorry for the delay. I'm compiling / testing now...
The functionality seems to work with single-threaded MPI apps (excluding unexpected messages, of course... we should file a ticket for that...).
I left some minor comments on the commit itself, and asked about THREAD_MULTIPLE support: don't we need to handle that case (perhaps sub-optimally) on v1.8/v1.10, too?
Support for THREAD_MULTIPLE has been added, as well as OPAL_UNLIKELY.
Looks good; thanks. Will you file a PR for v1.8 and v1.10?
Feel free to do the PR, I'll be off the grid 'til Sunday.
@bosilca @jsquyres I have it queued up as a PR to master. Will PR to the release branches once that is merged.
Fixed by #641 on master. Ralph is filing separate PRs for v1.8 and v1.10.
1.8.6 tested on Allinea DDT and the issue is fixed. Thank you guys!
We (Allinea) are experiencing a problem when debugging the message queues.
Summary: Message queue debugging is broken on recent Open MPI versions.
Affected Open MPI versions: 1.8.3, 1.8.4 and 1.8.5 (at least). The debug message queue library is not returning any pending messages, even though some processes are blocked in MPI_Recv; this does not happen with previous versions of Open MPI.
The queue to iterate over (i.e., which message queue we want) is selected by the enum 'mqs_op_class'; its values are mqs_pending_sends, mqs_pending_receives and mqs_unexpected_messages. When setting up the corresponding queue iterator (with "mqs_setup_operation_iterator"), the return value is zero (== mqs_ok), but when we try to read the list (with "mqs_next_operation"), the return value is always non-zero, so no pending messages are found.
This happens on different platforms, and the same symptoms can be seen in both TotalView and Allinea DDT.
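For context, a hedged sketch of the iteration described above, as a debugger would drive it. This assumes the MPIR message-queue dumping header (mpi_interface.h), an already set-up mqs_process handle, and elides the per-communicator iteration; it is not the actual DDT or TotalView code.

#include "mpi_interface.h"   /* debugger-provided MQD interface header */

/* Count the pending operations of one class (e.g. mqs_pending_receives) for
 * the communicator currently selected in 'process'. */
static int count_pending(mqs_process *process, int op_class)
{
    mqs_pending_operation op;
    int n = 0;

    /* per the report: this call returns mqs_ok ... */
    if (mqs_setup_operation_iterator(process, op_class) != mqs_ok)
        return -1;

    /* ... but this one returns non-zero immediately on 1.8.3-1.8.5, so n
     * stays 0 even though ranks are blocked in MPI_Recv */
    while (mqs_next_operation(process, &op) == mqs_ok)
        n++;

    return n;
}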