open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Petsc test failing: possible MPI_REQUEST_FREE issue #1875

Closed jsquyres closed 8 years ago

jsquyres commented 8 years ago

According to Eric Chamberland in http://www.open-mpi.org/community/lists/devel/2016/07/19210.php, he's getting a failure in a PETSc test. Here's the backtrace:

*** Error in `/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/bin/Test.ProblemeGD.opt': free(): invalid pointer: 0x00007f9ab09c6020 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x7277f)[0x7f9ab019b77f]
/lib64/libc.so.6(+0x78026)[0x7f9ab01a1026]
/lib64/libc.so.6(+0x78d53)[0x7f9ab01a1d53]
/opt/openmpi-2.x_opt/lib/openmpi/mca_pml_ob1.so(+0x172a1)[0x7f9aa3df32a1]
/opt/openmpi-2.x_opt/lib/libmpi.so.0(MPI_Request_free+0x4c)[0x7f9ab0761dac]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(+0x4adaf9)[0x7f9ab7fa2af9]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(VecScatterDestroy+0x68d)[0x7f9ab7f9dc35]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(+0x4574e7)[0x7f9ab7f4c4e7]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(VecDestroy+0x648)[0x7f9ab7ef28ca]
/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/lib/libgiref_opt_Petsc.so(_Z15GIREFVecDestroyRP6_p_Vec+0xe)[0x7f9abc9746de]
/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/lib/libgiref_opt_Petsc.so(_ZN12VecteurPETScD1Ev+0x31)[0x7f9abca8bfa1]
/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/lib/libgiref_opt_Petsc.so(_ZN10SolveurGCPD2Ev+0x20c)[0x7f9abc9a013c]
/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/lib/libgiref_opt_Petsc.so(_ZN10SolveurGCPD0Ev+0x9)[0x7f9abc9a01f9]
/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/lib/libgiref_opt_Formulation.so(_ZN10ProblemeGDD2Ev+0x42)[0x7f9abeeb94e2]
/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/bin/Test.ProblemeGD.opt[0x4159b9]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f9ab014ab25]
/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/bin/Test.ProblemeGD.opt[0x4084dc]

@hjelmn @bosilca Could you have a look?

rhc54 commented 8 years ago

I don't know about the overall objective, but we do ship a "libompitrace" that prints out the name of each MPI call when executed so you can "trace" your MPI usage. Might be possible to extend that capability to do what you ask.
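
For what it's worth, using it is mostly a link-time affair; a rough sketch (the install prefix is the one from the backtrace above, and the link order is an assumption since the library presumably interposes through the PMPI profiling layer):

# a sketch, not a recipe
mpicxx -o Test.ProblemeGD.opt <objects...> -L/opt/openmpi-2.x_opt/lib -lompitrace
mpirun -np 2 ./Test.ProblemeGD.opt    # each rank then prints one line per traced MPI call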

ericch1 commented 8 years ago

OK, I just linked with "libompitrace"...

The objective would be to transform the output from:

MPI_ISEND[0]: buf 7f746a6c8280 count 3 datatype dest 1 tag 123454321 comm MPI_COMM_WORLD

to something like:

the_good_datatype buf[3] = {the sent buffer};
MPI_Request request;

MPI_Isend(buf, 3, the_good_MPI_datatype, 1, tag, MPI_COMM_WORLD, &request);

and then produce fully functional C code that reproduces the sequence of MPI calls (and hence the bugs...).

I can't tell if it is feasible, but it might be worth thinking about...?
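
Roughly, the generated reproducer for the MPI_ISEND line above could look like this (a hand-written sketch; the datatype and the buffer contents are guesses, since the trace does not record them):

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[3] = {0.0, 0.0, 0.0};   /* "the sent buffer" captured by the tracer */
    MPI_Request request;

    if (rank == 0) {
        /* replay of: MPI_ISEND[0]: count 3 dest 1 tag 123454321 comm MPI_COMM_WORLD */
        MPI_Isend(buf, 3, MPI_DOUBLE, 1, 123454321, MPI_COMM_WORLD, &request);
        MPI_Wait(&request, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        MPI_Recv(buf, 3, MPI_DOUBLE, 0, 123454321, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}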

ericch1 commented 8 years ago

Tested the "master" branch to test the merged patch, but all tests are faulty because of this problem:

--------------------------------------------------------------------------
The PMIx server was unable to setup a rendezvous file due to your
system's restriction for Unix's socket's path-length.
   Temporary directory: /pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/tmp/ompi.lorien.40031/pid.57717
   Rendezvous filename: /pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/tmp/ompi.lorien.40031/pid.57717/pmix.lorien.tool.57717
Please try to set TMPDIR to something short (like /tmp) or change
your computer's name to something shorter (see uname -n).
--------------------------------------------------------------------------
[lorien:57703] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 593
[lorien:57703] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 168
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[lorien:57703] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

:(

rhc54 commented 8 years ago

Well, whatever system you are on has an unfortunately long path set to TMPDIR. As the error says, just set TMPDIR to something shorter so it can fit into a standard sockaddr_un.
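
For what it's worth, the limit comes from the size of sun_path in struct sockaddr_un (typically 108 bytes on Linux); a quick check with the rendezvous filename from the error message:

#include <stdio.h>
#include <string.h>
#include <sys/un.h>

int main(void)
{
    struct sockaddr_un sa;
    const char *path = "/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/tmp/"
                       "ompi.lorien.40031/pid.57717/pmix.lorien.tool.57717";
    /* if the path does not fit in sun_path (including the NUL), bind() cannot succeed */
    printf("path length: %zu, sun_path capacity: %zu\n", strlen(path), sizeof(sa.sun_path));
    return 0;
}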

ericch1 commented 8 years ago

Yeah, I can set it to a shorter path, but I haven't changed that TMPDIR variable in a long time and it was working before... is this a new limitation? It doesn't appear in 2.0.1...

ericch1 commented 8 years ago

Is the verification of the path length new? I am wondering whether it could be related to the bug we are chasing here...?

ericch1 commented 8 years ago

Just tested with a shorter path: it now starts OK, but the bug reported here still occurs... :/

ericch1 commented 8 years ago

Hi,

I just reformatted the output from the "ompitrace" lib into files; maybe we can discover something there.

Here are the traces from the 2 processes for the working case (Open MPI 1.10.2): http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/ompitrace_v182.txt.formated.0 http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/ompitrace_v182.txt.formated.1

Here are the traces from the 2 processes for Open MPI 2.0.x (actually ompi/master): http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/ompitrace_v2x.txt.formated.0 http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/ompitrace_v2x.txt.formated.1

The major differences are at the very end...

ericch1 commented 8 years ago

About ompitrace: I was looking at the ompitrace output to find traces of the MPI_*_init calls and nothing appears. Did I link incorrectly with ompitrace, or are not all calls logged?

@knepley, I also found a:

PETSC_HAVE_BROKEN_REQUEST_FREE

flag that maybe could help until all this is fixed?

knepley commented 8 years ago

Yep, that is an option. You could run out of resources in a long run, but it's better than crashing up front.

Matt

ericch1 commented 8 years ago

@ggouaillardet OK, I printed what you asked right in the PETSc code. Just before the last MPI_Request_free call, the count is still 1...

#if !defined(PETSC_HAVE_BROKEN_REQUEST_FREE)
  if (!to->use_alltoallv && !to->use_window) {   /* currently the to->requests etc are ALWAYS allocated even if not used */
    if (to->requests) {
      for (i=0; i<to->n; i++) {
        printf("ERIC DBG count = %d\n", ((struct dummy_object_t*)MPI_DOUBLE)->count);
        fflush(stdout);
        ierr = MPI_Request_free(to->requests + i);CHKERRQ(ierr);
        printf("ERIC DBG2 count = %d\n", ((struct dummy_object_t*)MPI_DOUBLE)->count);
        fflush(stdout);

so the last thing printed is: ERIC DBG count = 1 and "ERIC DBG2" never shows up...

ggouaillardet commented 8 years ago

@ericch1 are you saying the count decreases from n to 1 and the app crashes before it prints ERIC DBG2 count = 0 ?

ericch1 commented 8 years ago

Precisely!

I will trace more info with the debugger...

Eric

hjelmn commented 8 years ago

OK, that is odd. The count should be at least 2: 1 for the original OBJ_CONSTRUCT and 1 for the OBJ_RETAIN done by each request.
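
In other words, the convention is roughly this (a simplified model; the names here are illustrative rather than Open MPI's actual definitions):

typedef struct { int count; } obj_t;

static void obj_construct(obj_t *o) { o->count = 1; }   /* base reference, never released for a predefined type */
static void obj_retain(obj_t *o)    { o->count++; }     /* +1 for every request that holds the datatype */
static void obj_release(obj_t *o)                       /* -1; the destructor runs at zero */
{
    if (--o->count == 0) {
        /* for a predefined datatype like MPI_DOUBLE this must never happen,
           which is why a count of 1 just before MPI_Request_free is suspicious */
    }
}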

ggouaillardet commented 8 years ago

Yep, and that suggests persistent requests are unlikely to be the root cause; more likely they are the symptom.

ericch1 commented 8 years ago

Here is the backtrace from ddd:

#0  0x00007fffde1eb0c7 in raise () from /lib64/libc.so.6
#1  0x00007fffde1ec478 in abort () from /lib64/libc.so.6
#2  0x00007fffde228784 in __libc_message () from /lib64/libc.so.6
#3  0x00007fffde22e026 in malloc_printerr () from /lib64/libc.so.6
#4  0x00007fffde22ed53 in _int_free () from /lib64/libc.so.6
#5  0x00007fffdda0d784 in opal_free (addr=0x7fffdef0aba0 , file=0x7fffd1830088 "pml_ob1_recvreq.h", line=133) at malloc.c:184
#6  0x00007fffd1823833 in mca_pml_ob1_recv_request_fini (recvreq=0xff8480) at pml_ob1_recvreq.h:133
#7  0x00007fffd1823e5b in mca_pml_ob1_recv_request_free (request=0x1686b40) at pml_ob1_recvreq.c:90
#8  0x00007fffdec4e5ec in ompi_request_free (request=0x1686b40) at ../../../../ompi/request/request.h:362
#9  0x00007fffdec4e6e5 in PMPI_Request_free (request=0x1686b40) at prequest_free.c:59
#10 0x00007fffe64f0f75 in VecScatterDestroy_PtoP (ctx=0x1681950) at /home/mefpp_ericc/depots_prepush/ompi2_pr_1875/open-mpi-v2.x-debug/petsc-3.7.3-debug/src/vec/vec/utils/vpscat.c:237
#11 0x00007fffe64ebe2a in VecScatterDestroy (ctx=0x1673e48) at /home/mefpp_ericc/depots_prepush/ompi2_pr_1875/open-mpi-v2.x-debug/petsc-3.7.3-debug/src/vec/vec/utils/vscat.c:1870
#12 0x00007fffe649a67e in VecDestroy_MPI (v=0x1670e20) at /home/mefpp_ericc/depots_prepush/ompi2_pr_1875/open-mpi-v2.x-debug/petsc-3.7.3-debug/src/vec/vec/impls/mpi/pdvec.c:25
#13 0x00007fffe6440981 in VecDestroy (v=0x165c2c0) at /home/mefpp_ericc/depots_prepush/ompi2_pr_1875/open-mpi-v2.x-debug/petsc-3.7.3-debug/src/vec/vec/interface/vector.c:432
ggouaillardet commented 8 years ago

@ericch1 is this the full backtrace? If yes, then this is not thread 0 (e.g. main)

Open MPI does not check that only one MPI subroutine is invoked at a time unless MPI_THREAD_MULTIPLE is used. Could the root cause be invalid usage of MPI, hence a race condition, and hence a crash? Generally speaking, should we implement such a check?
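
Such a check could be as simple as a flag that is set on entry to each MPI call and cleared on exit; a rough sketch of the idea (not actual Open MPI code):

#include <stdatomic.h>
#include <stdio.h>

static atomic_flag in_mpi_call = ATOMIC_FLAG_INIT;

static void mpi_call_enter(const char *func)
{
    /* if the flag is already set, two threads are inside MPI at the same time
       even though the application did not request MPI_THREAD_MULTIPLE */
    if (atomic_flag_test_and_set(&in_mpi_call)) {
        fprintf(stderr, "warning: %s called concurrently without MPI_THREAD_MULTIPLE\n", func);
    }
}

static void mpi_call_exit(void)
{
    atomic_flag_clear(&in_mpi_call);
}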

ericch1 commented 8 years ago

Here is the full backtrace:

(gdb) bt
#0  0x00007fffde1eb0c7 in raise () from /lib64/libc.so.6
#1  0x00007fffde1ec478 in abort () from /lib64/libc.so.6
#2  0x00007fffde228784 in __libc_message () from /lib64/libc.so.6
#3  0x00007fffde22e026 in malloc_printerr () from /lib64/libc.so.6
#4  0x00007fffde22ed53 in _int_free () from /lib64/libc.so.6
#5  0x00007fffdda0d784 in opal_free (addr=0x7fffdef0aba0 , file=0x7fffd1830088 "pml_ob1_recvreq.h", line=133) at malloc.c:184
#6  0x00007fffd1823833 in mca_pml_ob1_recv_request_fini (recvreq=0xff8480) at pml_ob1_recvreq.h:133
#7  0x00007fffd1823e5b in mca_pml_ob1_recv_request_free (request=0x1686b40) at pml_ob1_recvreq.c:90
#8  0x00007fffdec4e5ec in ompi_request_free (request=0x1686b40) at ../../../../ompi/request/request.h:362
#9  0x00007fffdec4e6e5 in PMPI_Request_free (request=0x1686b40) at prequest_free.c:59
#10 0x00007fffe64f0f75 in VecScatterDestroy_PtoP (ctx=0x1681950) at /home/mefpp_ericc/depots_prepush/ompi2_pr_1875/open-mpi-v2.x-debug/petsc-3.7.3-debug/src/vec/vec/utils/vpscat.c:237
#11 0x00007fffe64ebe2a in VecScatterDestroy (ctx=0x1673e48) at /home/mefpp_ericc/depots_prepush/ompi2_pr_1875/open-mpi-v2.x-debug/petsc-3.7.3-debug/src/vec/vec/utils/vscat.c:1870
#12 0x00007fffe649a67e in VecDestroy_MPI (v=0x1670e20) at /home/mefpp_ericc/depots_prepush/ompi2_pr_1875/open-mpi-v2.x-debug/petsc-3.7.3-debug/src/vec/vec/impls/mpi/pdvec.c:25
#13 0x00007fffe6440981 in VecDestroy (v=0x165c2c0) at /home/mefpp_ericc/depots_prepush/ompi2_pr_1875/open-mpi-v2.x-debug/petsc-3.7.3-debug/src/vec/vec/interface/vector.c:432
#14 0x00007fffebf798bc in GIREFVecDestroy (pVec=@0x165c2c0: 0x1670e20) at /home/mefpp_ericc/depots_prepush/GIREF/src/commun/Petsc/girefConfigurationPETSc.h:115
#15 0x00007fffec0f9a25 in VecteurPETSc::detruitObjetPETSc (this=0x165c250) at /home/mefpp_ericc/depots_prepush/GIREF/src/commun/Petsc/VecteurPETSc.cc:2293
#16 0x00007fffec0f981e in VecteurPETSc::~VecteurPETSc (this=0x165c250) at /home/mefpp_ericc/depots_prepush/GIREF/src/commun/Petsc/VecteurPETSc.cc:288
#17 0x00007fffec0f9a59 in VecteurPETSc::~VecteurPETSc (this=0x165c250) at /home/mefpp_ericc/depots_prepush/GIREF/src/commun/Petsc/VecteurPETSc.cc:282
#18 0x00007fffeb7d88ec in PPReactionsAppuiEL3D::~PPReactionsAppuiEL3D (this=0x7ffffffeb018) at /home/mefpp_ericc/depots_prepush/GIREF/src/commun/PrePostTraitement/PPReactionsAppuiEL3D.cc:216
#19 0x00007fffefca7c6b in ProblemeGD::~ProblemeGD (this=0x7ffffffdde50) at /home/mefpp_ericc/depots_prepush/GIREF/src/commun/Formulation/ProblemeGD.cc:573
#20 0x00000000004333ce in main (pArgc=2, pArgv=0x7fffffffa338) at /home/mefpp_ericc/depots_prepush/GIREF/app/src/Test.Formulation/Test.ProblemeGD.icc:205
(gdb) 
knepley commented 8 years ago

We definitely do not use threads in PETSc. I don't think Eric is using threads.

Matt

bosilca commented 8 years ago

@ericch1 what PETSc test should I run to reproduce this issue?

ericch1 commented 8 years ago

It's not a simple PETSc test; it is a test from our non-regression test database that runs nightly on each architecture we support (we have 22 right now)... and it fails only with Open MPI 2.0.x.

This is why I tried to give the log from ompitrace (see above) and now the backtrace...

bosilca commented 8 years ago

The trace is too sparse to replicate (missing pointers to the request, the calls to test or wait and so on). I created all the possible combinations with persistent requests I could think of, and I can't reproduce this issue.
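
For reference, the basic persistent-request pattern in question is (a generic sketch inside an MPI program, not the exact PETSc sequence):

/* after MPI_Init, with at least 2 ranks */
double buf[3];
MPI_Request req;

MPI_Send_init(buf, 3, MPI_DOUBLE, /*dest=*/1, /*tag=*/123454321, MPI_COMM_WORLD, &req);
for (int iter = 0; iter < 10; ++iter) {
    MPI_Start(&req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}
MPI_Request_free(&req);   /* the call that crashes in VecScatterDestroy above */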

jsquyres commented 8 years ago

Would it be helpful for me to set up a WebEx so you guys can screen-share in an interactive debugging session?

ericch1 commented 8 years ago

Why not! My 2 debuggers are already open :)

We could just try to see if we can do something this way!

ericch1 commented 8 years ago

Here are 2 snapshots from the debugger: one with the last line called before the "free", and one with the structure req_base partly expanded. Just tell me what could be insightful... (screenshots: ddd1, ddd2)

jsquyres commented 8 years ago

I have no idea what George's availability is like -- here's a Doodle: http://doodle.com/poll/sh4dvn3wzdsk883t

If you guys can find a common time when you're both available, I'll set up the WebEx.

hjelmn commented 8 years ago

Might be worth putting a watchpoint on ((struct dummy_object_t*)MPI_DOUBLE)->count and see where it changes. Should start as uninitialized, go to 1 on MPI_Init, then go up by one on each MPI_Recv/Send_init, and go down by one on each MPI_Request_free. If there are any other OBJ_RELEASE calls they are worth looking at.
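
Something along these lines in gdb should do it (a sketch: the dummy_object_t cast is the same debugging hack already patched into vpscat.c, and gdb can only expand the MPI_DOUBLE macro if macro debug info such as -g3 is available; otherwise substitute the pointer value printed by the debug printfs):

(gdb) break MPI_Init
(gdb) run
(gdb) finish
# back in main with MPI initialized, the count should now be 1
(gdb) watch -l ((struct dummy_object_t *) MPI_DOUBLE)->count
(gdb) continue
# gdb now stops on every change of the count; 'bt' at each stop shows who touched it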

ericch1 commented 8 years ago

OK, I just added some printfs into PETSc... before each MPI_*_init call I saw in vpscat.c:

      } else {
        ierr = MPI_Send_init(Ssvalues+bs*sstarts[i],bs*sstarts[i+1]-bs*sstarts[i],MPIU_SCALAR,sprocs[i],tag,comm,swaits+i);CHKERRQ(ierr);
        printf("ERIC DBG MPI_Send_init to->requests(swaits) %p vpscat.c:%d\n", (void*)(rev_swaits+i), __LINE__);fflush(stdout);
      }
...

now just grep the output: grep ERIC ...:

ERIC DBG MPI_Send_init to->requests(swaits) (nil) vpscat.c:2731
ERIC DBG MPI_Recv_init to->rev_rwaits (nil) vpscat.c:2741
ERIC DBG MPI_Send_init from->rev_requests 0x3341740 vpscat.c:2718
ERIC DBG MPI_Send_init to->requests(swaits) 0x3341740 vpscat.c:2731
ERIC DBG MPI_Recv_init from->requests (rwaits) 0x3341740 vpscat.c:2737
ERIC DBG MPI_Recv_init to->rev_rwaits 0x3341740 vpscat.c:2741
ERIC DBG MPI_Send_init to->requests(swaits) (nil) vpscat.c:2731
ERIC DBG MPI_Recv_init to->rev_rwaits (nil) vpscat.c:2741
ERIC DBG MPI_Send_init to->requests(swaits) (nil) vpscat.c:2731
ERIC DBG MPI_Recv_init to->rev_rwaits (nil) vpscat.c:2741
ERIC DBG0 count = 11
ERIC DBG0 count = 11
ERIC DBG0 count = 11
ERIC DBG0 count = 11
ERIC DBG0 count = 5
ERIC DBG0 count = 1
ERIC DBG01 count = 1
ERIC DBG0 count = 1
ERIC DBG count = 1
ERIC DBG MPI_Request_free to->requests 0x32bef20 vpscat.c:211

and found that MPI_Request_free is called on a pointer that seems never to have been allocated by any MPI_*_init...

Here is the patch applied to vpscat.c.

patched_vpscat.c.txt

@knepley Did I miss a request creation?

ericch1 commented 8 years ago

Doh! There are errors in my printfs.. ignore the preceding output... correcting...

ericch1 commented 8 years ago

Phew... better, now the pointer of the allocated request is there; see the corrected output:

ERIC DBG MPI_Send_init to->requests+0(swaits) (0x307cf90) 0x307c860 vpscat.c:2732
ERIC DBG MPI_Recv_init to->rev_rwaits+0 ((nil)) 0x307c860 vpscat.c:2742
ERIC DBG MPI_Send_init from->rev_requests+0 (0x30ff1a0) 0x30ff860 vpscat.c:2719
ERIC DBG MPI_Send_init to->requests+0(swaits) (0x3103970) 0x3103240 vpscat.c:2732
ERIC DBG MPI_Recv_init from->requests+0 (rwaits) (0x30fff90) 0x30ff860 vpscat.c:2738
ERIC DBG MPI_Recv_init to->rev_rwaits+0 (0x30ff1a0) 0x3103240 vpscat.c:2742
ERIC DBG MPI_Send_init to->requests+0(swaits) (0x311ab70) 0x311a440 vpscat.c:2732
ERIC DBG MPI_Recv_init to->rev_rwaits+0 ((nil)) 0x311a440 vpscat.c:2742
ERIC DBG MPI_Send_init to->requests+0(swaits) (0x31275d0) 0x3126ea0 vpscat.c:2732
ERIC DBG MPI_Recv_init to->rev_rwaits+0 ((nil)) 0x3126ea0 vpscat.c:2742
ERIC DBG0 count = 11
ERIC DBG0 count = 11
ERIC DBG0 count = 11
ERIC DBG0 count = 11
ERIC DBG0 count = 5
ERIC DBG0 count = 1
ERIC DBG01 count = 1
ERIC DBG0 count = 1
ERIC DBG count = 1
ERIC DBG MPI_Request_free to->requests+0(0x307cf90) to(0x307c860) vpscat.c:211

and the corrected patch attached. patched_vpscat.c.txt

Sorry @knepley ...

hjelmn commented 8 years ago

Can you add -mca orte_tag_output 1 to the mpirun line? That will tag each line with the rank.
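
i.e. something like this (the binary name is just taken from the backtrace above):

mpirun -np 2 --mca orte_tag_output 1 ./Test.ProblemeGD.opt
# every stdout/stderr line is then prefixed with the emitting rank,
# so the two ranks' "ERIC DBG" lines can be told apart in a single log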

ericch1 commented 8 years ago

but this is only rank 0's output... we separate the output from each process...

I am coming back with more (interesting?) output...

Eric

ericch1 commented 8 years ago

OK, nothing very new with the additional printfs to catch other MPI calls in VecScatterDestroy...

Here is the output for both processes (ranks 0 and 1):

[eric@lorien] Test.ProblemeGD (master $ u+1)> grep ERIC Resultats.hooke_3d_pen_8Hexa8_parallele/Cout.hooke_3d_pen_8Hexa8_parallele.n00.txt 
ERIC DBG MPI_Send_init to->requests+0(swaits) (0x202ff50) 0x202f820 vpscat.c:2738
ERIC DBG MPI_Recv_init to->rev_rwaits+0 ((nil)) 0x202f820 vpscat.c:2748
ERIC DBG MPI_Send_init from->rev_requests+0 (0x20b2750) 0x20b2e10 vpscat.c:2725
ERIC DBG MPI_Send_init to->requests+0(swaits) (0x20b6f20) 0x20b67f0 vpscat.c:2738
ERIC DBG MPI_Recv_init from->requests+0 (rwaits) (0x20b3540) 0x20b2e10 vpscat.c:2744
ERIC DBG MPI_Recv_init to->rev_rwaits+0 (0x20b2750) 0x20b67f0 vpscat.c:2748
ERIC DBG MPI_Send_init to->requests+0(swaits) (0x20ce120) 0x20cd9f0 vpscat.c:2738
ERIC DBG MPI_Recv_init to->rev_rwaits+0 ((nil)) 0x20cd9f0 vpscat.c:2748
ERIC DBG MPI_Send_init to->requests+0(swaits) (0x20dab80) 0x20da450 vpscat.c:2738
ERIC DBG MPI_Recv_init to->rev_rwaits+0 ((nil)) 0x20da450 vpscat.c:2748
ERIC DBG0 count = 11 vscat.c:1857
ERIC DBG0 count = 11 vscat.c:1857
ERIC DBG0 count = 11 vscat.c:1857
ERIC DBG0 count = 11 vscat.c:1857
ERIC DBG0 count = 5 vscat.c:1857
ERIC DBG0 count = 1 vscat.c:1857
ERIC DBG01 count = 1  vscat.c:1866
ERIC DBG0 count = 1 vpscat.c:154
ERIC DBG count = 1 vpscat.c:216
ERIC DBG MPI_Request_free to->requests+0(0x202ff50) to(0x202f820) vpscat.c:217
[eric@lorien] Test.ProblemeGD (master $ u+1)> grep ERIC Resultats.hooke_3d_pen_8Hexa8_parallele/Cout.hooke_3d_pen_8Hexa8_parallele.n01.txt 
ERIC DBG MPI_Send_init from->rev_requests+0 (0x34166d0) 0x341a1e0 vpscat.c:2725
ERIC DBG MPI_Recv_init from->requests+0 (rwaits) (0x341a910) 0x341a1e0 vpscat.c:2744
ERIC DBG MPI_Send_init from->rev_requests+0 (0x3497c00) 0x34982c0 vpscat.c:2725
ERIC DBG MPI_Send_init to->requests+0(swaits) (0x349c3d0) 0x349bca0 vpscat.c:2738
ERIC DBG MPI_Recv_init from->requests+0 (rwaits) (0x34989f0) 0x34982c0 vpscat.c:2744
ERIC DBG MPI_Recv_init to->rev_rwaits+0 (0x3497c00) 0x349bca0 vpscat.c:2748
ERIC DBG MPI_Send_init from->rev_requests+0 (0x34b08d0) 0x34b43e0 vpscat.c:2725
ERIC DBG MPI_Recv_init from->requests+0 (rwaits) (0x34b4b10) 0x34b43e0 vpscat.c:2744
ERIC DBG MPI_Send_init from->rev_requests+0 (0x34bcc00) 0x34c0710 vpscat.c:2725
ERIC DBG MPI_Recv_init from->requests+0 (rwaits) (0x34c0e40) 0x34c0710 vpscat.c:2744
ERIC DBG0 count = 11 vscat.c:1857
ERIC DBG0 count = 11 vscat.c:1857
ERIC DBG0 count = 11 vscat.c:1857
ERIC DBG0 count = 11 vscat.c:1857
ERIC DBG0 count = 5 vscat.c:1857
ERIC DBG0 count = 1 vscat.c:1857
ERIC DBG01 count = 1  vscat.c:1866
ERIC DBG0 count = 1 vpscat.c:154
ERIC DBG5 count = 1 vpscat.c:239
ERIC DBG MPI_Request_free from->requests+0(0x341a910) from(0x341a1e0) vpscat.c:240
hjelmn commented 8 years ago

The 11's look right to me. 1 + 10 requests. The 5 and 1's seem wrong.

hjelmn commented 8 years ago

Can you do the watchpoint on the MPI_DOUBLE count? A backtrace any time the count is decremented would be helpful.

ericch1 commented 8 years ago

ok, I will try tomorrow if the old ddd wants to do this... :)

hjelmn commented 8 years ago

alright. let me know.

bosilca commented 8 years ago

If we assume that the issue is not with our requests, then the only other place where we manipulate the refcount of the datatypes is the part of the code that handles get_content. @ericch1 are you using any non-predefined datatype (one that is manually created via one of the MPI_Type_* functions)?
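
To be clear, by non-predefined I mean anything built and committed by hand, along these lines:

MPI_Datatype three_doubles;

MPI_Type_contiguous(3, MPI_DOUBLE, &three_doubles);   /* derived type built from MPI_DOUBLE;
                                                         the implementation may keep an internal
                                                         reference to the base type here, which is
                                                         exactly the refcount path in question */
MPI_Type_commit(&three_doubles);

/* ... used in MPI_Isend / MPI_Irecv, or as an etype/filetype in MPI I/O ... */

MPI_Type_free(&three_doubles);                        /* releases the derived type */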

ericch1 commented 8 years ago

Yes we do, in multiple places!!!

For many MPI_Isend/Irecv calls and also in MPI I/O.

ggouaillardet commented 8 years ago

@ericch1 can you please apply my patch available at https://gist.github.com/ggouaillardet/2bd0a27227d067e65e2fc1a342f6acc9 on top of your latest patch?

It fixes the typo at line 2748 and prints some more stuff.

ericch1 commented 8 years ago

Ok I applied the patch, here is the output from rank 0 now:

ERIC DBGX count = 1 on vectors 0x27d4b70 0x27d7820 0x27d3c20
ERIC DBG MPI_Send_init to->requests+0/1(swaits) (0x27d7f50) 0x27d7820 1 -> 2 vpscat.c:2748
ERIC DBG MPI_Recv_init to->rev_rwaits+0/1 (0x27d4510) 0x27d7820 2 -> 3 vpscat.c:2760
ERIC DBGX end count = 3 on vectors 0x27d4b70 0x27d7820 0x27d3c20 /home/mefpp_ericc/depots_prepush/ompi2_pr_1875/open-mpi-v2.x-debug/petsc-3.7.3-debug/src/vec/vec/utils/vpscat.c:2839
ERIC DBGX count = 3 on vectors 0x285ae10 0x285e7f0 0x2859140
ERIC DBG MPI_Send_init from->rev_requests+0/1 (0x285a750) 0x285ae10 3 -> 4 vpscat.c:2734
ERIC DBG MPI_Send_init to->requests+0/1(swaits) (0x285ef20) 0x285e7f0 4 -> 5 vpscat.c:2748
ERIC DBG MPI_Recv_init from->requests+0/1 (rwaits) (0x285b540) 0x285ae10 5 -> 6 vpscat.c:2755
ERIC DBG MPI_Recv_init to->rev_rwaits+0/1 (0x285a0f0) 0x285e7f0 6 -> 7 vpscat.c:2760
ERIC DBGX end count = 7 on vectors 0x285ae10 0x285e7f0 0x2859140 /home/mefpp_ericc/depots_prepush/ompi2_pr_1875/open-mpi-v2.x-debug/petsc-3.7.3-debug/src/vec/vec/utils/vpscat.c:2839
ERIC DBGX count = 7 on vectors 0x2872d40 0x28759f0 0x2871df0
ERIC DBG MPI_Send_init to->requests+0/1(swaits) (0x2876120) 0x28759f0 7 -> 8 vpscat.c:2748
ERIC DBG MPI_Recv_init to->rev_rwaits+0/1 (0x28726e0) 0x28759f0 8 -> 9 vpscat.c:2760
ERIC DBGX end count = 9 on vectors 0x2872d40 0x28759f0 0x2871df0 /home/mefpp_ericc/depots_prepush/ompi2_pr_1875/open-mpi-v2.x-debug/petsc-3.7.3-debug/src/vec/vec/utils/vpscat.c:2839
ERIC DBGX count = 9 on vectors 0x287f7a0 0x2882450 0x287e850
ERIC DBG MPI_Send_init to->requests+0/1(swaits) (0x2882b80) 0x2882450 9 -> 10 vpscat.c:2748
ERIC DBG MPI_Recv_init to->rev_rwaits+0/1 (0x287f140) 0x2882450 10 -> 11 vpscat.c:2760
ERIC DBGX end count = 11 on vectors 0x287f7a0 0x2882450 0x287e850 /home/mefpp_ericc/depots_prepush/ompi2_pr_1875/open-mpi-v2.x-debug/petsc-3.7.3-debug/src/vec/vec/utils/vpscat.c:2839
ERIC DBGX OK le compte est 11
ERIC DBG0 count = 11 vscat.c:1857
ERIC DBG0 count = 11 vscat.c:1857
ERIC DBG0 count = 11 vscat.c:1857
ERIC DBG0 count = 11 vscat.c:1857
ERIC DBG0 count = 5 vscat.c:1857
ERIC DBG0 5 == count! watch from here
ERIC DBG0 count = 1 vscat.c:1857
ERIC DBG01 count = 1  vscat.c:1868
ERIC DBG0 count = 1 on vector 0x27d3c20 vpscat.c:154
ERIC DBG count = 1 vpscat.c:222
ERIC DBG MPI_Request_free to->requests+0(0x27d7f50) to(0x27d7820) vpscat.c:223

but I did add a watch on the refcount after "ERIC DBG0 5 == count" was reached... and after many gdb "cont"s that raised the count to 6 and brought it back down to 5, it eventually came down to 4... from this precise line:

io_ompio_file_set_view.c:81: ompi_datatype_destroy (&fh->f_etype);

which is called at the end because we are in the process of writing the results!

Here is a more complete stack:

(gdb) bt
#0  0x00007fffdec14619 in opal_thread_add_32 (addr=0x7fffdef09bb0 , delta=-1) at ../../opal/threads/thread_usage.h:143
#1  0x00007fffdec14bec in opal_obj_update (inc=-1, object=0x7fffdef09ba0 ) at ../../opal/class/opal_object.h:511
#2  ompi_datatype_destroy (type=0x17effa0) at ompi_datatype_create.c:84
#3  0x00007fffc3df3ac1 in mca_io_ompio_set_view_internal (fh=0x17efeb0, disp=0, etype=0x7fffdef087a0 , filetype=0x7fffdef087a0 , datarep=0x7ffff148af7b "native", info=0x7fffdef1f360 ) at io_ompio_file_set_view.c:81
#4  0x00007fffc3df4164 in mca_io_ompio_file_set_view (fp=0x18ede80, disp=0, etype=0x7fffdef087a0 , filetype=0x7fffdef087a0 , datarep=0x7ffff148af7b "native", info=0x7fffdef1f360 ) at io_ompio_file_set_view.c:234
#5  0x00007fffdec6f3c2 in PMPI_File_set_view (fh=0x18ede80, disp=0, etype=0x7fffdef087a0 , filetype=0x7fffdef087a0 , datarep=0x7ffff148af7b "native", info=0x7fffdef1f360 ) at pfile_set_view.c:73
Python Exception  There is no member named _M_dataplus.: 
Python Exception  There is no member named _M_dataplus.: 
#6  0x00007ffff1355f0d in GISLectureEcriture::ecritGISMPI (this=0x7ffffffda518, pNomFichier=, pGIS=..., pInfoSupp=) at /home/mefpp_ericc/depots_prepush/GIREF/src/commun/Maillage/GISLectureEcriture.icc:414
Python Exception  There is no member named _M_dataplus.: 
#7  0x00007ffff1343f7d in Champ::exporteParallele (this=0x1613720, pNomFichier=) at /home/mefpp_ericc/depots_prepush/GIREF/src/commun/Champs/Champ.cc:1500
Python Exception  There is no member named _M_dataplus.: 
#8  0x00007fffeea5207b in PiloteExportationGIREF::exporteInfo (this=0x7fffffff8078, pExportation=..., pPrefixeFichiers=) at /home/mefpp_ericc/depots_prepush/GIREF/src/commun/Interface/PiloteExportationGIREF.cc:215
Python Exception  There is no member named _M_dataplus.: 
#9  0x00007fffee673790 in Exportation::exporteInfo (this=0x7ffffffefb60, pPrefixeFichier=) at /home/mefpp_ericc/depots_prepush/GIREF/src/commun/Interface/Exportation.cc:2796
#10 0x00007fffeb7238cb in PPExportation::calcul (this=0x7fffffff8490) at /home/mefpp_ericc/depots_prepush/GIREF/src/commun/PrePostTraitement/PPExportation.cc:367
#11 0x00007fffeb5bbe04 in PPAppelFonctionGen::preCalcul (this=0x7fffffff8490) at /home/mefpp_ericc/depots_prepush/GIREF/src/commun/PrePostTraitement/PPAppelFonctionGen.cc:170
#12 0x00007fffeb83086a in PPTraitementGen::effectueCalcul (this=0x7fffffff8490) at /home/mefpp_ericc/depots_prepush/GIREF/src/commun/PrePostTraitement/PPTraitementGen.cc:1105
#13 0x00007fffeb5a74ee in GestionPrePostTraitement::executeTraitement (this=0x7ffffffe6740, pTraitement=..., pNoPasDeTemps=2, pNoSousPas=-2) at /home/mefpp_ericc/depots_prepush/GIREF/src/commun/PrePostTraitement/GestionPrePostTraitement.cc:917
#14 0x00007fffeb5a7ed2 in GestionPrePostTraitement::executePrePostTraitement (this=0x7ffffffe6740, pNomDeLaListePP=, pNoPasDeTemps=2, pNoSousPas=-2) at /home/mefpp_ericc/depots_prepush/GIREF/src/commun/PrePostTraitement/GestionPrePostTraitement.cc:622
#15 0x00007fffefc5d71b in ProblemeEF::executePrePostTraitement (this=0x7ffffffe6738, pPositionDeLAppel=TraitsProblemeEFPrePost::eFin, pNoPasDeTemps=2, pNoSousPas=-2) at /home/mefpp_ericc/depots_prepush/GIREF/src/commun/Formulation/ProblemeEF.cc:1669
#16 0x00007fffebe06458 in SolveurStatNlinPETSc::resoudre (this=0x13c67f0, pProbleme=...) at /home/mefpp_ericc/depots_prepush/GIREF/src/commun/Petsc/SolveurStatNlinPETSc.cc:543
#17 0x00007fffefd0c6d5 in ProblemeGD::resoudrePas (this=0x7ffffffdde50) at /home/mefpp_ericc/depots_prepush/GIREF/src/commun/Formulation/ProblemeGD.cc:10551
#18 0x00007fffefd0ba3d in ProblemeGD::resoudre (this=0x7ffffffdde50) at /home/mefpp_ericc/depots_prepush/GIREF/src/commun/Formulation/ProblemeGD.cc:10442
#19 0x00000000004330df in main (pArgc=2, pArgv=0x7fffffffa338) at /home/mefpp_ericc/depots_prepush/GIREF/app/src/Test.Formulation/Test.ProblemeGD.icc:189
(gdb) 

So maybe the bug is in MPI I/O?

edgargabriel commented 8 years ago

Argh, yes I see the problem. It was introduced when we fixed the get_fileview problem reported earlier this summer. The etype is not duplicated anymore if it is a basic datatype. I will provide a patch in a couple of minutes.

That being said, I did originally check the trace files that you posted, and did not see any calls to MPI I/O routines.
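
For the record, something along these lines should exercise the same path (an untested sketch; the file name is arbitrary):

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "reprod.dat",
                  MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);
    /* each set_view with a predefined etype goes through datatype_duplicate();
       without the missing retain, the later ompi_datatype_destroy() calls
       drain MPI_DOUBLE's reference count */
    for (int i = 0; i < 100; ++i) {
        MPI_File_set_view(fh, 0, MPI_DOUBLE, MPI_DOUBLE, "native", MPI_INFO_NULL);
    }
    MPI_File_close(&fh);

    double buf[1];
    MPI_Request req;
    MPI_Recv_init(buf, 1, MPI_DOUBLE, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &req);
    MPI_Request_free(&req);   /* with the refcount already wrong, this is where
                                 the original backtrace blows up */

    MPI_Finalize();
    return 0;
}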

rhc54 commented 8 years ago

I very much doubt that ompitrace covered the MPI I/O calls - might be worth extending it. Probably needs to cover other functions added since I last touched it 5-8 years ago.

ericch1 commented 8 years ago

Maybe I misused the ompitrace lib:

I just added it to the link... What is the proper way to use it?

ggouaillardet commented 8 years ago

@ericch1 can you please give it another try with romio instead of ompio? Just to make sure there is not another issue:

mpirun --mca io romio314 ...

ericch1 commented 8 years ago

Success!!! :)

rhc54 commented 8 years ago

ompitrace is intended to provide you with exactly what it did - a list of MPI calls being made by your program. Sadly, it hasn't been maintained over the years as the standard has added new functions, nor was it fully comprehensive at the start. I covered the functions of interest at the time.

hppritcha commented 8 years ago

@edgargabriel if you could open a PR on master to fix this, @jsquyres and I can better decide whether it should go into 2.0.1 or wait until 2.0.2.

edgargabriel commented 8 years ago

OK, but it will have to be a custom patch for v2.x, since this part of the code is now in a different location.

The bigger issue I have is that I think I understand what the problem is, but no matter what I try I cannot get the reference counter for the datatype to go below one.

edgargabriel commented 8 years ago

I am not sure what the easiest way is to provide you with a patch to check whether my solution fixes this problem. I have a branch in my private repo git@github.com:edgargabriel/ompi-release.git, namely pr/datatype-refcount-fix.

Alternatively, here is the patch (it is actually just a one-liner); it might be easier to apply it manually:

diff --git a/ompi/mca/io/ompio/io_ompio_file_set_view.c b/ompi/mca/io/ompio/io_ompio_file_set_view.c
index f22e9c7..079283f 100644
--- a/ompi/mca/io/ompio/io_ompio_file_set_view.c
+++ b/ompi/mca/io/ompio/io_ompio_file_set_view.c
@@ -44,6 +44,7 @@ static int datatype_duplicate  (ompi_datatype_t *oldtype, ompi_datatype_t **newt
 {
     ompi_datatype_t *type;
     if( ompi_datatype_is_predefined(oldtype) ) {
+        OBJ_RETAIN(oldtype);
         *newtype = oldtype;
         return OMPI_SUCCESS;
     }

Could you give it a shot and let me know whether this fixes the issue?

EDIT: Used the Markdown "patch" markers
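
To try it manually, something like this should do (tree layout and patch filename are assumptions; as noted above, the v2.x branch needs an adapted patch because the file lives elsewhere there):

cd ompi                                   # your Open MPI master checkout
git apply datatype-refcount-fix.patch     # or just add the OBJ_RETAIN line by hand
make && make install                      # rebuild and reinstall before re-running the test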