open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Petsc test failing: possible MPI_REQUEST_FREE issue #1875

Closed jsquyres closed 8 years ago

jsquyres commented 8 years ago

According to Eric Chamberland in http://www.open-mpi.org/community/lists/devel/2016/07/19210.php, he's getting a failure in a petsc test. Here's the backtrace:

*** Error in `/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/bin/Test.ProblemeGD.opt': free(): invalid pointer: 0x00007f9ab09c6020 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x7277f)[0x7f9ab019b77f]
/lib64/libc.so.6(+0x78026)[0x7f9ab01a1026]
/lib64/libc.so.6(+0x78d53)[0x7f9ab01a1d53]
/opt/openmpi-2.x_opt/lib/openmpi/mca_pml_ob1.so(+0x172a1)[0x7f9aa3df32a1]
/opt/openmpi-2.x_opt/lib/libmpi.so.0(MPI_Request_free+0x4c)[0x7f9ab0761dac]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(+0x4adaf9)[0x7f9ab7fa2af9]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(VecScatterDestroy+0x68d)[0x7f9ab7f9dc35]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(+0x4574e7)[0x7f9ab7f4c4e7]
/opt/petsc-3.7.2_debug_openmpi_2.x/lib/libpetsc.so.3.7(VecDestroy+0x648)[0x7f9ab7ef28ca]
/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/lib/libgiref_opt_Petsc.so(_Z15GIREFVecDestroyRP6_p_Vec+0xe)[0x7f9abc9746de]
/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/lib/libgiref_opt_Petsc.so(_ZN12VecteurPETScD1Ev+0x31)[0x7f9abca8bfa1]
/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/lib/libgiref_opt_Petsc.so(_ZN10SolveurGCPD2Ev+0x20c)[0x7f9abc9a013c]
/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/lib/libgiref_opt_Petsc.so(_ZN10SolveurGCPD0Ev+0x9)[0x7f9abc9a01f9]
/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/lib/libgiref_opt_Formulation.so(_ZN10ProblemeGDD2Ev+0x42)[0x7f9abeeb94e2]
/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/bin/Test.ProblemeGD.opt[0x4159b9]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f9ab014ab25]
/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/bin/Test.ProblemeGD.opt[0x4084dc]

@hjelmn @bosilca Could you have a look?

jsquyres commented 8 years ago

@bosilca @ericch1 I'm guessing you guys don't need the webex I set up for today, then. I'll send a cancellation.

ericch1 commented 8 years ago

Great! It fixes the bug!!! Hooray! :)

edgargabriel commented 8 years ago

great! will file a pr for both master and v2.x in a bit.

jladd-mlnx commented 8 years ago

Interesting.

bosilca commented 8 years ago

Excellent, we finally got an answer to this problem! Thanks @edgargabriel

jladd-mlnx commented 8 years ago

@edgargabriel just curious; in which commit of OMPI I/O was this introduced?

edgargabriel commented 8 years ago

it took me a while, but I understand now why this never happened in my tests. ompi_datatype_destroy (which is what we use in that function in ompio) explicitly protects against this scenario:

    if( ompi_datatype_is_predefined(pData) && (pData->super.super.obj_reference_count <= 1) )
        return OMPI_ERROR;

that's why it never segfaulted in ompio. However, other parts of the code probably just do an OBJ_RELEASE, and that has no such protection against the reference count dropping below one.
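
To illustrate the difference described above, here is a minimal toy model in C. It is only a sketch: `toy_type`, `toy_release`, and `toy_destroy` are hypothetical names invented for this illustration and are not Open MPI's actual data structures or functions. The guarded path refuses to drop the last reference of a predefined type, while the unguarded path decrements unconditionally and would run the destructor on an object that must never be freed.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a reference-counted datatype object
 * (NOT Open MPI's real structure). */
typedef struct {
    int  refcount;
    bool predefined;  /* true for built-in types that must never be freed */
    bool destructed;  /* set when the destructor would have run */
} toy_type;

/* Unguarded release: decrements unconditionally and "destroys" the
 * object when the count reaches zero. */
static void toy_release(toy_type *t)
{
    if (--t->refcount == 0) {
        t->destructed = true;  /* a real implementation would free memory here */
    }
}

/* Guarded destroy: refuses to drop the last reference of a predefined
 * type, in the spirit of the check quoted above. */
static int toy_destroy(toy_type *t)
{
    if (t->predefined && t->refcount <= 1) {
        return -1;  /* error: the last reference of a predefined type stays */
    }
    toy_release(t);
    return 0;
}
```

With a predefined object whose count is 1, `toy_destroy` returns an error and leaves the object untouched, whereas `toy_release` drives the count to zero and marks it destructed. In this toy model, that second outcome corresponds to the invalid `free()` seen in the backtrace.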

edgargabriel commented 8 years ago

this is not in ROMIO, but in ompio. It was introduced by:

io/ompio: file_getview and file_preallocate fixes #1646

And the reason why we do not duplicate basic datatypes in this scenario is that if the user calls file_get_view on the current file view, the returned datatype has to be a predefined datatype (if that is what was used during file_set_view). If we duplicated the predefined type, the combiner of that datatype would say 'duplicate' instead of 'predefined'.

jsquyres commented 8 years ago

@edgargabriel Any chance you can get a v2.x PR today / before the nightly snapshot?

edgargabriel commented 8 years ago

just did that on 2.x, will do the master in a minute.

jladd-mlnx commented 8 years ago

@edgargabriel, sorry, my bad 😊. I fixed the comment.

edgargabriel commented 8 years ago

no problem :-)

ericch1 commented 8 years ago

Good news: this morning, all 46 tests that were failing yesterday are now OK! So the forthcoming 2.0.1 will be compatible "out-of-the-box" with our software! Thanks to all! Our test database (~2200 tests) will still be launched automatically against the ompi-release/v2.x branch every night to ensure fast feedback.

jsquyres commented 8 years ago

Sweet!

You might want to blacklist Open MPI v2.0.0 in your configury, just to avoid unexpected user problems. There should be fairly robust ways to determine Open MPI's version (via ompi_info or by checking OPEN_MPI and OMPI_MAJOR_VERSION, OMPI_MINOR_VERSION, and OMPI_RELEASE_VERSION in mpi.h).

ericch1 commented 8 years ago

Hehe...

just for "fun" here is what I did:

First, at compile time, I verify that the MPI version is a compatible one, using the OMPI_*_VERSION #defines.

Second, because the ABI of 2.0.1 is compatible with 2.0.0, the user could swap the dynamic libs after compilation. So at runtime I verify that the compiled version matches the beginning of the string returned by the MPI_Get_library_version function, which lives in the ".so".

So, in a way, I deliberately tie the version of the lib used at compile time to the one used at runtime... So no bad surprises... :)
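
The two checks described above might look roughly like the following. This is an illustrative sketch, not the code actually used: the helper name `library_version_matches` is hypothetical, and it assumes that Open MPI's version string begins with `Open MPI vX.Y.Z`. The compile-time part only takes effect when Open MPI's `mpi.h` has been included before it.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Compile-time check: reject Open MPI 2.0.0. The macros come from
 * Open MPI's <mpi.h>; without that header this block is skipped. */
#if defined(OPEN_MPI)
#  if OMPI_MAJOR_VERSION == 2 && OMPI_MINOR_VERSION == 0 && OMPI_RELEASE_VERSION == 0
#    error "Open MPI 2.0.0 is not supported; use 2.0.1 or later"
#  endif
#endif

/* Runtime check (hypothetical helper): returns 1 if library_version
 * starts with "Open MPI v<major>.<minor>.<release>", 0 otherwise.
 * ASSUMPTION: the version string uses that prefix format. */
static int library_version_matches(const char *library_version,
                                   int major, int minor, int release)
{
    char expected[64];
    int n = snprintf(expected, sizeof expected, "Open MPI v%d.%d.%d",
                     major, minor, release);
    if (n < 0 || (size_t)n >= sizeof expected) {
        return 0;
    }
    if (strncmp(library_version, expected, (size_t)n) != 0) {
        return 0;
    }
    /* Reject e.g. "2.0.10" when "2.0.1" was expected. */
    return !isdigit((unsigned char)library_version[n]);
}

/* Intended use in an MPI program (not compiled here):
 *
 *   char buf[MPI_MAX_LIBRARY_VERSION_STRING];
 *   int len;
 *   MPI_Get_library_version(buf, &len);
 *   if (!library_version_matches(buf, OMPI_MAJOR_VERSION,
 *                                OMPI_MINOR_VERSION, OMPI_RELEASE_VERSION))
 *       abort with an error message;
 */
```

Comparing against the string reported by the loaded library, rather than only the header macros, is what catches a shared library swapped in after compilation.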

Thanks for the good suggestion!

Eric



ggouaillardet commented 8 years ago

Or simply blacklist ompio if Open MPI 2.0.0 is detected: either run mpirun --mca io ^ompio, or export OMPI_MCA_io=^ompio

ggouaillardet commented 8 years ago

Btw, if you are running on lustre, romio is used instead of ompio.

I am not saying a workaround is to run on a lustre filesystem; I am just pointing out that the bug might appear to have been automagically fixed when it is just hidden.

jsquyres commented 8 years ago

This was fixed in v2.0.1.