jsquyres closed this issue 8 years ago
@bosilca @ericch1 I'm guessing you guys don't need the webex I setup for today, then. I'll send a cancelation.
Great! It fixes the bug!!! Hurray! :)
great! will file a pr for both master and v2.x in a bit.
Interesting.
Excellent, we finally got an answer to this problem! Thanks @edgargabriel
@edgargabriel just curious; in which commit of OMPI I/O was this introduced?
it took me a while, but I understand now why this never happened in my tests. ompi_datatype_destroy (what we are using in that function in ompio) explicitly protects against this scenario:
if( ompi_datatype_is_predefined(pData) && (pData->super.super.obj_reference_count <= 1) )
return OMPI_ERROR;
that's why it never segfaulted in ompio. However, other parts of the code probably just do an OBJ_RELEASE, which has no protection preventing the counter from dropping below one.
this is not in ROMIO, but in ompio. It was introduced in:
io/ompio: file_getview and file_preallocate fixes #1646
And the reason we do not duplicate basic datatypes in this scenario is that if the user calls file_get_view on the current file view, the returned datatype has to be a predefined datatype (if that is what was used during file_set_view). If we duplicated the predefined type, the combiner of that datatype would report 'duplicate' instead of 'predefined'.
@edgargabriel Any chance you can get a v2.x PR today / before the nightly snapshot?
just did that on v2.x, will do master in a minute.
@edgargabriel, sorry, my bad 😊. I fixed the comment.
no problem :-)
Good news: this morning, all 46 tests that were failing yesterday are now ok! So the forthcoming 2.0.1 will be compatible "out-of-the-box" with our software! Thanks to all! Our tests database (~2200 tests) will still be launched automatically against the ompi-release/v2.x branch every night to ensure fast feedback.
Sweet!
You might want to blacklist Open MPI v2.0.0 in your configury, just to avoid unexpected user problems. There should be fairly robust ways to determine Open MPI's version (via ompi_info, or by checking OPEN_MPI and OMPI_MAJOR_VERSION, OMPI_MINOR_VERSION, and OMPI_RELEASE_VERSION in mpi.h).
Hehe...
just for "fun" here is what I did:
First, verify at compile time that the MPI version is compatible via the OMPI_*_VERSION #defines.
Second, because the ABI of 2.0.1 is compatible with 2.0.0, the user could swap the dynamic libs after compilation. So at runtime I verify that the compiled-against version equals the beginning of the string returned by the MPI_Get_library_version function, which comes from the ".so"...
So, in a way, I deliberately tie the version of the lib used at compile time to the one used at runtime... So no bad surprises... :)
Thanks for the good suggestion!
Eric
Or simply blacklist ompio if Open MPI 2.0.0 is detected:
mpirun --mca io ^ompio
or
export OMPI_MCA_io=^ompio
Btw, if you are running on Lustre, ROMIO is used instead of ompio.
I am not saying the workaround is to run on a Lustre filesystem; I am just pointing out that the bug might appear to have been automagically fixed when it is merely hidden.
This was fixed in v2.0.1.
According to Eric Chamberland in http://www.open-mpi.org/community/lists/devel/2016/07/19210.php, he's getting a failure in a PETSc test. Here's the backtrace:
@hjelmn @bosilca Could you have a look?