Closed biddisco closed 1 month ago
Could you check using the latest mpich releases?
Apologies for posting with an older mpich. With mpich@4.2.2 the error is:
MPI_Init_thread()
set_error_handler()
test_exception()
exception_thrown 1
free_error_handler()
Assertion failed in file src/binding/c/c_binding.c at line 41884: 0
/home/biddisco/opt/spack.git/opt/spack/linux-pop22-skylake/gcc-12.3.0/mpich-4.2.2-v2mwvqzohg3rjhhtbdxaji7dehrlwvbu/lib/libmpi.so.12(+0x323319) [0x784340923319]
/home/biddisco/opt/spack.git/opt/spack/linux-pop22-skylake/gcc-12.3.0/mpich-4.2.2-v2mwvqzohg3rjhhtbdxaji7dehrlwvbu/lib/libmpi.so.12(+0x267138) [0x784340867138]
/home/biddisco/opt/spack.git/opt/spack/linux-pop22-skylake/gcc-12.3.0/mpich-4.2.2-v2mwvqzohg3rjhhtbdxaji7dehrlwvbu/lib/libmpi.so.12(PMPI_Errhandler_free+0x2e0) [0x7843406ec420]
./mpich_error_reproducer(+0x3929) [0x5efd3572b929]
./mpich_error_reproducer(+0x3ad8) [0x5efd3572bad8]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x78433fe29d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x78433fe29e40]
./mpich_error_reproducer(+0x35c5) [0x5efd3572b5c5]
internal ABORT - process 0
[unset]: PMIU_write error; fd=-1 buf=:cmd=abort exitcode=1 message=internal ABORT - process 0
:
system msg for write_line failure : Bad file descriptor
I see. The exception jumps out of the middle of a critical section without unlocking it, which causes locking failures on any later MPI call.
I don't have a solution. Exceptions are currently not supported by MPICH.
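To illustrate the failure mode described above, here is a minimal sketch of a C-style critical section that invokes a user callback. It is not MPICH code: `g_lock` and `try_enter` are hypothetical names, and the lock is a plain `std::atomic_flag` standing in for whatever the library uses internally.

```cpp
#include <atomic>
#include <stdexcept>

// Hypothetical stand-in for a library-internal lock (not MPICH's actual lock).
static std::atomic_flag g_lock = ATOMIC_FLAG_INIT;

// Runs user_handler inside a manually locked critical section.
// Returns false without calling the handler if the lock is already held.
inline bool try_enter(void (*user_handler)()) {
    if (g_lock.test_and_set(std::memory_order_acquire)) {
        return false;  // lock still held from an earlier call
    }
    user_handler();    // if this throws, the unlock below is skipped
    g_lock.clear(std::memory_order_release);
    return true;
}
```

If the handler throws, control leaves `try_enter` before `clear()` runs, so every later call finds the lock still held. A blocking lock in the same position would hang or fail an assertion instead of returning false.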
ok thanks for the feedback. Can I manually unlock the mutex before throwing the exception? (is there either a public facing api call that would do that directly, or perhaps a call that would have a side effect of doing that?)
Not really. However, I don't see why we need to call the user handler inside a critical section. Potentially we can fix it.
Alternatively, you can create a wrapper for MPI calls and use MPI_ERRORS_RETURN, and throw the exception in the wrapper.
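A rough sketch of that wrapper idea follows. The names `call_failed` and `checked` are made up for illustration, and the helper is kept free of `<mpi.h>` so it compiles on its own. The intended MPI usage is shown in the trailing comment and assumes the communicator's error handler has been set to MPI_ERRORS_RETURN.

```cpp
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical exception type that carries the raw return code.
class call_failed : public std::runtime_error {
public:
    call_failed(const std::string& name, int code)
        : std::runtime_error(name + " failed with code " + std::to_string(code)),
          code_(code) {}
    int code() const noexcept { return code_; }
private:
    int code_;
};

// Calls fn(args...) and throws only after the call has returned,
// so no exception propagates through the library's own frames.
template <typename Fn, typename... Args>
void checked(int success, const char* name, Fn&& fn, Args&&... args) {
    const int rc = std::forward<Fn>(fn)(std::forward<Args>(args)...);
    if (rc != success) {
        throw call_failed(name, rc);
    }
}

// Intended use with MPI (not compiled here; requires <mpi.h>):
//   MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
//   checked(MPI_SUCCESS, "MPI_Barrier", MPI_Barrier, MPI_COMM_WORLD);
```

Because the exception is raised in user code after the MPI call has returned its error code, the library gets to release any internal locks before the stack unwinds.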
ok, thanks again. It isn't a huge show stopper if we can't throw exceptions - I will close this issue and we'll fix our error handling to do something else when using mpich.
This bug appears to be related to #268. In our applications we see lockups on exit when MPI_Init_thread is used, which do not appear when MPI_Init is used. A simple reproducer does not lock up or hang, but it does produce error messages with MPI_Init_thread that go away with MPI_Init alone. The same reproducer runs without error using openmpi.
Tested with spack-installed mpich@3.2.1
generated output showing error message from mpich
test program
mpich_error_reproducer.cpp
cmake for convenience since this is a cpp test