pmodels / mpich

Official MPICH Repository
http://www.mpich.org

MPI_Init_thread and MPI_Comm_create_errhandler causes error messages on exit #7187

Closed: biddisco closed this issue 1 month ago

biddisco commented 1 month ago

This bug appears to be related to #268. In our applications we see lockups on exit when MPI_Init_thread is used, which do not appear when MPI_Init is used. A simple reproducer does not lock up or hang, but it does print error messages with MPI_Init_thread that go away with MPI_Init alone. The same reproducer runs without error using Open MPI.

Tested with Spack-installed mpich@3.2.1.

Generated output showing the error message from MPICH:

MPI_Init_thread()
set_error_handler()
test_exception()
exception_thrown 1
free_error_handler()
MPI_Finalize()
Error in system call pthread_mutex_destroy: Device or resource busy
    src/mpi/init/initthread.c:225
Assertion failed in file src/mpi/init/initthread.c at line 226: err == 0
internal ABORT - process 0
[cli_0]: write_line error; fd=6 buf=:cmd=abort exitcode=1
:
system msg for write_line failure : Bad file descriptor

Test program mpich_error_reproducer.cpp:

#include <exception>
#include <iostream>
#include <stdexcept>    // std::runtime_error
#include <string>
//
#include "mpi.h"

// -------------------------------------------------------------
// mpi::exception boilerplate
namespace pika::mpi {
    namespace detail {
        std::string error_message(int code)
        {
            // MPI requires a buffer of at least MPI_MAX_ERROR_STRING characters
            char buff[MPI_MAX_ERROR_STRING] = {0};
            int len = 0;
            MPI_Error_string(code, buff, &len);
            return std::string(buff, len);
        }
    }    // namespace detail

    struct exception : std::runtime_error
    {
        explicit exception(int err_code, const std::string& msg = "")
          : std::runtime_error(
                msg + std::string(" MPI returned with error: ") + detail::error_message(err_code))
          , err_code_(err_code)
        {
        }
        int get_mpi_errorcode() const noexcept { return err_code_; }

    protected:
        int err_code_;
    };

}    // namespace pika::mpi

// -------------------------------------------------------------
MPI_Errhandler pika_mpi_errhandler = MPI_ERRHANDLER_NULL;

// -------------------------------------------------------------
void pika_MPI_Handler(MPI_Comm*, int* errorcode, ...)
{
    throw pika::mpi::exception(*errorcode, "pika MPI error->exception handler");
}

// -------------------------------------------------------------
void set_error_handler()
{
    MPI_Comm_create_errhandler(pika_MPI_Handler, &pika_mpi_errhandler);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, pika_mpi_errhandler);
}

void free_error_handler()
{
    MPI_Errhandler_free(&pika_mpi_errhandler);
    pika_mpi_errhandler = MPI_ERRHANDLER_NULL;
}

// -------------------------------------------------------------
void test_exception()
{
    // Invalid arguments (null datatype, root -1) make MPI_Bcast invoke the error handler
    int *data = nullptr, count = 0;
    bool exception_thrown = false;
    try
    {
        MPI_Bcast(data, count, MPI_DATATYPE_NULL, -1, MPI_COMM_WORLD);
    }
    catch (std::runtime_error const&)
    {
        exception_thrown = true;
    }
    std::cout << "exception_thrown " << exception_thrown << std::endl;
}

// -------------------------------------------------------------
int main(int argc, char* argv[])
{
    int provided, preferred = MPI_THREAD_MULTIPLE;

#if 1
    // causes error
    std::cout << "MPI_Init_thread()" << std::endl;
    MPI_Init_thread(&argc, &argv, preferred, &provided);
#else
    // does not cause error
    std::cout << "MPI_Init()" << std::endl;
    MPI_Init(&argc, &argv);
#endif

    std::cout << "set_error_handler()" << std::endl;
    set_error_handler();

    std::cout << "test_exception()" << std::endl;
    test_exception();

    std::cout << "free_error_handler()" << std::endl;
    free_error_handler();

    std::cout << "MPI_Finalize()" << std::endl;
    MPI_Finalize();

    std::cout << "exit" << std::endl;
    return 0;
}

CMake file for convenience, since this is a C++ test:

cmake_minimum_required(VERSION 3.12)
project (cpptest C CXX)

find_package(MPI REQUIRED)

include_directories(${MPI_INCLUDE_PATH})
message(STATUS "MPI found version ${MPI_CXX_VERSION}")

set(MPI_TEST_SRCS
  mpich_error_reproducer.cpp
)

add_executable(mpich_error_reproducer mpich_error_reproducer.cpp)
target_link_libraries(mpich_error_reproducer MPI::MPI_CXX)

hzhou commented 1 month ago

Could you check using the latest mpich releases?

biddisco commented 1 month ago

Apologies for posting with an older MPICH. With mpich@4.2.2 the error is:

MPI_Init_thread()
set_error_handler()
test_exception()
exception_thrown 1
free_error_handler()
Assertion failed in file src/binding/c/c_binding.c at line 41884: 0
/home/biddisco/opt/spack.git/opt/spack/linux-pop22-skylake/gcc-12.3.0/mpich-4.2.2-v2mwvqzohg3rjhhtbdxaji7dehrlwvbu/lib/libmpi.so.12(+0x323319) [0x784340923319]
/home/biddisco/opt/spack.git/opt/spack/linux-pop22-skylake/gcc-12.3.0/mpich-4.2.2-v2mwvqzohg3rjhhtbdxaji7dehrlwvbu/lib/libmpi.so.12(+0x267138) [0x784340867138]
/home/biddisco/opt/spack.git/opt/spack/linux-pop22-skylake/gcc-12.3.0/mpich-4.2.2-v2mwvqzohg3rjhhtbdxaji7dehrlwvbu/lib/libmpi.so.12(PMPI_Errhandler_free+0x2e0) [0x7843406ec420]
./mpich_error_reproducer(+0x3929) [0x5efd3572b929]
./mpich_error_reproducer(+0x3ad8) [0x5efd3572bad8]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x78433fe29d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x78433fe29e40]
./mpich_error_reproducer(+0x35c5) [0x5efd3572b5c5]
internal ABORT - process 0
[unset]: PMIU_write error; fd=-1 buf=:cmd=abort exitcode=1 message=internal ABORT - process 0
:
system msg for write_line failure : Bad file descriptor

hzhou commented 1 month ago

I see. The exception jumps out of the middle of a critical section without unlocking it, which causes later locking failures on the next MPI call.

I don't have a solution. Exceptions are currently not supported by MPICH.
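
To illustrate the mechanism described above, here is a minimal sketch. It assumes a C-style routine that locks and unlocks by hand around a user callback. The names `toy_lock`, `library_call`, `throwing_handler`, and `lock_left_held` are hypothetical and only stand in for the pattern; this is not MPICH code.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a library-internal lock (not MPICH code).
// A plain flag keeps the demo deterministic and single-threaded.
struct toy_lock
{
    bool held = false;
    void lock() { held = true; }
    void unlock() { held = false; }
};

toy_lock global_lock;

// Mimics a C-style library routine: manual lock/unlock, no RAII guard.
void library_call(void (*user_handler)(int), int errorcode)
{
    global_lock.lock();
    user_handler(errorcode);    // if this throws, the unlock below is skipped
    global_lock.unlock();
}

// User callback that turns an error code into a C++ exception.
void throwing_handler(int errorcode)
{
    throw std::runtime_error("error " + std::to_string(errorcode));
}

// Returns true if the lock is still held after the exception has propagated.
bool lock_left_held()
{
    try
    {
        library_call(throwing_handler, 1);
    }
    catch (const std::runtime_error&)
    {
        // The caller recovers, but the library's lock was never released.
    }
    return global_lock.held;
}
```

In this sketch `lock_left_held()` reports that the lock stays held, so any later call that tries to take or destroy it would misbehave, which is consistent with the failures reported at MPI_Errhandler_free and MPI_Finalize.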

biddisco commented 1 month ago

OK, thanks for the feedback. Can I manually unlock the mutex before throwing the exception? Is there either a public-facing API call that would do that directly, or perhaps a call that would have the side effect of doing that?

hzhou commented 1 month ago

Not really. However, I don't see why we need to call the user handler inside a critical section. Potentially we can fix it.

Alternatively, you can create a wrapper for MPI calls and use MPI_ERRORS_RETURN, and throw the exception in the wrapper.
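
The wrapper approach suggested above could look roughly like the sketch below. The names `mpi_error` and `check_mpi` are hypothetical and not part of MPI or MPICH. To keep the block compilable without an MPI installation, it compares the return code against 0, which is the value the MPI standard assigns to MPI_SUCCESS; with mpi.h included you would compare against MPI_SUCCESS and could build the message with MPI_Error_string.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical exception type that carries the raw MPI error code.
struct mpi_error : std::runtime_error
{
    mpi_error(int code, const std::string& msg)
      : std::runtime_error(msg)
      , code_(code)
    {
    }
    int code() const noexcept { return code_; }

private:
    int code_;
};

// Throws mpi_error if rc signals failure.
// Assumption: rc is the return value of an MPI call made on a communicator
// whose error handler is MPI_ERRORS_RETURN, so the call returns normally.
inline void check_mpi(int rc, const char* call)
{
    if (rc != 0)    // 0 is MPI_SUCCESS
    {
        throw mpi_error(rc,
            std::string(call) + " failed with MPI error code " + std::to_string(rc));
    }
}

// Intended use with a real MPI program (shown as comments, not compiled here):
//
//   MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
//   check_mpi(MPI_Bcast(data, count, MPI_INT, 0, MPI_COMM_WORLD), "MPI_Bcast");
```

Because the exception is thrown only after the MPI call has returned, no library-internal frames are unwound and the library has had the chance to release any locks it took.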

biddisco commented 1 month ago

OK, thanks again. It isn't a huge showstopper if we can't throw exceptions. I will close this issue and we'll change our error handling to do something else when using MPICH.