hpc / ompi

Open MPI main development repository

sessions: attributes subsystem teardown needs refactoring #18

Closed hppritcha closed 5 years ago

hppritcha commented 5 years ago

The current scheme for tearing down the attributes subsystem with the new opal cleanup infrastructure doesn't work. Other subsystems may be more independent of one another, but the attributes subsystem has dependent subsystems that segfault when their cleanup functions are invoked.

Test case

#include <stdio.h>
#include <unistd.h>
#include "mpi.h"

static int delete_fn_calls = 0;
static int delete_key1(MPI_Comm comm, int key, void *value, void *extra_state)
{
    fprintf(stderr, "inside delete_key1 %d\n", delete_fn_calls);
    delete_fn_calls++;

    return MPI_SUCCESS;
}

int main(int argc, char* argv[])
{
    int rank, size, len, key;
    int *val, backend=12345;

    char version[MPI_MAX_LIBRARY_VERSION_STRING];

    MPI_Init(&argc, &argv);

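    /* %p expects a void *, so cast the function pointer explicitly. */
    fprintf(stderr, "delete_key1 is at %p\n", (void *)delete_key1);
    /* Create a keyval with no copy callback and delete_key1 as the delete callback. */
    MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, delete_key1, &key, 0);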
    val = &backend;
    MPI_Comm_set_attr(MPI_COMM_WORLD, key, val);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_library_version(version, &len);
    fprintf(stderr, "Hello, world, I am %d of %d, (%s, %d)\n",
           rank, size, version, len);
    MPI_Comm_free_keyval(&key);
    fprintf(stderr, "freed key \n");
    sleep(5);
    MPI_Finalize();

    return 0;
}

Running it with a debug statement added to OMPI shows:

delete_key1 is at 0x55b0c0722b9a
Hello, world, I am 0 of 1, (Open MPI v4.1.0a1, package: Open MPI hpp@ubuntu Distribution, ident: 4.1.0a1, repo rev: v2.x-dev-6895-g4056e930, Unreleased developer copy, 139)
freed key 
cleaning up ompi_mpiext_fini
cleaning up ompi_mpi_instance_cleanup_pml
cleaning up ompi_attr_finalize
cleaning up ompi_win_finalize
cleaning up ompi_file_finalize
cleaning up ompi_comm_finalize
[ubuntu:82901] *** Process received signal ***
[ubuntu:82901] Signal: Segmentation fault (11)
[ubuntu:82901] Signal code: Address not mapped (1)
[ubuntu:82901] Failing at address: 0x30
[ubuntu:82901] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3ef20)[0x7fbe6c270f20]
[ubuntu:82901] [ 1] /home/hpp/sessions_install/lib/libopen-pal.so.0(opal_hash_table_get_value_uint32+0x17)[0x7fbe6bbf665b]
[ubuntu:82901] [ 2] /home/hpp/sessions_install/lib/libmpi.so.0(+0x2f044)[0x7fbe6c652044]
[ubuntu:82901] [ 3] /home/hpp/sessions_install/lib/libmpi.so.0(ompi_attr_delete_all+0x1a0)[0x7fbe6c6529c1]
[ubuntu:82901] [ 4] /home/hpp/sessions_install/lib/libmpi.so.0(+0x32c25)[0x7fbe6c655c25]
[ubuntu:82901] [ 5] /home/hpp/sessions_install/lib/libopen-pal.so.0(opal_finalize_cleanup_domain+0x64)[0x7fbe6bc06132]
[ubuntu:82901] [ 6] /home/hpp/sessions_install/lib/libopen-pal.so.0(opal_finalize+0x56)[0x7fbe6bc0638c]
[ubuntu:82901] [ 7] /home/hpp/sessions_install/lib/libopen-rte.so.0(orte_finalize+0x1ba)[0x7fbe6bf343ba]
[ubuntu:82901] [ 8] /home/hpp/sessions_install/lib/libmpi.so.0(+0x75bbe)[0x7fbe6c698bbe]
[ubuntu:82901] [ 9] /home/hpp/sessions_install/lib/libmpi.so.0(ompi_mpi_instance_finalize+0x125)[0x7fbe6c698ed1]
[ubuntu:82901] [10] /home/hpp/sessions_install/lib/libmpi.so.0(ompi_mpi_finalize+0x460)[0x7fbe6c68b522]
[ubuntu:82901] [11] /home/hpp/sessions_install/lib/libmpi.so.0(PMPI_Finalize+0x61)[0x7fbe6c6d2288]
[ubuntu:82901] [12] ./attr(+0xd5f)[0x55b0c0722d5f]
[ubuntu:82901] [13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7fbe6c253b97]
[ubuntu:82901] [14] ./attr(+0xaba)[0x55b0c0722aba]
[ubuntu:82901] *** End of error message ***

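The comm cleanup code segfaults while trying to clean up the attributes hash associated with the world communicator. As the debug output shows, ompi_attr_finalize runs before ompi_comm_finalize, so the attribute state is already gone by the time the communicator is finalized. The hash_keyval, for example, is NULL by this point.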

The attribute framework needs to be treated as a dependency of the win, comm, and type subsystems, so that it is torn down only after they have finished their own cleanup.
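To illustrate the ordering being asked for, here is a minimal sketch of one way to get dependency-respecting teardown: cleanup callbacks are pushed as subsystems initialize and run in reverse (LIFO) order, so a dependent subsystem is always cleaned up before the thing it depends on. This is not the opal cleanup infrastructure or the fix in #19. Every name here (demo_register_cleanup, demo_attr_init, demo_comm_init, and so on) is hypothetical, and the "attribute table" is just a flag standing in for the real hash.

```c
#include <stddef.h>

#define DEMO_MAX_CLEANUPS 8

/* Hypothetical subsystem ids, used only to record teardown order. */
enum demo_subsys { DEMO_ATTR = 1, DEMO_COMM = 2 };

typedef void (*demo_cleanup_fn_t)(void);

/* Stack of registered cleanup callbacks (hypothetical, not an OPAL API). */
demo_cleanup_fn_t demo_cleanup_stack[DEMO_MAX_CLEANUPS];
size_t demo_cleanup_count = 0;

/* Order in which cleanups actually ran. */
int demo_teardown_order[DEMO_MAX_CLEANUPS];
size_t demo_teardown_count = 0;

/* Stand-in for the attribute keyval hash: 1 while usable, 0 once freed. */
int demo_attr_table_alive = 0;
/* Set if comm cleanup ran after the attribute table was already gone. */
int demo_comm_saw_dead_attr = 0;

int demo_register_cleanup(demo_cleanup_fn_t fn)
{
    if (demo_cleanup_count >= DEMO_MAX_CLEANUPS) {
        return -1;
    }
    demo_cleanup_stack[demo_cleanup_count++] = fn;
    return 0;
}

void demo_attr_cleanup(void)
{
    demo_attr_table_alive = 0;
    demo_teardown_order[demo_teardown_count++] = DEMO_ATTR;
}

void demo_comm_cleanup(void)
{
    /* Comm cleanup must delete attributes, so it needs the table alive.
     * In the real failure this is where the NULL dereference happens. */
    if (!demo_attr_table_alive) {
        demo_comm_saw_dead_attr = 1;
    }
    demo_teardown_order[demo_teardown_count++] = DEMO_COMM;
}

/* The dependency (attr) initializes and registers first ... */
int demo_attr_init(void)
{
    demo_attr_table_alive = 1;
    return demo_register_cleanup(demo_attr_cleanup);
}

/* ... and the dependent subsystem (comm) registers after it. */
int demo_comm_init(void)
{
    return demo_register_cleanup(demo_comm_cleanup);
}

/* Run callbacks in reverse registration order: dependents first. */
void demo_run_cleanups(void)
{
    while (demo_cleanup_count > 0) {
        demo_cleanup_stack[--demo_cleanup_count]();
    }
}

/* Returns 0 if comm cleanup found the attribute table still alive. */
int demo_teardown(void)
{
    demo_teardown_count = 0;
    demo_comm_saw_dead_attr = 0;
    if (demo_attr_init() != 0 || demo_comm_init() != 0) {
        return -1;
    }
    demo_run_cleanups();
    return demo_comm_saw_dead_attr;
}
```

Calling demo_teardown() runs the comm cleanup before the attr cleanup, which is the reverse of the order seen in the debug output above. A real fix could equally well express the dependency explicitly instead of relying on registration order.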

hppritcha commented 5 years ago

Closed via #19