pmodels / mpich

Official MPICH Repository
http://www.mpich.org

"double free or corruption" observed running MPICH with Intel Compiler and IntelPython in PATH #6666

Open louspe-linaro opened 1 year ago

louspe-linaro commented 1 year ago

Building MPICH after loading the Intel environment with IntelPython activated results in a "double free or corruption" abort when running some examples. I observed this with both C and Fortran.

To set up the Intel 2023.2.0 environment:

source <path_to_intel_install>/compiler/2023.2.0/env/vars.sh
source <path_to_intel_install>/intelpython/python3.9/env/vars.sh

I am deliberately not using <path_to_intel_install>/setvars.sh because it would load Intel MPI.

The IntelPython environment puts libfabric in your PATH, and MPICH detects it when building. Configuring with or without --with-libfabric=embedded made no difference.

In both cases, some applications will report a "double free or corruption" at exit.
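
To see which environment script actually brings a libfabric into scope, one option is to scan the search path for directories that contain a libfabric shared object. This is a minimal sketch, not part of the original report: the function name find_libfabric_dirs is hypothetical, and it assumes the relevant variable is LD_LIBRARY_PATH (the report only says "PATH"), so pass a different colon-separated list if needed.

```shell
# Hypothetical helper: print each directory in a colon-separated search path
# that contains a libfabric shared object.
# Assumption: defaults to LD_LIBRARY_PATH when no argument is given.
# The body runs in a subshell so the IFS change does not leak to the caller.
find_libfabric_dirs() (
    search_path="${1-${LD_LIBRARY_PATH-}}"
    IFS=:
    for dir in $search_path; do
        # Skip empty entries produced by leading, trailing or doubled colons.
        [ -n "$dir" ] || continue
        for lib in "$dir"/libfabric.so*; do
            if [ -e "$lib" ]; then
                printf '%s\n' "$dir"
                break
            fi
        done
    done
)
```

Running find_libfabric_dirs once after sourcing each vars.sh script would show which of them adds a libfabric directory.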

Reproducer:

 $ cat simple.c 
#include <string.h>
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

int my_rank;
int processes;

void func2()
{
    int rank = my_rank;
    int total = processes;
    char message[100];

    sprintf(message, "Greetings from process %d!", rank);
    memset(message, 0, 100);
}

void func1()
{
    MPI_Barrier(MPI_COMM_WORLD); // make staggered stop more likely
    sleep(my_rank);              // force staggered stop
    func2();
}

int main(int argc, char** argv)
{
    int numbers[1000];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &processes);

    if (my_rank == 0) 
    {
        fprintf(stdout, "Hello World!\n");
        fprintf(stderr, "Upps, something is wrong!\n");
        fprintf(stdout, "Number of arguments %d and first is %s\n", argc, argv[0]);
    }

    int i;
    for (i=0; i < 1000; i++)
    {
        numbers[i] = i;
    }

    sleep(5); 

    func1();

    MPI_Finalize();

    return 0;
}

MPICH Compilation:

 $ mpirun --version
HYDRA build details:
    Version:                                 4.0.3
    Release Date:                            Tue Nov  8 09:51:06 CST 2022
    CC:                              icc   -m64  -m64 
    Configure options:                       '--disable-option-checking' '--prefix=/home/louspe01/.conan/data/mpich/4.0.3/louise/test/package/8ad367ce318146e7032d51425103be5a0064d2ca' '--enable-debug' '--enable-debuginfo' '--enable-shared' 'F90=' '--bindir=${prefix}/bin' '--sbindir=${prefix}/bin' '--libexecdir=${prefix}/bin' '--libdir=${prefix}/lib' '--includedir=${prefix}/include' '--oldincludedir=${prefix}/include' '--datarootdir=${prefix}/share' 'CC=icc' 'CFLAGS=-m64 ' 'LDFLAGS=-m64' 'LIBS=' 'CPPFLAGS= ' 'CXX=icpc' 'CXXFLAGS=-m64 ' 'FC=ifort' 'F77=ifort' '--cache-file=/dev/null' '--srcdir=/home/louspe01/.conan/data/mpich/4.0.3/louise/test/source/mpich-4.0.3/src/pm/hydra'
    Process Manager:                         pmi
    Launchers available:                     ssh rsh fork slurm ll lsf sge manual persist
    Topology libraries available:            hwloc
    Resource management kernels available:   user slurm ll lsf sge pbs cobalt
    Demux engines available:                 poll select

Compile reproducer:

mpicc -g simple.c -o simple

Running Reproducer:

 $ ./simple 
Hello World!
Upps, something is wrong!
Number of arguments 1 and first is ./simple
double free or corruption (!prev)
Aborted (core dumped)

Backtrace for the error:

$ gdb ./simple
...
(gdb) r
Starting program: /home/louspe01/repo/forge/test/ddtscripts/base/ddt/offline/simple 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff05ff640 (LWP 1662544)]
[New Thread 0x7fffefdfe640 (LWP 1662545)]
Hello World!
Upps, something is wrong!
Number of arguments 1 and first is /home/louspe01/repo/forge/test/ddtscripts/base/ddt/offline/simple
[Thread 0x7ffff05ff640 (LWP 1662544) exited]
[Thread 0x7fffefdfe640 (LWP 1662545) exited]
double free or corruption (!prev)

Thread 1 "simple" received signal SIGABRT, Aborted.
__pthread_kill_implementation (no_tid=0, signo=6, threadid=140737352271680)
    at ./nptl/pthread_kill.c:44
44  ./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140737352271680)
    at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140737352271680) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140737352271680, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007ffff3642476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007ffff36287f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007ffff36896f6 in __libc_message
    (action=action@entry=do_abort, fmt=fmt@entry=0x7ffff37dbb8c "%s\n")
    at ../sysdeps/posix/libc_fatal.c:155
#6  0x00007ffff36a0d7c in malloc_printerr
    (str=str@entry=0x7ffff37de7d0 "double free or corruption (!prev)") at ./malloc/malloc.c:5664
#7  0x00007ffff36a2efc in _int_free
    (av=0x7ffff3819c80 <main_arena>, p=0x47f850, have_lock=<optimized out>)
    at ./malloc/malloc.c:4591
#8  0x00007ffff36a54d3 in __GI___libc_free (mem=<optimized out>) at ./malloc/malloc.c:3391
#9  0x00007ffff0e09571 in ofi_cleanup_prov ()
    at /home/louspe01/.conan/data/intel_installation/2023.2.0/louise/test/package/4f459a94dbd4c62b669f92843b0daa45ca1e3751/mpi/2021.10.0//libfabric/lib/libfabric.so.1
#10 0x00007ffff0e08dcf in fi_fini ()
    at /home/louspe01/.conan/data/intel_installation/2023.2.0/louise/test/package/4f459a94dbd4c62b669f92843b0daa45ca1e3751/mpi/2021.10.0//libfabric/lib/libfabric.so.1
#11 0x00007ffff7fc924e in _dl_fini () at ./elf/dl-fini.c:142
#12 0x00007ffff3645495 in __run_exit_handlers
    (status=0, listp=0x7ffff3819838 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at ./stdlib/exit.c:113
#13 0x00007ffff3645610 in __GI_exit (status=<optimized out>) at ./stdlib/exit.c:143
#14 0x00007ffff3629d97 in __libc_start_call_main
    (main=main@entry=0x40261e <main>, argc=argc@entry=1, argv=argv@entry=0x7fffffffa848)
    at ../sysdeps/nptl/libc_start_call_main.h:74
#15 0x00007ffff3629e40 in __libc_start_main_impl
    (main=0x40261e <main>, argc=1, argv=0x7fffffffa848, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffa838) at ../csu/libc-start.c:392
#16 0x00000000004024c5 in _start ()
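
Frames #9 and #10 resolve to a libfabric.so.1 under an mpi/2021.10.0 directory of the Intel installation, and the free is reached from _dl_fini, that is, from a library finalizer that runs after main returns. When comparing several backtraces, it can help to pull out just the shared-library paths. The sketch below assumes the backtrace was saved to a text file and that grep supports -o and -E (GNU and BSD grep do); the function name backtrace_libs is hypothetical.

```shell
# Hypothetical helper: read a gdb backtrace on stdin and print the unique
# shared-library paths mentioned in it (anything that looks like /path/name.so
# with optional numeric version suffixes).
backtrace_libs() {
    grep -oE '/[^[:space:]]*\.so(\.[0-9]+)*' | sort -u
}
```

For example, backtrace_libs < bt.txt, where bt.txt is an assumed file name holding the output of bt.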

hzhou commented 1 year ago

We used to see this because of the psm3 provider in libfabric. I believe it has been fixed for a while now. Could you try building the latest version of libfabric, or building the current MPICH from https://github.com/pmodels/mpich with the embedded libfabric?
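
After rebuilding, a quick check is whether the binary still resolves an external libfabric at link time. The sketch below filters ldd output for that; it assumes the usual Linux ldd line format (name => path (address)), and the function name resolved_libfabric is hypothetical. Note that ldd only reports link-time dependencies, so a libfabric loaded later through dlopen would not show up here.

```shell
# Hypothetical helper: read `ldd` output on stdin and print the path that any
# libfabric dependency resolves to. Prints nothing if there is no such line
# or if the library is reported as "not found".
resolved_libfabric() {
    awk '/libfabric\.so/ && $2 == "=>" && $3 ~ /^\// { print $3 }'
}
```

Usage would be along the lines of ldd ./simple | resolved_libfabric, with empty output suggesting that no external libfabric is linked directly.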