sourceryinstitute / OpenCoarrays

A parallel application binary interface for Fortran 2018 compilers.
http://www.opencoarrays.org
BSD 3-Clause "New" or "Revised" License

Defect: some MPI implementations crash on shutdown when coarrays with allocatable components are used #762

Closed everythingfunctional closed 2 years ago

everythingfunctional commented 2 years ago

System information including:

To help us debug your issue please explain:

What you were trying to do (and why)

Compile and execute the following program.

program hello_coarrays
    implicit none
    type :: array_type
        integer, allocatable :: values(:)
    end type
    type(array_type) :: array[*]
    allocate(array%values(2), source=0)
    array%values = this_image()
    print *, array%values
end program

What happened (include command output, screenshots, logs, etc.)

Homebrew install on macOS

[Brads-MacBook-Pro:~/tmp/hello_coarrays] which caf
/Users/brad/Repositories/github/sourceryinstitute/OpenCoarrays/prerequisites/installations//opencoarrays/2.10.0/bin/caf
[Brads-MacBook-Pro:~/tmp/hello_coarrays] caf --version

OpenCoarrays Coarray Fortran Compiler Wrapper (caf version 2.10.0-11-gdfde1b9)
Copyright (C) 2015-2022 Sourcery Institute
Copyright (C) 2015-2022 Archaeologic Inc.

OpenCoarrays comes with NO WARRANTY, to the extent permitted by law.
You may redistribute copies of OpenCoarrays under the terms of the
BSD 3-Clause License.  For more information about these matters, see
the file named LICENSE that is distributed with OpenCoarrays.

[Brads-MacBook-Pro:~/tmp/hello_coarrays] caf hello_coarrays.f90 -o hello_coarrays
ld: warning: dylib (/usr/local/Cellar/gcc/11.3.0_2/lib/gcc/11/libgfortran.dylib) was built for newer macOS version (12.4) than being linked (12.3)
ld: warning: dylib (/usr/local/Cellar/gcc/11.3.0_2/lib/gcc/11/libquadmath.dylib) was built for newer macOS version (12.4) than being linked (12.3)
[Brads-MacBook-Pro:~/tmp/hello_coarrays] cafrun -n 4 ./hello_coarrays
           1           1
           1           1
           1           1
           1           1

Compiled from source and linked to MPICH compiled from source on MacOS

[Brads-MacBook-Pro:~/tmp/hello_coarrays] /usr/local/bin/caf --version

OpenCoarrays Coarray Fortran Compiler Wrapper (caf version 2.10.0)
Copyright (C) 2015-2022 Sourcery Institute
Copyright (C) 2015-2022 Archaeologic Inc.

OpenCoarrays comes with NO WARRANTY, to the extent permitted by law.
You may redistribute copies of OpenCoarrays under the terms of the
BSD 3-Clause License.  For more information about these matters, see
the file named LICENSE that is distributed with OpenCoarrays.

[Brads-MacBook-Pro:~/tmp/hello_coarrays] /usr/local/bin/caf hello_coarrays.f90 -o hello_coarrays
ld: warning: directory not found for option '-L/usr/local/Cellar/open-mpi/4.1.3/lib'
ld: warning: dylib (/usr/local/Cellar/gcc/11.3.0_2/lib/gcc/11/libgfortran.dylib) was built for newer macOS version (12.4) than being linked (12.3)
ld: warning: dylib (/usr/local/Cellar/gcc/11.3.0_2/lib/gcc/11/libquadmath.dylib) was built for newer macOS version (12.4) than being linked (12.3)
[Brads-MacBook-Pro:~/tmp/hello_coarrays] /usr/local/bin/cafrun -n 4 ./hello_coarrays
           3           3
           4           4
           1           1
           2           2
[Brads-MBP:18994] *** An error occurred in MPI_Win_detach
[Brads-MBP:18994] *** reported by process [238747649,1]
[Brads-MBP:18994] *** on win rdma window 5
[Brads-MBP:18994] *** MPI_ERR_UNKNOWN: unknown error
[Brads-MBP:18994] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[Brads-MBP:18994] ***    and potentially your MPI job)
[Brads-MBP.tx.rr.com:18992] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[Brads-MBP.tx.rr.com:18992] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
Error: Command:
   `/usr/local/bin/mpiexec -n 4 ./hello_coarrays`
failed to run.

Compiled from source on Windows and linked with Intel oneAPI MPI

C:\Users\brad\Repositories\GitHub\sourceryinstitute>opencoarrays-install\bin\caf --version

C:\Users\brad\Repositories\GitHub\sourceryinstitute>"C:/Program Files/Git/usr/bin/bash.exe" "C:\Users\brad\Repositories\GitHub\sourceryinstitute\opencoarrays-install\bin\caf" --version

OpenCoarrays Coarray Fortran Compiler Wrapper (caf version 2.10.0-11-gdfde1b9)
Copyright (C) 2015-2022 Sourcery Institute
Copyright (C) 2015-2022 Archaeologic Inc.

OpenCoarrays comes with NO WARRANTY, to the extent permitted by law.
You may redistribute copies of OpenCoarrays under the terms of the
BSD 3-Clause License.  For more information about these matters, see
the file named LICENSE that is distributed with OpenCoarrays.

C:\Users\brad\Repositories\GitHub\sourceryinstitute>vim hello_coarrays.f90

C:\Users\brad\Repositories\GitHub\sourceryinstitute>opencoarrays-install\bin\caf hello_coarrays.f90 -o hello_coarrays

C:\Users\brad\Repositories\GitHub\sourceryinstitute>"C:/Program Files/Git/usr/bin/bash.exe" "C:\Users\brad\Repositories\GitHub\sourceryinstitute\opencoarrays-install\bin\caf" hello_coarrays.f90 -o hello_coarrays

C:\Users\brad\Repositories\GitHub\sourceryinstitute>opencoarrays-install\bin\cafrun -n 4 .\hello_coarrays

C:\Users\brad\Repositories\GitHub\sourceryinstitute>"C:/Program Files/Git/usr/bin/bash.exe" "C:\Users\brad\Repositories\GitHub\sourceryinstitute\opencoarrays-install\bin\cafrun" -n 4 .\hello_coarrays
           4           4
           3           3
           1           1
           2           2

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x8ab1344a
#1  0x8ab09303
#2  0x8aaea201
#3  0x4a7c7ff7
#4  0x4c39209e
#5  0x4c341453
#6  0x4c390bcd
#7  0x4c396c3a
#8  0x4c3147b0
#9  0x4a7b9c9b
#10  0x8aad50b5
#11  0x8aac192d
#12  0x8aac13bd
#13  0x8aac14f5
#14  0x4b197033
#15  0x4c342650
#16  0xffffffff

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 1 PID 2052 RUNNING AT BRADRICHARD5FC1
=   EXIT STATUS: -1 (ffffffff)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 2 PID 5516 RUNNING AT BRADRICHARD5FC1
=   EXIT STATUS: -1 (ffffffff)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 3 PID 5048 RUNNING AT BRADRICHARD5FC1
=   EXIT STATUS: -1 (ffffffff)
===================================================================================
Error: Command:
   `C:/Program Files (x86)/Intel/oneAPI/mpi/latest/bin/mpiexec.exe -n 4 .\hello_coarrays`
failed to run.

Compiled from source on Linux and linked to system openmpi

(base) [pop-os:~/tmp/hello_coarrays] caf --version                           

OpenCoarrays Coarray Fortran Compiler Wrapper (caf version 2.10.0-11-gdfde1b9)
Copyright (C) 2015-2022 Sourcery Institute
Copyright (C) 2015-2022 Archaeologic Inc.

OpenCoarrays comes with NO WARRANTY, to the extent permitted by law.
You may redistribute copies of OpenCoarrays under the terms of the
BSD 3-Clause License.  For more information about these matters, see
the file named LICENSE that is distributed with OpenCoarrays.

(base) [pop-os:~/tmp/hello_coarrays] caf --show
/usr/bin/gfortran -I/home/brad/Repositories/GitHub/sourceryinstitute/OpenCoarrays/prerequisites/installations/opencoarrays/2.10.0/include/OpenCoarrays-2.10.0-11-gdfde1b9_GNU-12.0.1 -fcoarray=lib -L/usr/lib/x86_64-linux-gnu/openmpi/lib/fortran/gfortran ${@} /home/brad/Repositories/GitHub/sourceryinstitute/OpenCoarrays/prerequisites/installations/opencoarrays/2.10.0/lib/libcaf_mpi.a /usr/lib/x86_64-linux-gnu/libmpi_usempif08.so /usr/lib/x86_64-linux-gnu/libmpi_usempi_ignore_tkr.so /usr/lib/x86_64-linux-gnu/libmpi_mpifh.so /usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi.so /usr/lib/x86_64-linux-gnu/libopen-rte.so /usr/lib/x86_64-linux-gnu/libopen-pal.so /usr/lib/x86_64-linux-gnu/libhwloc.so /usr/lib/x86_64-linux-gnu/libevent_core.so /usr/lib/x86_64-linux-gnu/libevent_pthreads.so /usr/lib/x86_64-linux-gnu/libm.so /usr/lib/x86_64-linux-gnu/libz.so
(base) [pop-os:~/tmp/hello_coarrays] caf hello_coarrays.f90 -o hello_coarrays
(base) [pop-os:~/tmp/hello_coarrays] cafrun -n 4 ./hello_coarrays            
           1           1
           2           2
           3           3
           4           4
[pop-os:84252] *** An error occurred in MPI_Win_detach
[pop-os:84252] *** reported by process [2910650369,0]
[pop-os:84252] *** on win rdma window 5
[pop-os:84252] *** MPI_ERR_UNKNOWN: unknown error
[pop-os:84252] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[pop-os:84252] ***    and potentially your MPI job)
[pop-os:84243] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[pop-os:84243] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
Error: Command:
   `/usr/bin/mpiexec -n 4 ./hello_coarrays`
failed to run.

Compiled from source and linked to MPICH compiled from source on Linux

(base) [pop-os:~/tmp/hello_coarrays] caf --version

OpenCoarrays Coarray Fortran Compiler Wrapper (caf version 2.10.0-11-gdfde1b9)
Copyright (C) 2015-2022 Sourcery Institute
Copyright (C) 2015-2022 Archaeologic Inc.

OpenCoarrays comes with NO WARRANTY, to the extent permitted by law.
You may redistribute copies of OpenCoarrays under the terms of the
BSD 3-Clause License.  For more information about these matters, see
the file named LICENSE that is distributed with OpenCoarrays.

(base) [pop-os:~/tmp/hello_coarrays] caf --show
/usr/bin/gfortran -I/home/brad/Repositories/GitHub/sourceryinstitute/OpenCoarrays/prerequisites/installations/opencoarrays/2.10.0/include/OpenCoarrays-2.10.0-11-gdfde1b9_GNU-12.0.1 -fcoarray=lib -Wl,-rpath -Wl,/home/brad/Repositories/GitHub/sourceryinstitute/OpenCoarrays/prerequisites/installations/lib -Wl,--enable-new-dtags ${@} /home/brad/Repositories/GitHub/sourceryinstitute/OpenCoarrays/prerequisites/installations/opencoarrays/2.10.0/lib/libcaf_mpi.a /home/brad/Repositories/GitHub/sourceryinstitute/OpenCoarrays/prerequisites/installations/lib/libmpifort.so /home/brad/Repositories/GitHub/sourceryinstitute/OpenCoarrays/prerequisites/installations/lib/libmpi.so
(base) [pop-os:~/tmp/hello_coarrays] caf hello_coarrays.f90 -o hello_coarrays
(base) [pop-os:~/tmp/hello_coarrays] cafrun -n 4 ./hello_coarrays
           1           1
           1           1
           1           1
           1           1

Compiled from source on Linux and linked with Intel oneAPI MPI

[pop-os:~/tmp/hello_coarrays] caf --show
/usr/bin/gfortran -I/home/brad/Repositories/GitHub/sourceryinstitute/OpenCoarrays/prerequisites/installations/opencoarrays/2.10.0/include/OpenCoarrays-2.10.0-11-gdfde1b9_GNU-12.0.1 -fcoarray=lib -Xlinker --enable-new-dtags -Xlinker -rpath -Xlinker /opt/intel/oneapi/mpi/2021.6.0/lib/release -Xlinker -rpath -Xlinker /opt/intel/oneapi/mpi/2021.6.0/lib -Xlinker --enable-new-dtags ${@} /home/brad/Repositories/GitHub/sourceryinstitute/OpenCoarrays/prerequisites/installations/opencoarrays/2.10.0/lib/libcaf_mpi.a /opt/intel/oneapi/mpi/2021.6.0/lib/libmpifort.so /opt/intel/oneapi/mpi/2021.6.0/lib/release/libmpi.so /usr/lib/x86_64-linux-gnu/librt.a /usr/lib/x86_64-linux-gnu/libpthread.a /usr/lib/x86_64-linux-gnu/libdl.a
[pop-os:~/tmp/hello_coarrays] caf hello_coarrays.f90 -o hello_coarrays
[pop-os:~/tmp/hello_coarrays] cafrun -n 4 ./hello_coarrays
           1           1
           2           2
           3           3
           4           4
double free or corruption (out)
free(): invalid pointer

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
free(): invalid pointer

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
free(): invalid pointer

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
#0  0x7f512fd47ae0 in ???
#1  0x7f512fd46c45 in ???
#2  0x7f512fb3e51f in ???
    at ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
#0  0x7ff00df47ae0 in ???
#1  0x7ff00df46c45 in ???
#0  0x7f4979f47ae0 in ???
#1  0x7f4979f46c45 in ???
#0  0x7f5023547ae0 in ???
#1  0x7f5023546c45 in ???
#2  0x7f4979d3e51f in ???
    at ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
#2  0x7ff00dd3e51f in ???
    at ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
#2  0x7f502333e51f in ???
    at ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
#3  0x7f512fb92a7c in __pthread_kill_implementation
    at ./nptl/pthread_kill.c:44
#4  0x7f512fb92a7c in __pthread_kill_internal
    at ./nptl/pthread_kill.c:78
#5  0x7f512fb92a7c in __GI___pthread_kill
    at ./nptl/pthread_kill.c:89
#6  0x7f512fb3e475 in __GI_raise
    at ../sysdeps/posix/raise.c:26
#7  0x7f512fb247f2 in __GI_abort
    at ./stdlib/abort.c:79
#3  0x7ff00dd92a7c in __pthread_kill_implementation
    at ./nptl/pthread_kill.c:44
#4  0x7ff00dd92a7c in __pthread_kill_internal
#3  0x7f4979d92a7c in __pthread_kill_implementation
    at ./nptl/pthread_kill.c:44
#4  0x7f4979d92a7c in __pthread_kill_internal
    at ./nptl/pthread_kill.c:78
#5  0x7ff00dd92a7c in __GI___pthread_kill
    at ./nptl/pthread_kill.c:89
    at ./nptl/pthread_kill.c:78
#5  0x7f4979d92a7c in __GI___pthread_kill
    at ./nptl/pthread_kill.c:89
#6  0x7f4979d3e475 in __GI_raise
    at ../sysdeps/posix/raise.c:26
#6  0x7ff00dd3e475 in __GI_raise
    at ../sysdeps/posix/raise.c:26
#3  0x7f5023392a7c in __pthread_kill_implementation
    at ./nptl/pthread_kill.c:44
#4  0x7f5023392a7c in __pthread_kill_internal
    at ./nptl/pthread_kill.c:78
#5  0x7f5023392a7c in __GI___pthread_kill
    at ./nptl/pthread_kill.c:89
#6  0x7f502333e475 in __GI_raise
    at ../sysdeps/posix/raise.c:26
#7  0x7f4979d247f2 in __GI_abort
    at ./stdlib/abort.c:79
#7  0x7ff00dd247f2 in __GI_abort
    at ./stdlib/abort.c:79
#8  0x7f512fb856f5 in __libc_message
    at ../sysdeps/posix/libc_fatal.c:155
#7  0x7f50233247f2 in __GI_abort
    at ./stdlib/abort.c:79
#8  0x7f4979d856f5 in __libc_message
    at ../sysdeps/posix/libc_fatal.c:155
#8  0x7ff00dd856f5 in __libc_message
    at ../sysdeps/posix/libc_fatal.c:155
#8  0x7f50233856f5 in __libc_message
    at ../sysdeps/posix/libc_fatal.c:155
#9  0x7f512fb9cd7b in malloc_printerr
    at ./malloc/malloc.c:5664
#10  0x7f512fb9eeef in _int_free
    at ./malloc/malloc.c:4588
#11  0x7f512fba14d2 in __GI___libc_free
    at ./malloc/malloc.c:3391
#12  0x555a6c656c68 in ???
#13  0x555a6c64fa48 in ???
#9  0x7f502339cd7b in malloc_printerr
    at ./malloc/malloc.c:5664
#10  0x7f502339eac3 in _int_free
    at ./malloc/malloc.c:4439
#9  0x7f4979d9cd7b in malloc_printerr
    at ./malloc/malloc.c:5664
#10  0x7f4979d9eac3 in _int_free
    at ./malloc/malloc.c:4439
#11  0x7f4979da14d2 in __GI___libc_free
    at ./malloc/malloc.c:3391
#12  0x5642223bcc68 in ???
#13  0x5642223b5a48 in ???
#11  0x7f50233a14d2 in __GI___libc_free
    at ./malloc/malloc.c:3391
#12  0x556567f1dc68 in ???
#13  0x556567f16a48 in ???
#9  0x7ff00dd9cd7b in malloc_printerr
    at ./malloc/malloc.c:5664
#10  0x7ff00dd9eac3 in _int_free
    at ./malloc/malloc.c:4439
#11  0x7ff00dda14d2 in __GI___libc_free
    at ./malloc/malloc.c:3391
#12  0x5607988c3c68 in ???
#13  0x5607988bca48 in ???
#14  0x7f512fb25d8f in __libc_start_call_main
    at ../sysdeps/nptl/libc_start_call_main.h:58
#15  0x7f512fb25e3f in __libc_start_main_impl
    at ../csu/libc-start.c:392
#16  0x555a6c64f584 in ???
#17  0xffffffffffffffff in ???
#14  0x7f4979d25d8f in __libc_start_call_main
    at ../sysdeps/nptl/libc_start_call_main.h:58
#15  0x7f4979d25e3f in __libc_start_main_impl
    at ../csu/libc-start.c:392
#16  0x5642223b5584 in ???
#17  0xffffffffffffffff in ???
#14  0x7f5023325d8f in __libc_start_call_main
    at ../sysdeps/nptl/libc_start_call_main.h:58
#15  0x7f5023325e3f in __libc_start_main_impl
    at ../csu/libc-start.c:392
#16  0x556567f16584 in ???
#17  0xffffffffffffffff in ???
#14  0x7ff00dd25d8f in __libc_start_call_main
    at ../sysdeps/nptl/libc_start_call_main.h:58
#15  0x7ff00dd25e3f in __libc_start_main_impl
    at ../csu/libc-start.c:392
#16  0x5607988bc584 in ???
#17  0xffffffffffffffff in ???

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 224054 RUNNING AT pop-os
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 1 PID 224055 RUNNING AT pop-os
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 2 PID 224056 RUNNING AT pop-os
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 3 PID 224057 RUNNING AT pop-os
=   KILLED BY SIGNAL: 6 (Aborted)
===================================================================================
Error: Command:
   `/opt/intel/oneapi/mpi/2021.6.0/bin/mpiexec -n 4 ./hello_coarrays`
failed to run.

What you expected to happen

Ideally, the choice of MPI implementation shouldn't affect whether a crash occurs.

Step-by-step reproduction instructions to reproduce the error/bug

Link against and run with an MPI implementation other than MPICH.

everythingfunctional commented 2 years ago

Note: this is part of the work to identify the root cause of #626.

everythingfunctional commented 2 years ago

I'll also note that while MPICH doesn't crash, it doesn't seem to provide the right answers. And while the other MPI implementations crash on program termination, they all get the right answers.

vehre commented 2 years ago

On Linux, neither mpich nor openmpi crashes or prints wrong results any longer with #763. Please decide whether you want the useless test case added (it cannot test for the openmpi crash, because the test framework does not consider the return code of the test).
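
To illustrate the point about return codes, here is a minimal sketch of what a return-code-aware check could look like. It is not how the OpenCoarrays test framework works; `classify_run` and the `run_result` enum are hypothetical names, and the sketch assumes a POSIX system where `system()` and the `<sys/wait.h>` macros are available.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/wait.h>

/* Outcome of running a command through the shell. */
typedef enum {
    RUN_OK,             /* exited with status 0 */
    RUN_NONZERO_EXIT,   /* exited with a nonzero status */
    RUN_SIGNALED,       /* terminated by a signal (e.g. SIGSEGV, SIGABRT) */
    RUN_LAUNCH_FAILED   /* the command could not be started or classified */
} run_result;

/* Hypothetical helper: run `cmd` via system() and classify how it ended,
 * instead of only inspecting what it printed. */
run_result classify_run(const char *cmd)
{
    assert(cmd != NULL);

    int status = system(cmd);
    if (status == -1)
        return RUN_LAUNCH_FAILED;
    if (WIFSIGNALED(status))
        return RUN_SIGNALED;
    if (WIFEXITED(status))
        return WEXITSTATUS(status) == 0 ? RUN_OK : RUN_NONZERO_EXIT;
    return RUN_LAUNCH_FAILED;
}
```

With a check of this kind, a call such as `classify_run("cafrun -n 4 ./hello_coarrays")` would report a failure for a run that prints the right values but dies during shutdown, provided the launcher propagates the failure in its own exit status.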

everythingfunctional commented 2 years ago

> I'll also note that while MPICH doesn't crash, it doesn't seem to provide the right answers. And while the other MPI implementations crash on program termination, they all get the right answers.

Per my comment on #763, I figured out what was happening. cafrun was still using mpiexec from openmpi, even though caf did link the executable to mpich. Perhaps something to look into :man_shrugging:?

everythingfunctional commented 2 years ago

#763 does in fact fix the crashes with openmpi, but Intel MPI still crashes. It seems there is more work to do.

vehre commented 2 years ago

Yes, you are right. The Intel MPI library requires memory allocated using MPI_Alloc_mem to be freed by MPI_Free_mem. mpich and openmpi do not care, which is why the memory was previously freed using free(). Changing this to MPI_Free_mem makes Intel MPI happy, and mpich and openmpi still don't care. I have updated #763 to hopefully pair every MPI_Alloc_mem with an MPI_Free_mem. At least all tests pass. Please give it another try.
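
The pairing rule described above can be sketched in C. This is an illustration only, not the code from #763: `tracked_buf`, `buf_alloc`, `buf_release`, and the `standin_*` functions are hypothetical, and the stand-ins take the place of MPI_Alloc_mem and MPI_Free_mem so that the sketch builds without an MPI installation. The idea is to record which allocator produced a buffer and release it through the matching deallocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Which allocator produced a buffer. ORIGIN_MPI_STANDIN represents memory
 * that real code would obtain from MPI_Alloc_mem. */
typedef enum { ORIGIN_LIBC, ORIGIN_MPI_STANDIN } mem_origin;

typedef struct {
    void      *ptr;
    mem_origin origin;
} tracked_buf;

/* Counts of outstanding allocations per allocator, for checking the pairing. */
static int libc_live = 0;
static int standin_live = 0;

/* Hypothetical stand-ins for MPI_Alloc_mem / MPI_Free_mem. */
static void *standin_alloc_mem(size_t n)
{
    void *p = malloc(n);
    if (p != NULL)
        ++standin_live;
    return p;
}

static void standin_free_mem(void *p)
{
    --standin_live;
    free(p);
}

/* Allocate n bytes and remember where they came from. */
tracked_buf buf_alloc(size_t n, mem_origin origin)
{
    tracked_buf b = { NULL, origin };
    if (origin == ORIGIN_MPI_STANDIN) {
        b.ptr = standin_alloc_mem(n);
    } else {
        b.ptr = calloc(1, n);
        if (b.ptr != NULL)
            ++libc_live;
    }
    return b;
}

/* Release through the deallocator that matches the recorded origin.
 * Clearing the pointer makes a second release a harmless no-op. */
void buf_release(tracked_buf *b)
{
    assert(b != NULL);
    if (b->ptr == NULL)
        return;
    if (b->origin == ORIGIN_MPI_STANDIN) {
        standin_free_mem(b->ptr);
    } else {
        free(b->ptr);
        --libc_live;
    }
    b->ptr = NULL;
}
```

Keeping the origin next to the pointer means the release path never has to guess, which avoids both directions of the mismatch: MPI memory handed to free(), and libc memory handed to the MPI deallocator.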

everythingfunctional commented 2 years ago

After #763, we now only crash on Windows with the following:

C:\Users\brad\Repositories\GitHub\sourceryinstitute>opencoarrays-install\bin\caf --show

C:\Users\brad\Repositories\GitHub\sourceryinstitute>"C:/Program Files/Git/usr/bin/bash.exe" "C:\Users\brad\Repositories\GitHub\sourceryinstitute\opencoarrays-install\bin\caf" --show
C:/Users/brad/gcc/bin/gfortran.exe -I/c/Users/brad/Repositories/GitHub/sourceryinstitute/opencoarrays-install/include/OpenCoarrays-2.10.0-14-g9d4afcb_GNU-12.1.0 -fcoarray=lib ${@} /c/Users/brad/Repositories/GitHub/sourceryinstitute/opencoarrays-install/lib/libcaf_mpi.a -pthread C:/Program Files (x86)/Intel/oneAPI/mpi/latest/lib/release/impi.lib

C:\Users\brad\Repositories\GitHub\sourceryinstitute>opencoarrays-install\bin\cafrun --show

C:\Users\brad\Repositories\GitHub\sourceryinstitute>"C:/Program Files/Git/usr/bin/bash.exe" "C:\Users\brad\Repositories\GitHub\sourceryinstitute\opencoarrays-install\bin\cafrun" --show
C:/Program Files (x86)/Intel/oneAPI/mpi/latest/bin/mpiexec.exe -n <number_of_images> /path/to/coarray_Fortran_program [arg4 [arg5 [...]]]

C:\Users\brad\Repositories\GitHub\sourceryinstitute>opencoarrays-install\bin\caf hello_coarrays.f90 -o hello_coarrays

C:\Users\brad\Repositories\GitHub\sourceryinstitute>"C:/Program Files/Git/usr/bin/bash.exe" "C:\Users\brad\Repositories\GitHub\sourceryinstitute\opencoarrays-install\bin\caf" hello_coarrays.f90 -o hello_coarrays

C:\Users\brad\Repositories\GitHub\sourceryinstitute>opencoarrays-install\bin\cafrun -n 4 .\hello_coarrays.exe

C:\Users\brad\Repositories\GitHub\sourceryinstitute>"C:/Program Files/Git/usr/bin/bash.exe" "C:\Users\brad\Repositories\GitHub\sourceryinstitute\opencoarrays-install\bin\cafrun" -n 4 .\hello_coarrays.exe
           1           1
           3           3
           4           4
           2           2

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 6520 RUNNING AT BRADRICHARD5FC1
=   EXIT STATUS: -1073740940 (c0000374)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 1 PID 444 RUNNING AT BRADRICHARD5FC1
=   EXIT STATUS: -1 (ffffffff)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 2 PID 4796 RUNNING AT BRADRICHARD5FC1
=   EXIT STATUS: -1 (ffffffff)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 3 PID 8996 RUNNING AT BRADRICHARD5FC1
=   EXIT STATUS: -1 (ffffffff)
===================================================================================
Error: Command:
   `C:/Program Files (x86)/Intel/oneAPI/mpi/latest/bin/mpiexec.exe -n 4 .\hello_coarrays.exe`
failed to run.

chris-nrc commented 2 years ago

Edit: I see MPI3 is required from the installation instructions, so this may not be applicable.

I still get errors on Linux with 9123d92. The output looks correct, though.

Pop!_OS 22.04 LTS, GCC 12.1.0, MPICH 4.0.2, CAF 2.10.0-15-g9123d92, tools compiled from source

(base) chris@pop-os:~/projects/caftest$ cafrun -np 4 ./hello_coarrays
           1           1
           2           2
           3           3
           4           4

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x7f78ade259f2 in ???
#1  0x7f78ade24b85 in ???
#2  0x7f78ada4251f in ???
#0  0x7f976fa259f2 in ???
#1  0x7f976fa24b85 in ???
#2  0x7f976f64251f in ???
#3  0x7f78ae2a4a2f in ???
#4  0x7f78ae5a4164 in ???
#3  0x7f976fea4a2f in ???
#4  0x7f97701a4164 in ???
#5  0x4139ac in ???
#6  0x4029f4 in ???
#7  0x7f976f629d8f in ???
#5  0x4139ac in ???
#6  0x4029f4 in ???
#7  0x7f78ada29d8f in ???
#8  0x7f78ada29e3f in ???
#9  0x402554 in ???
#10  0xffffffffffffffff in ???
#8  0x7f976f629e3f in ???
#9  0x402554 in ???
#10  0xffffffffffffffff in ???
#0  0x7f9755a259f2 in ???
#1  0x7f9755a24b85 in ???
#2  0x7f975564251f in ???
#3  0x7f9755ea4a2f in ???
#4  0x7f97561a4164 in ???
#5  0x4139ac in ???
#0  0x7f0674c259f2 in ???
#1  0x7f0674c24b85 in ???
#2  0x7f067484251f in ???
#3  0x7f06750a4a2f in ???
#4  0x7f06753a4164 in ???
#6  0x4029f4 in ???
#7  0x7f9755629d8f in ???
#8  0x7f9755629e3f in ???
#9  0x402554 in ???
#10  0xffffffffffffffff in ???
#5  0x4139ac in ???
#6  0x4029f4 in ???
#7  0x7f0674829d8f in ???
#8  0x7f0674829e3f in ???
#9  0x402554 in ???
#10  0xffffffffffffffff in ???

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 10689 RUNNING AT pop-os
=   EXIT CODE: 9
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
Error: Command:
   `/home/chris/.mpich-4.0.2/bin/mpiexec -n 4 ./hello_coarrays`
failed to run.

vehre commented 2 years ago

@everythingfunctional I took another deep look at the issue and had severe problems getting this running on my Windows VM. I can compile there, but can't run a single MPI program using the Intel oneAPI. I nearly gave up.

I took a last look on Linux and found memory that was allocated using calloc but freed using MPI_Free_mem. I fixed that in https://github.com/sourceryinstitute/OpenCoarrays/pull/766 . May I kindly ask you to check whether this change makes things worse or better?

everythingfunctional commented 2 years ago

@vehre, thanks for continuing to look into this. The change in #766 appears to have made things worse; the output now looks like the following. Question: should that memory have been allocated using MPI instead of calloc? I.e., instead of changing MPI_Free_mem to free, change the corresponding calloc to MPI_something?

C:\Users\brad\Repositories\GitHub\sourceryinstitute>"C:/Program Files/Git/usr/bin/bash.exe" "C:\Users\brad\Repositories\GitHub\sourceryinstitute\opencoarrays-install\bin\cafrun" -n 4 .\hello_coarrays.exe
           2           2
           4           4
           1           1
           3           3

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x27d9348a
#1  0x27d89343
#2  0x27d6a241
#3  0x8dfc7ff7
#4  0x8feb229e
#5  0x8fe61453
#6  0x8feb0dcd
#7  0x27d81583
#8  0x27d55161
#9  0x27d4192d
#10  0x27d413bd
#11  0x27d414f5
#12  0x8fb77033

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x27d9348a
#1  0x27d89343
#2  0x27d6a241
#3  0x8dfc7ff7
#4  0x8feb229e
#5  0x8fe61453
#6  0x8feb0dcd
#7  0x27d81583
#8  0x27d55161
#9  0x27d4192d
#10  0x27d413bd
#11  0x27d414f5
#12  0x8fb77033
#13  0x8fe62650
#14  0xffffffff
#0  0x27d9348a
#1  0x27d89343
#2  0x27d6a241
#3  0x8dfc7ff7
#4  0x8feb229e
#5  0x8fe61453
#6  0x8feb0dcd
#7  0x27d81583
#8  0x27d55161
#9  0x27d4192d
#10  0x27d413bd
#11  0x27d414f5
#12  0x8fb77033
#13  0x8fe62650
#14  0xffffffff
#13  0x8fe62650
#14  0xffffffff
#0  0x27d9348a
#1  0x27d89343
#2  0x27d6a241
#3  0x8dfc7ff7
#4  0x8feb229e
#5  0x8fe61453
#6  0x8feb0dcd
#7  0x27d81583
#8  0x27d55161
#9  0x27d4192d
#10  0x27d413bd
#11  0x27d414f5
#12  0x8fb77033
#13  0x8fe62650
#14  0xffffffff
Error: Command:
   `C:/Program Files (x86)/Intel/oneAPI/mpi/latest/bin/mpiexec.exe -n 4 .\hello_coarrays.exe`
failed to run.

vehre commented 2 years ago

That's odd. Unfortunately, no function names or line numbers are provided in your stack dumps. Can you recompile with debug information and re-run the program, so that I might get an idea of where the issue occurs? For a debug build you just need to pass -DCMAKE_BUILD_TYPE=Debug to the cmake configure call.

vehre commented 2 years ago

I managed to get OpenCoarrays running with oneAPI MPI on Win10 on a bare-metal notebook. The applications crashed there, too. I found a mutex that was not initialized, and unlocking it made the app crash on Windows. The mutex is needed by code that seems to be deactivated by a preprocessor symbol "HELPER". At a quick glance, it was introduced with some work on strided datatypes, but the function it is used in is not referenced anywhere in the current code base, so it is probably a candidate for removal. I have now used the same preprocessor symbol to prevent access to the mutex, which resolves the crash. Please check the updated PR #766.
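
As a rough sketch of the guarding approach described above (not the actual change in #766), the following C fragment puts every use of a mutex behind the same preprocessor symbol that guards its initialization. Only the symbol name HELPER comes from the discussion; `helper_lock`, `runtime_init`, `guarded_increment`, and `runtime_finalize` are hypothetical names, and the use of POSIX threads is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Compile with -DHELPER to enable the mutex; it is left undefined here. */
#ifdef HELPER
#include <pthread.h>
static pthread_mutex_t helper_lock;
#endif

/* Initialize the mutex only in builds that will actually use it. */
int runtime_init(void)
{
#ifdef HELPER
    if (pthread_mutex_init(&helper_lock, NULL) != 0)
        return -1;
#endif
    return 0;
}

/* Every lock/unlock sits behind the same symbol as the initialization, so a
 * build without HELPER never touches an uninitialized mutex. */
void guarded_increment(int *counter)
{
    assert(counter != NULL);
#ifdef HELPER
    pthread_mutex_lock(&helper_lock);
#endif
    ++*counter;
#ifdef HELPER
    pthread_mutex_unlock(&helper_lock);
#endif
}

/* Destroy the mutex only if this build created it. */
int runtime_finalize(void)
{
#ifdef HELPER
    if (pthread_mutex_destroy(&helper_lock) != 0)
        return -1;
#endif
    return 0;
}
```

Operating on a mutex that was never initialized is undefined behavior under POSIX, which is consistent with the crash showing up on some platforms and not on others.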

everythingfunctional commented 2 years ago

It seems that with #766, all of these crashes have now been fixed. Awesome work @vehre :tada: