Closed by everythingfunctional 2 years ago
Note: this is working to identify the root cause of #626
I'll also note that while MPICH doesn't crash, it doesn't seem to provide the right answers. And while the other MPI implementations crash on program termination, they all get the right answers.
On Linux, both mpich and openmpi no longer crash or print wrong results with #763. Please decide whether you want the useless testcase added (it cannot test for the openmpi crash, because the test framework does not consider the return code of the test).
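To illustrate why ignoring the return code hides this kind of bug, here is a minimal sketch of a check that treats a crash at termination as a failure even when the printed results are correct. It assumes a POSIX system; `run_and_classify`, `clean_program`, `crashes_at_exit` and the `-1000` sentinel are hypothetical names made up for this example and are not part of the OpenCoarrays test framework.

```c
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run `body` in a child process and report how it ended:
 *    0     clean exit
 *   >0     non-zero exit code
 *   <0     killed by a signal (negated signal number)
 *  -1000   fork/waitpid failed (hypothetical sentinel for this sketch) */
static int run_and_classify(void (*body)(void)) {
  fflush(NULL); /* keep buffered output from being duplicated in the child */
  pid_t pid = fork();
  if (pid < 0) return -1000;
  if (pid == 0) {
    body();
    _exit(0);
  }
  int status = 0;
  if (waitpid(pid, &status, 0) < 0) return -1000;
  if (WIFSIGNALED(status)) return -WTERMSIG(status);
  if (WIFEXITED(status)) return WEXITSTATUS(status);
  return -1000;
}

/* Stand-in for a program that computes the right answer and exits cleanly. */
static void clean_program(void) { }

/* Stand-in for a program that computes the right answer but dies on shutdown. */
static void crashes_at_exit(void) {
  /* ... correct results would have been printed before this point ... */
  raise(SIGSEGV);
}
```

A harness that only compares output would accept both stand-ins. One that also requires `run_and_classify(...) == 0` would reject the second.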
Per my comment on #763, I figured out what was happening. cafrun was still using mpiexec from openmpi, even though caf did link the executable to mpich. Perhaps something to look into :man_shrugging:?
Yes, you are right. The Intel MPI library requires memory allocated with MPI_Alloc_mem to be freed with MPI_Free_mem. mpich and openmpi do not care, so the memory was previously freed using free(). Changing this to MPI_Free_mem makes Intel MPI happy, and mpich and openmpi still don't care. I have updated #763 so that, hopefully, every MPI_Alloc_mem is mirrored by an MPI_Free_mem. At least all tests pass. Please give it another try.
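One way to keep allocation and release paired is to record which allocator produced each block and let a single release function pick the matching call. The following is only a sketch under stated assumptions, not the OpenCoarrays code: `tracked_mem`, `alloc_libc`, `alloc_mpi`, `release` and the `USE_MPI` build flag are hypothetical names. With `USE_MPI` defined it calls the real `MPI_Alloc_mem`/`MPI_Free_mem` (after `MPI_Init`); without it, plain `malloc`/`free` stand in so the sketch builds without an MPI installation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#ifdef USE_MPI /* hypothetical build flag for this sketch */
#include <mpi.h>
#endif

/* Which allocator produced a block, so the release path never has to guess. */
typedef enum { ORIGIN_NONE, ORIGIN_LIBC, ORIGIN_MPI } mem_origin;

typedef struct {
  void *ptr;
  mem_origin origin;
} tracked_mem;

/* Zero-initialized memory from the C library; must be released with free(). */
static tracked_mem alloc_libc(size_t n) {
  tracked_mem m = { calloc(1, n), ORIGIN_LIBC };
  if (m.ptr == NULL) m.origin = ORIGIN_NONE;
  return m;
}

/* Memory from MPI; must be released with MPI_Free_mem(). */
static tracked_mem alloc_mpi(size_t n) {
  tracked_mem m = { NULL, ORIGIN_NONE };
#ifdef USE_MPI
  /* Requires MPI_Init to have been called already. */
  if (MPI_Alloc_mem((MPI_Aint)n, MPI_INFO_NULL, &m.ptr) == MPI_SUCCESS)
    m.origin = ORIGIN_MPI;
  else
    m.ptr = NULL;
#else
  /* Stand-in so the sketch compiles without MPI. */
  m.ptr = malloc(n);
  if (m.ptr != NULL) m.origin = ORIGIN_MPI;
#endif
  return m;
}

/* Release with the deallocator that matches the recorded origin. */
static void release(tracked_mem *m) {
  switch (m->origin) {
  case ORIGIN_MPI:
#ifdef USE_MPI
    MPI_Free_mem(m->ptr);
#else
    free(m->ptr);
#endif
    break;
  case ORIGIN_LIBC:
    free(m->ptr);
    break;
  case ORIGIN_NONE:
    break; /* nothing to release; also makes a second release() harmless */
  }
  m->ptr = NULL;
  m->origin = ORIGIN_NONE;
}
```

Tagging the block costs one extra word per allocation. In exchange, a mismatch such as the one Intel MPI rejects cannot occur on the release path.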
After #763, we now only crash on Windows with the following:
C:\Users\brad\Repositories\GitHub\sourceryinstitute>opencoarrays-install\bin\caf --show
C:\Users\brad\Repositories\GitHub\sourceryinstitute>"C:/Program Files/Git/usr/bin/bash.exe" "C:\Users\brad\Repositories\GitHub\sourceryinstitute\opencoarrays-install\bin\caf" --show
C:/Users/brad/gcc/bin/gfortran.exe -I/c/Users/brad/Repositories/GitHub/sourceryinstitute/opencoarrays-install/include/OpenCoarrays-2.10.0-14-g9d4afcb_GNU-12.1.0 -fcoarray=lib ${@} /c/Users/brad/Repositories/GitHub/sourceryinstitute/opencoarrays-install/lib/libcaf_mpi.a -pthread C:/Program Files (x86)/Intel/oneAPI/mpi/latest/lib/release/impi.lib
C:\Users\brad\Repositories\GitHub\sourceryinstitute>opencoarrays-install\bin\cafrun --show
C:\Users\brad\Repositories\GitHub\sourceryinstitute>"C:/Program Files/Git/usr/bin/bash.exe" "C:\Users\brad\Repositories\GitHub\sourceryinstitute\opencoarrays-install\bin\cafrun" --show
C:/Program Files (x86)/Intel/oneAPI/mpi/latest/bin/mpiexec.exe -n <number_of_images> /path/to/coarray_Fortran_program [arg4 [arg5 [...]]]
C:\Users\brad\Repositories\GitHub\sourceryinstitute>opencoarrays-install\bin\caf hello_coarrays.f90 -o hello_coarrays
C:\Users\brad\Repositories\GitHub\sourceryinstitute>"C:/Program Files/Git/usr/bin/bash.exe" "C:\Users\brad\Repositories\GitHub\sourceryinstitute\opencoarrays-install\bin\caf" hello_coarrays.f90 -o hello_coarrays
C:\Users\brad\Repositories\GitHub\sourceryinstitute>opencoarrays-install\bin\cafrun -n 4 .\hello_coarrays.exe
C:\Users\brad\Repositories\GitHub\sourceryinstitute>"C:/Program Files/Git/usr/bin/bash.exe" "C:\Users\brad\Repositories\GitHub\sourceryinstitute\opencoarrays-install\bin\cafrun" -n 4 .\hello_coarrays.exe
1 1
3 3
4 4
2 2
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 6520 RUNNING AT BRADRICHARD5FC1
= EXIT STATUS: -1073740940 (c0000374)
===================================================================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 1 PID 444 RUNNING AT BRADRICHARD5FC1
= EXIT STATUS: -1 (ffffffff)
===================================================================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 2 PID 4796 RUNNING AT BRADRICHARD5FC1
= EXIT STATUS: -1 (ffffffff)
===================================================================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 3 PID 8996 RUNNING AT BRADRICHARD5FC1
= EXIT STATUS: -1 (ffffffff)
===================================================================================
Error: Command:
`C:/Program Files (x86)/Intel/oneAPI/mpi/latest/bin/mpiexec.exe -n 4 .\hello_coarrays.exe`
failed to run.
Edit: I see MPI3 is required from the installation instructions, so this may not be applicable.
I still get errors on Linux with 9123d92. The output looks correct though.
Pop!_OS 22.04 LTS, GCC 12.1.0, MPICH 4.0.2, CAF 2.10.0-15-g9123d92; tools compiled from source
(base) chris@pop-os:~/projects/caftest$ cafrun -np 4 ./hello_coarrays
1 1
2 2
3 3
4 4
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
#0 0x7f78ade259f2 in ???
#1 0x7f78ade24b85 in ???
#2 0x7f78ada4251f in ???
#0 0x7f976fa259f2 in ???
#1 0x7f976fa24b85 in ???
#2 0x7f976f64251f in ???
#3 0x7f78ae2a4a2f in ???
#4 0x7f78ae5a4164 in ???
#3 0x7f976fea4a2f in ???
#4 0x7f97701a4164 in ???
#5 0x4139ac in ???
#6 0x4029f4 in ???
#7 0x7f976f629d8f in ???
#5 0x4139ac in ???
#6 0x4029f4 in ???
#7 0x7f78ada29d8f in ???
#8 0x7f78ada29e3f in ???
#9 0x402554 in ???
#10 0xffffffffffffffff in ???
#8 0x7f976f629e3f in ???
#9 0x402554 in ???
#10 0xffffffffffffffff in ???
#0 0x7f9755a259f2 in ???
#1 0x7f9755a24b85 in ???
#2 0x7f975564251f in ???
#3 0x7f9755ea4a2f in ???
#4 0x7f97561a4164 in ???
#5 0x4139ac in ???
#0 0x7f0674c259f2 in ???
#1 0x7f0674c24b85 in ???
#2 0x7f067484251f in ???
#3 0x7f06750a4a2f in ???
#4 0x7f06753a4164 in ???
#6 0x4029f4 in ???
#7 0x7f9755629d8f in ???
#8 0x7f9755629e3f in ???
#9 0x402554 in ???
#10 0xffffffffffffffff in ???
#5 0x4139ac in ???
#6 0x4029f4 in ???
#7 0x7f0674829d8f in ???
#8 0x7f0674829e3f in ???
#9 0x402554 in ???
#10 0xffffffffffffffff in ???
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 10689 RUNNING AT pop-os
= EXIT CODE: 9
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
Error: Command:
`/home/chris/.mpich-4.0.2/bin/mpiexec -n 4 ./hello_coarrays`
failed to run.
@everythingfunctional I took another deep look at the issue and had severe problems getting this running on my Windows VM. I can compile there, but can't run a single MPI program using Intel oneAPI. I nearly gave up.
I took a last look on Linux and found memory that was allocated using calloc but freed using MPI_Free_mem. I fixed that in https://github.com/sourceryinstitute/OpenCoarrays/pull/766 . May I kindly ask you to check whether this change makes things worse or better?
@vehre, thanks for continuing to look into this. The change in #766 appears to have made things worse. The output now looks like below. Question: should that memory have been allocated using MPI instead of calloc? I.e. instead of changing MPI_Free_mem to free, change the corresponding calloc to MPI_something?
C:\Users\brad\Repositories\GitHub\sourceryinstitute>"C:/Program Files/Git/usr/bin/bash.exe" "C:\Users\brad\Repositories\GitHub\sourceryinstitute\opencoarrays-install\bin\cafrun" -n 4 .\hello_coarrays.exe
2 2
4 4
1 1
3 3
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
#0 0x27d9348a
#1 0x27d89343
#2 0x27d6a241
#3 0x8dfc7ff7
#4 0x8feb229e
#5 0x8fe61453
#6 0x8feb0dcd
#7 0x27d81583
#8 0x27d55161
#9 0x27d4192d
#10 0x27d413bd
#11 0x27d414f5
#12 0x8fb77033
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
#0 0x27d9348a
#1 0x27d89343
#2 0x27d6a241
#3 0x8dfc7ff7
#4 0x8feb229e
#5 0x8fe61453
#6 0x8feb0dcd
#7 0x27d81583
#8 0x27d55161
#9 0x27d4192d
#10 0x27d413bd
#11 0x27d414f5
#12 0x8fb77033
#13 0x8fe62650
#14 0xffffffff
#0 0x27d9348a
#1 0x27d89343
#2 0x27d6a241
#3 0x8dfc7ff7
#4 0x8feb229e
#5 0x8fe61453
#6 0x8feb0dcd
#7 0x27d81583
#8 0x27d55161
#9 0x27d4192d
#10 0x27d413bd
#11 0x27d414f5
#12 0x8fb77033
#13 0x8fe62650
#14 0xffffffff
#13 0x8fe62650
#14 0xffffffff
#0 0x27d9348a
#1 0x27d89343
#2 0x27d6a241
#3 0x8dfc7ff7
#4 0x8feb229e
#5 0x8fe61453
#6 0x8feb0dcd
#7 0x27d81583
#8 0x27d55161
#9 0x27d4192d
#10 0x27d413bd
#11 0x27d414f5
#12 0x8fb77033
#13 0x8fe62650
#14 0xffffffff
Error: Command:
`C:/Program Files (x86)/Intel/oneAPI/mpi/latest/bin/mpiexec.exe -n 4 .\hello_coarrays.exe`
failed to run.
That's odd. Unfortunately there is no function name or line number in your stack dumps. Can you recompile with debug information and re-run the program, so that I might get an idea of where the issue occurs?
For a debug build you just need to provide -DCMAKE_BUILD_TYPE=Debug to the cmake configure call.
I managed to get OpenCoarrays running with oneAPI MPI on Win10 on a bare metal notebook. The applications crashed there, too. I found a mutex that was not initialized, and unlocking it made the app crash on Windows. The mutex is needed by code that seems to be deactivated by a preprocessor symbol "HELPER". At a quick glance it was introduced with some work on strided datatypes, but the function it is used in is not referenced anywhere in the current code base, so it is probably a candidate for removal. I have now used the same preprocessor symbol to prevent access to the mutex, which resolves the crash. Please check updated PR #766 .
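The guard described above can be sketched roughly as follows. This is not the code from PR #766: the `HELPER` symbol comes from the comment, while `helper_lock`, `helper_init`, `helper_critical_section` and `helper_finalize` are hypothetical names, and POSIX threads are assumed. The idea is that initialization, lock/unlock and destruction all sit behind the same symbol, so a build without it never touches an uninitialized mutex.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical switch mirroring the "HELPER" symbol mentioned above.
 * Leave it undefined to model the deactivated code path. */
/* #define HELPER */

#ifdef HELPER
static pthread_mutex_t helper_lock; /* hypothetical name */
#endif

/* Initialize the mutex only when the code that needs it is compiled in. */
static int helper_init(void) {
#ifdef HELPER
  return pthread_mutex_init(&helper_lock, NULL);
#else
  return 0;
#endif
}

/* Lock and unlock live behind the same guard as the initialization. */
static int helper_critical_section(int *counter) {
#ifdef HELPER
  int rc = pthread_mutex_lock(&helper_lock);
  if (rc != 0) return rc;
  ++*counter;
  return pthread_mutex_unlock(&helper_lock);
#else
  ++*counter; /* no helper thread compiled in, so no locking is needed */
  return 0;
#endif
}

/* Destroy the mutex only if it was ever created. */
static int helper_finalize(void) {
#ifdef HELPER
  return pthread_mutex_destroy(&helper_lock);
#else
  return 0;
#endif
}
```

The sketch behaves the same from the caller's point of view whether or not `HELPER` is defined. Only the locking differs.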
It seems that with #766, all of these crashes have now been fixed. Awesome work @vehre :tada:
System information including:
uname -a: Darwin Brads-MBP.tx.rr.com 21.4.0 Darwin Kernel Version 21.4.0: Fri Mar 18 00:45:05 PDT 2022; root:xnu-8020.101.4~15/RELEASE_X86_64 x86_64, Windows 10 installation in Parallels on same Mac, and Linux pop-os 5.17.15-76051715-generic #202206141358~1655919116~22.04~1db9e34 SMP PREEMPT Wed Jun 22 19 x86_64 x86_64 x86_64 GNU/Linux
To help us debug your issue please explain:
What you were trying to do (and why)
Compile and execute the following program.
What happened (include command output, screenshots, logs, etc.)
Homebrew install On MacOS
Compiled from source and linked to MPICH compiled from source on MacOS
Compiled from source on Windows and linked with Intel oneAPI MPI
Compiled from source on Linux and linked to system openmpi
Compiled from source and linked to MPICH compiled from source on Linux
Compiled from source on Linux and linked with Intel oneAPI MPI
What you expected to happen
Ideally, the MPI implementation shouldn't affect whether a crash occurs
Step-by-step reproduction instructions to reproduce the error/bug
Link/use an MPI implementation other than MPICH