uwsampa / grappa

Grappa: scaling irregular applications on commodity clusters
grappa.io
BSD 3-Clause "New" or "Revised" License

Exiting due to signal 11 with siginfo 0x... and payload 0x... #265

Open jeffhammond opened 8 years ago

jeffhammond commented 8 years ago

Is this to be expected? This is a dual-socket Intel Xeon 2699v3 (Haswell) workstation, if it matters.

$ uname -a
Linux esgmonster 2.6.32-573.12.1.el6.centos.plus.x86_64 #1 SMP Wed Dec 16 16:48:42 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
$ mpicxx -show
g++ -I/opt/intel/compilers_and_libraries_2016.0.109/linux/mpi/intel64/include 
-L/opt/intel/compilers_and_libraries_2016.0.109/linux/mpi/intel64/lib/release_mt 
-L/opt/intel/compilers_and_libraries_2016.0.109/linux/mpi/intel64/lib -Xlinker 
--enable-new-dtags 
-Xlinker -rpath -Xlinker /opt/intel/compilers_and_libraries_2016.0.109/linux/mpi/intel64/lib/release_mt 
-Xlinker -rpath -Xlinker /opt/intel/compilers_and_libraries_2016.0.109/linux/mpi/intel64/lib 
-Xlinker -rpath -Xlinker /opt/intel/mpi-rt/5.1/intel64/lib/release_mt 
-Xlinker -rpath -Xlinker /opt/intel/mpi-rt/5.1/intel64/lib 
-lmpicxx -lmpifort -lmpi -lmpigi -ldl -lrt -lpthread
rm -rf *
MPI_ROOT=/opt/intel/compilers_and_libraries_2016.0.109/linux/mpi/intel64
cmake .. -DGRAPPA_INSTALL_PREFIX=/opt/grappa/$COMPILER \
                     -DCMAKE_C_COMPILER="$MPI_ROOT/bin/mpicc" \
                     -DCMAKE_CXX_COMPILER="$MPI_ROOT/bin/mpicxx" \
                     -DMPI_C_COMPILER="$MPI_ROOT/bin/mpicc" \
                     -DMPI_CXX_COMPILER="$MPI_ROOT/bin/mpicxx"
[jrhammon@esgmonster github-official]$ mpirun GRAPPA/Synch_p2p/p2p 10 $((32*36)) 32
. . . 
Parallel Research Kernels version 2.16
Grappa pipeline execution on 2D grid
Number of processes            = 36
Grid sizes                     = 1152x32
Number of iterations           = 10
Solution validates
Rate (MFlops/s): 18.1586  Avg time (s): 0.00392993
Exiting due to signal 11 with siginfo 0x400340f066f0 and payload 0x400340f065c0
[jrhammon@esgmonster github-official]$ mpirun GRAPPA/Stencil/stencil 10 $((32*36))
. . .
Parallel Research Kernels version 2.16
Grappa stencil execution on 2D grid
Number of cores        = 36
Grid size              = 1152
Radius of stencil      = 2
Tiles in x/y-direction = 6/6
Type of stencil        = star
Data type              = double precision
Compact representation of stencil loop body
Number of iterations   = 10
Solution validates
Rate (MFlops/s): 10048.7  Avg time (s): 0.00249188
Exiting due to signal 11 with siginfo 0x400340e406f0 and payload 0x400340e405c0
[jrhammon@esgmonster github-official]$ mpirun GRAPPA/Transpose/transpose 10 $((32*36)) 32
. . .
Parallel Research Kernels version 2.16
Grappa matrix transpose: B = A^T
Number of cores         = 36
Matrix order            = 1152
Number of iterations    = 10
Tile size               = 32
Implementation DEPRECATED: result accumulation not yet implemented
Solution validates
Rate (MB/s): -8210.7 Avg time (s): 0.0025861
Exiting due to signal 11 with siginfo 0x40034116e6b0 and payload 0x40034116e580

nelsonje commented 8 years ago

(summarizing off-github discussion)

This shows up on some systems, and comes from the way we clean up after the Boost shared memory gadget we use. I'm working on replacing it.
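In the runs pasted above, the signal 11 message appears only after "Solution validates" and the timing line, which fits a failure during cleanup at exit rather than in the kernel itself. A minimal sketch for sorting runs by that pattern follows. The function name `classify_run` and its output labels are hypothetical and not part of Grappa or the Parallel Research Kernels; the two strings it searches for are copied from the logs in this issue.

```shell
#!/bin/sh
# Hypothetical helper (not part of Grappa or PRK): run a command, then report
# whether the "signal 11" message appeared and whether it came from a run
# that had already printed "Solution validates".
# Note: the command's own output is captured to a temp file and not echoed.
classify_run() {
    log=$(mktemp) || return 2
    # Capture stdout and stderr together; keep the launcher's exit status.
    "$@" >"$log" 2>&1
    rc=$?
    if grep -q "Exiting due to signal 11" "$log"; then
        if grep -q "Solution validates" "$log"; then
            echo "crash after validation (exit status $rc)"
        else
            echo "crash without validation line (exit status $rc)"
        fi
    elif [ "$rc" -eq 0 ]; then
        echo "clean exit (exit status 0)"
    else
        echo "other failure (exit status $rc)"
    fi
    rm -f "$log"
}
```

For example, `classify_run mpirun GRAPPA/Synch_p2p/p2p 10 $((32*36)) 32` would be expected to report a crash after validation on a system that shows the behaviour described in this issue.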

caiwanli commented 11 months ago

I have the same problem, so how can I solve it?

[root@glusterfs-02 demos]# mpirun --allow-run-as-root -n 2 ./hello_world.exe

WARNING: No preset parameters were found for the device that Open MPI detected:

Local host:            glusterfs-02
Device name:           i40iw0
Device vendor ID:      0x8086
Device vendor part ID: 14291

Default device parameters will be used, which may result in lower performance. You can edit any of the files specified by the btl_openib_device_param_files MCA parameter to set values for your device.

NOTE: You can turn off this warning by setting the MCA parameter btl_openib_warn_no_device_params_found to 0.

Exiting due to signal 11 with siginfo 0x7ffc14f5ddf0 and payload 0x7ffc14f5dcc0

Primary job terminated normally, but 1 process returned a non-zero exit code.. Per user-direction, the job has been aborted.


mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

Process name: [[54189,1],0]
Exit code:    1

[glusterfs-02:26194] 5 more processes have sent help message help-mpi-btl-openib.txt / no device params found
[glusterfs-02:26194] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[root@glusterfs-02 demos]# uname -a
Linux glusterfs-02 3.10.0-1160.83.1.el7.x86_64 #1 SMP Wed Jan 25 16:41:43 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
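
The Open MPI output above names two MCA parameters that change how much help text is printed, which may make the signal 11 line easier to spot. A sketch of setting them through environment variables, assuming Open MPI's usual `OMPI_MCA_<name>` convention. Neither setting is expected to affect the crash itself.

```shell
# Silence the "No preset parameters were found" openib warning,
# as the NOTE in the output suggests.
export OMPI_MCA_btl_openib_warn_no_device_params_found=0

# Show every help / error message instead of the aggregated summary.
export OMPI_MCA_orte_base_help_aggregate=0

# The same parameters can be passed on the command line instead, e.g.:
#   mpirun --mca btl_openib_warn_no_device_params_found 0 \
#          --allow-run-as-root -n 2 ./hello_world.exe
```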