intel / mpi-benchmarks


Signal 11 Seg Fault at end of run #16

Closed: titanlock closed this issue 5 years ago

titanlock commented 5 years ago

Hello, I am trying to run tests with OpenMPI v4.0.0 and was having issues with the IMB v2019.1 release. The OpenMPI developers told me to use this commit as a workaround: https://github.com/intel/mpi-benchmarks/commit/841446d8cf4ca1f607c0f24b9a424ee39ee1f569. This works fine until the very end of the run, where what I'm guessing is a cleanup step segfaults on one or two machines. Is there any way to get more output for the end of the run? I tried using '-v' but got nothing more out of it.
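To help narrow this down, here is a minimal finalize-only reproducer I can run with the same mpirun options (my own sketch; the file name and build line are assumptions, not anything shipped with IMB). If this also segfaults in MPI_Finalize, the problem would seem to be in OpenMPI's openib BTL teardown rather than in the benchmark itself:

/* finalize_repro.c -- minimal sketch, not part of IMB.
 * Build and launch (assumed typical OpenMPI wrappers):
 *   mpicc -o finalize_repro finalize_repro.c
 *   mpirun --mca btl openib,vader,self --mca pml ob1 --mca btl_openib_allow_ib 1 \
 *          -np 8 -hostfile /home/aleblanc/ib-mpi-hosts ./finalize_repro
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Do one collective over the fabric so the openib BTL sets up its
     * registration cache (the rcache/grdma code that shows up in the
     * backtrace at finalize time). */
    int value = rank;
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Barrier(MPI_COMM_WORLD);

    printf("rank %d entering MPI_Finalize\n", rank);
    MPI_Finalize();   /* the IMB run crashes somewhere past this point */
    return 0;
}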

Command used:

mpirun -v --mca btl_openib_warn_no_device_params_found 0 --map-by node --mca orte_base_help_aggregate 0  --mca btl openib,vader,self --mca pml ob1 --mca btl_openib_allow_ib 1 -np 8 -hostfile /home/aleblanc/ib-mpi-hosts IMB-MPI1

Output:

#----------------------------------------------------------------
# Benchmarking Bcast
# #processes = 6 
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.11         0.12         0.11
            1         1000         1.72         7.06         4.86
            2         1000         1.72         6.85         4.80
            4         1000         1.71         6.92         4.78
            8         1000         1.76         7.12         4.91
           16         1000         1.76         7.18         4.89
           32         1000         1.74         7.17         4.87
           64         1000         1.81         7.58         5.13
          128         1000         1.80         9.27         6.16
          256         1000         1.84         9.54         6.34
          512         1000         2.15        10.70         7.22
         1024         1000         2.35        11.70         7.92
         2048         1000         2.21        15.09        10.10
         4096         1000         3.62        17.32        12.54
         8192         1000         6.17        23.32        17.99
        16384         1000        11.24        37.28        28.67
        32768         1000        62.61        80.91        71.06
        65536          640       109.31       131.24       120.22
       131072          320       225.50       236.59       231.80
       262144          160       430.89       449.17       442.21
       524288           80       406.54       453.22       430.84
      1048576           40       811.17       878.36       842.89
      2097152           20      1788.67      1886.04      1824.92
      4194304           10      2899.46      3183.22      3073.55

#---------------------------------------------------
# Benchmarking Barrier 
# #processes = 2 
# ( 4 additional processes waiting in MPI_Barrier)
#---------------------------------------------------
 #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
         1000         2.30         2.30         2.30

#---------------------------------------------------
# Benchmarking Barrier 
# #processes = 4 
# ( 2 additional processes waiting in MPI_Barrier)
#---------------------------------------------------
 #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
         1000         4.87         4.87         4.87

#---------------------------------------------------
# Benchmarking Barrier 
# #processes = 6 
#---------------------------------------------------
 #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
         1000         8.54         8.54         8.54

# All processes entering MPI_Finalize

[titan:08194] *** Process received signal ***
[titan:08194] Signal: Segmentation fault (11)
[titan:08194] Signal code: Address not mapped (1)
[titan:08194] Failing at address: 0x10
[titan:08194] [ 0] /lib64/libpthread.so.0(+0xf680)[0x7f0218104680]
[titan:08194] [ 1] /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x2a865)[0x7f021777f865]
[titan:08194] [ 2] /opt/openmpi/4.0.0/lib/openmpi/mca_rcache_grdma.so(+0x1fd9)[0x7f020b9defd9]
[titan:08194] [ 3] /opt/openmpi/4.0.0/lib/libopen-pal.so.40(mca_rcache_base_module_destroy+0x8f)[0x7f021781d55f]
[titan:08194] [ 4] /opt/openmpi/4.0.0/lib/openmpi/mca_btl_openib.so(+0xeba7)[0x7f020ac73ba7]
[titan:08194] [ 5] /opt/openmpi/4.0.0/lib/openmpi/mca_btl_openib.so(mca_btl_openib_finalize+0x601)[0x7f020ac6ef91]
[titan:08194] [ 6] /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x76213)[0x7f02177cb213]
[titan:08194] [ 7] /opt/openmpi/4.0.0/lib/libopen-pal.so.40(mca_base_framework_close+0x79)[0x7f02177b5799]
[titan:08194] [ 8] /opt/openmpi/4.0.0/lib/libopen-pal.so.40(mca_base_framework_close+0x79)[0x7f02177b5799]
[titan:08194] [ 9] /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_mpi_finalize+0x86f)[0x7f0218367c1f]
[titan:08194] [10] IMB-MPI1[0x4025d4]
[titan:08194] [11] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f0217d473d5]
[titan:08194] [12] IMB-MPI1[0x401d59]
[titan:08194] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
[pandora:13903] *** Process received signal ***
[pandora:13903] Signal: Segmentation fault (11)
[pandora:13903] Signal code: Address not mapped (1)
[pandora:13903] Failing at address: 0x10
[pandora:13903] [ 0] /lib64/libpthread.so.0(+0xf680)[0x7f68ee599680]
[pandora:13903] [ 1] /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x2a865)[0x7f68edc14865]
[pandora:13903] [ 2] /opt/openmpi/4.0.0/lib/openmpi/mca_rcache_grdma.so(+0x1fd9)[0x7f68e1b8bfd9]
[pandora:13903] [ 3] /opt/openmpi/4.0.0/lib/libopen-pal.so.40(mca_rcache_base_module_destroy+0x8f)[0x7f68edcb255f]
[pandora:13903] [ 4] /opt/openmpi/4.0.0/lib/openmpi/mca_btl_openib.so(+0xeba7)[0x7f68e1548ba7]
[pandora:13903] [ 5] /opt/openmpi/4.0.0/lib/openmpi/mca_btl_openib.so(mca_btl_openib_finalize+0x601)[0x7f68e1543f91]
[pandora:13903] [ 6] /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x76213)[0x7f68edc60213]
[pandora:13903] [ 7] /opt/openmpi/4.0.0/lib/libopen-pal.so.40(mca_base_framework_close+0x79)[0x7f68edc4a799]
[pandora:13903] [ 8] /opt/openmpi/4.0.0/lib/libopen-pal.so.40(mca_base_framework_close+0x79)[0x7f68edc4a799]
[pandora:13903] [ 9] /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_mpi_finalize+0x86f)[0x7f68ee7fcc1f]
[pandora:13903] [10] IMB-MPI1[0x4025d4]
[pandora:13903] [11] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f68ee1dc3d5]
[pandora:13903] [12] IMB-MPI1[0x401d59]
[pandora:13903] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 8194 on node titan-ib exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
titanlock commented 5 years ago

I was just notified that this issue could be threading-related and might be difficult to reproduce. I would like to get it resolved as soon as possible, since I am running this testing for the OpenFabrics Alliance at the UNH-IOL and this is the last item that needs to be addressed. If you would like to VPN into our cluster to troubleshoot this faster, you can contact me at aleblanc@iol.unh.edu.
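Since threading was mentioned, here is a small check I can run on the same nodes to see what thread support level OpenMPI actually provides there (again my own sketch, not an IMB option):

/* thread_level_check.c -- sketch to report the provided MPI thread level. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided = 0;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        const char *name =
            (provided == MPI_THREAD_MULTIPLE)   ? "MPI_THREAD_MULTIPLE" :
            (provided == MPI_THREAD_SERIALIZED) ? "MPI_THREAD_SERIALIZED" :
            (provided == MPI_THREAD_FUNNELED)   ? "MPI_THREAD_FUNNELED" :
                                                  "MPI_THREAD_SINGLE";
        printf("requested MPI_THREAD_MULTIPLE, provided %s\n", name);
    }

    MPI_Finalize();
    return 0;
}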

Thank you and I hope to hear from you guys soon.

VinnitskiV commented 5 years ago

Hello @titanlock, thank you for your interest in IMB, and sorry for the delay. IMB has no option to produce more output, so you should ask the OpenMPI team about this.