open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Error running HPL on Cori #6829

Open nuriallv opened 5 years ago

nuriallv commented 5 years ago

On Cori, using uGNI, I consistently get this error at the same point in the execution when running HPL on 8 Haswell nodes with 256 processes:

mpirun -npernode 32 --bind-to core -np 256 xhpl
================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   60000 
NB     :     500 
PMAP   : Row-major process mapping
P      :      16 
Q      :      16 
PFACT  :   Crout 
NBMIN  :       2 
NDIV   :       2 
RFACT  :   Crout 
BCAST  :   1ring 
DEPTH  :       0 
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

 Warning :: opal_list_remove_item - the item 0x879f80 is not on the list 0x876ac0 
[nid00202][[32284,1],151][btl_ugni_module.c:356:mca_btl_ugni_device_handle_event_error] giving up on desciptor 0x272f440, recoverable 0: SOURCE_SSID:AT_PF_INV:CPLTN_SRSP
desc->gni_desc.post_id          = 0
desc->gni_desc.status           = b
desc->gni_desc.cq_mode_complete = 33044
desc->gni_desc.type             = 2
desc->gni_desc.cq_mode          = 2
desc->gni_desc.dlvr_mode        = 0
desc->gni_desc.local_addr       = 13295e8
desc->gni_desc.local_mem_hndl   = {d2affffffff3b6ea, 84077e40c00c5cd7}
desc->gni_desc.remote_addr      = 33e6768
desc->gni_desc.remote_mem_hndl  = {7ffffffffff56592, bb077e40c00acece}
desc->gni_desc.length           = 496000
desc->gni_desc.rdma_mode        = 0
desc->gni_desc.amo_cmd          = -42380
xhpl: pml_ob1_sendreq.h:219: mca_pml_ob1_send_request_fini: Assertion `((void *)0) == sendreq->rdma_frag' failed.
[nid00208:03799] *** Process received signal ***
[nid00208:03799] Signal: Aborted (6)
[nid00208:03799] Signal code:  (-6)
[nid00208:03799] [ 0] /lib64/libc.so.6(+0x34fe0)[0x2aaaac706fe0]
[nid00208:03799] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x2aaaac706f67]
[nid00208:03799] [ 2] /lib64/libc.so.6(abort+0x13a)[0x2aaaac70833a]
[nid00208:03799] [ 3] /lib64/libc.so.6(+0x2dd66)[0x2aaaac6ffd66]
[nid00208:03799] [ 4] /lib64/libc.so.6(+0x2de12)[0x2aaaac6ffe12]
[nid00208:03799] [ 5] /global/homes/n/nlosada/ompi/Build/lib/openmpi/mca_pml_ob1.so(+0x11855)[0x2aaac20fc855]
[nid00208:03799] [ 6] /global/homes/n/nlosada/ompi/Build/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x644)[0x2aaac20fde4d]
[nid00208:03799] [ 7] /global/homes/n/nlosada/ompi/Build/lib/libmpi.so.0(PMPI_Send+0x2b0)[0x2aaaabfb538a]
[nid00208:03799] [ 8] xhpl[0x41ba8b]
[nid00208:03799] [ 9] xhpl[0x4112c7]
[nid00208:03799] [10] xhpl[0x40f493]
[nid00208:03799] [11] xhpl[0x40f9bf]
[nid00208:03799] [12] xhpl[0x40d625]
[nid00208:03799] [13] xhpl[0x4060af]
[nid00208:03799] [14] xhpl[0x401a1f]
[nid00208:03799] [15] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2aaaac6f2725]
[nid00208:03799] [16] xhpl[0x401de9]
[nid00208:03799] *** End of error message ***
[nid00208:03798] *** Process received signal ***

Open MPI master, commit 5bd90ee548a8168982dbb62d0e38ed261d18bffa, built as follows:

module load PrgEnv-gnu
module rm cray-mpich/7.7.3
module rm darshan/3.1.4
module rm cray-libsci/18.07.1
module rm PrgEnv-intel/6.0.4

./autogen.pl
./configure --prefix=${DIR}/Build \
    --enable-debug \
    -enable-orterun-prefix-by-default \
    --disable-java --disable-oshmem --enable-shared --disable-static \
    CC=cc FC=ftn CXX=CC LDFLAGS=-dynamic --without-xpmem --without-verbs
make -j24 install

HPL 2.3 (https://www.netlib.org/benchmark/hpl/hpl-2.3.tar.gz), built as follows:

./configure --prefix=$PWD/Build \
    CC=${MPI_HOME}/bin/mpicc \
    LDFLAGS="-dynamic"
make install

HPL problem input:

cat hpl-2.3/Build/bin/HPL.dat
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
60000           Ns 5k per node
1            # of NBs
500          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
16            Ps
16            Qs
16.0         threshold
1            # of panel fact
1            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
2            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
1            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
0            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
0            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
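
As a quick sanity check on this input (not part of the original run), the sketch below recomputes a few derived quantities from the values in HPL.dat and the mpirun line: whether the P x Q grid matches the process count, how many processes land on each node, and roughly how much memory the matrix needs. The function name `summarize` and the dictionary keys are made up for illustration, and the memory figure counts only the N x N matrix of doubles, not HPL's workspace or MPI buffers.

```python
# Values copied from the HPL.dat and mpirun command in this report.
N, NB = 60000, 500      # matrix order and blocking factor
P, Q = 16, 16           # process grid
NP, NODES = 256, 8      # -np 256 on 8 Haswell nodes

BYTES_PER_DOUBLE = 8


def summarize(n, nb, p, q, nprocs, nodes):
    """Hypothetical helper: derive basic run facts from the HPL parameters."""
    matrix_bytes = n * n * BYTES_PER_DOUBLE  # N x N doubles only
    return {
        "grid_matches_np": p * q == nprocs,
        "procs_per_node": nprocs // nodes,
        "block_columns": n // nb,
        "matrix_gb": matrix_bytes / 1e9,
        "mb_per_process": matrix_bytes / nprocs / 1e6,
    }


summary = summarize(N, NB, P, Q, NP, NODES)
for key, value in summary.items():
    print(f"{key:16s} {value}")
```

With these inputs the grid and process count agree (16 x 16 = 256), and 32 processes per node matches `-npernode 32`, so the failure does not look like a simple mismatch between HPL.dat and the launch command.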

hppritcha commented 5 years ago

The SOURCE_SSID:AT_PF_INV:CPLTN_SRSP error code indicates that something is wrong with the source buffer. The descriptor type is 2, so it's an RDMA read initiated by the receiver process.
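
To make the descriptor dump easier to inspect, here is a minimal sketch that parses the `desc->gni_desc.*` lines from the log into a dictionary and pulls out the type and transfer length. It is not part of Open MPI. The names `parse_descriptor` and `TYPE_NAMES` are hypothetical, the only type mapping included is the one stated above (2 means RDMA read), and it assumes `type` and `length` are printed in decimal.

```python
import re

# A few lines copied from the dump in the original report.
DUMP = """\
desc->gni_desc.type             = 2
desc->gni_desc.local_addr       = 13295e8
desc->gni_desc.remote_addr      = 33e6768
desc->gni_desc.remote_mem_hndl  = {7ffffffffff56592, bb077e40c00acece}
desc->gni_desc.length           = 496000
"""

FIELD_RE = re.compile(r"^desc->gni_desc\.(\w+)\s*=\s*(.+)$")

# Only the mapping given in the comment above; other values are left unnamed.
TYPE_NAMES = {2: "RDMA read"}


def parse_descriptor(text):
    """Hypothetical helper: map field name -> raw string value."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line.strip())
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields


fields = parse_descriptor(DUMP)
post_type = int(fields["type"])        # assumed decimal
length_bytes = int(fields["length"])   # assumed decimal

print("type:", TYPE_NAMES.get(post_type, f"unknown ({post_type})"))
print("length:", length_bytes, "bytes =", length_bytes // 8, "doubles")
```

If the length is indeed in bytes, the failing transfer carries 62000 doubles, which points at a large send going through the RDMA path rather than a small eager message.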