On Cori, using UGNI, I consistently get the error below at the same point of the execution when running HPL on 8 Haswell nodes with 256 processes:
mpirun -npernode 32 --bind-to core -np 256 xhpl
================================================================================
HPLinpack 2.3 -- High-Performance Linpack benchmark -- December 2, 2018
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 60000
NB : 500
PMAP : Row-major process mapping
P : 16
Q : 16
PFACT : Crout
NBMIN : 2
NDIV : 2
RFACT : Crout
BCAST : 1ring
DEPTH : 0
SWAP : Mix (threshold = 64)
L1 : transposed form
U : transposed form
EQUIL : yes
ALIGN : 8 double precision words
--------------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
Warning :: opal_list_remove_item - the item 0x879f80 is not on the list 0x876ac0
[nid00202][[32284,1],151][btl_ugni_module.c:356:mca_btl_ugni_device_handle_event_error] giving up on desciptor 0x272f440, recoverable 0: SOURCE_SSID:AT_PF_INV:CPLTN_SRSP
desc->gni_desc.post_id = 0
desc->gni_desc.status = b
desc->gni_desc.cq_mode_complete = 33044
desc->gni_desc.type = 2
desc->gni_desc.cq_mode = 2
desc->gni_desc.dlvr_mode = 0
desc->gni_desc.local_addr = 13295e8
desc->gni_desc.local_mem_hndl = {d2affffffff3b6ea, 84077e40c00c5cd7}
desc->gni_desc.remote_addr = 33e6768
desc->gni_desc.remote_mem_hndl = {7ffffffffff56592, bb077e40c00acece}
desc->gni_desc.length = 496000
desc->gni_desc.rdma_mode = 0
desc->gni_desc.amo_cmd = -42380
xhpl: pml_ob1_sendreq.h:219: mca_pml_ob1_send_request_fini: Assertion `((void *)0) == sendreq->rdma_frag' failed.
[nid00208:03799] *** Process received signal ***
[nid00208:03799] Signal: Aborted (6)
[nid00208:03799] Signal code: (-6)
[nid00208:03799] [ 0] /lib64/libc.so.6(+0x34fe0)[0x2aaaac706fe0]
[nid00208:03799] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x2aaaac706f67]
[nid00208:03799] [ 2] /lib64/libc.so.6(abort+0x13a)[0x2aaaac70833a]
[nid00208:03799] [ 3] /lib64/libc.so.6(+0x2dd66)[0x2aaaac6ffd66]
[nid00208:03799] [ 4] /lib64/libc.so.6(+0x2de12)[0x2aaaac6ffe12]
[nid00208:03799] [ 5] /global/homes/n/nlosada/ompi/Build/lib/openmpi/mca_pml_ob1.so(+0x11855)[0x2aaac20fc855]
[nid00208:03799] [ 6] /global/homes/n/nlosada/ompi/Build/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x644)[0x2aaac20fde4d]
[nid00208:03799] [ 7] /global/homes/n/nlosada/ompi/Build/lib/libmpi.so.0(PMPI_Send+0x2b0)[0x2aaaabfb538a]
[nid00208:03799] [ 8] xhpl[0x41ba8b]
[nid00208:03799] [ 9] xhpl[0x4112c7]
[nid00208:03799] [10] xhpl[0x40f493]
[nid00208:03799] [11] xhpl[0x40f9bf]
[nid00208:03799] [12] xhpl[0x40d625]
[nid00208:03799] [13] xhpl[0x4060af]
[nid00208:03799] [14] xhpl[0x401a1f]
[nid00208:03799] [15] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2aaaac6f2725]
[nid00208:03799] [16] xhpl[0x401de9]
[nid00208:03799] *** End of error message ***
[nid00208:03798] *** Process received signal ***
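As a quick sanity check on the run geometry, the numbers in the output above can be cross-checked with a few lines of Python. This is only a sketch: every value is copied from the command line and log in this report, and the remark about the failed descriptor length being a multiple of NB is an observation, not a verified diagnosis.

```python
# Values copied from the mpirun command and the HPL output in this report.
N, NB = 60000, 500            # matrix order and blocking factor
P, Q = 16, 16                 # process grid
nodes, ranks_per_node = 8, 32

ranks = nodes * ranks_per_node
assert P * Q == ranks, "process grid does not match the number of MPI ranks"

# Memory needed for the N x N double-precision coefficient matrix.
matrix_bytes = N * N * 8
per_node_gib = matrix_bytes / nodes / 2**30
per_rank_mib = matrix_bytes / ranks / 2**20

# Length of the failed RDMA descriptor (desc->gni_desc.length), in bytes.
failed_len = 496000
failed_doubles = failed_len // 8
rows_of_nb = failed_doubles // NB   # observation only: length is a multiple of NB

print(f"ranks: {ranks}")
print(f"matrix per node: {per_node_gib:.2f} GiB, per rank: {per_rank_mib:.1f} MiB")
print(f"failed transfer: {failed_doubles} doubles = {rows_of_nb} x NB")
```

The grid (16 x 16) matches the 256 ranks, so the failure does not look like a simple mismatch between the HPL input and the launch command.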
OpenMPI master commit 5bd90ee548a8168982dbb62d0e38ed261d18bffa, built as follows:
module load PrgEnv-gnu
module rm cray-mpich/7.7.3
module rm darshan/3.1.4
module rm cray-libsci/18.07.1
module rm PrgEnv-intel/6.0.4

./autogen.pl
./configure --prefix=${DIR}/Build \
    --enable-debug \
    -enable-orterun-prefix-by-default \
    --disable-java --disable-oshmem --enable-shared --disable-static \
    CC=cc FC=ftn CXX=CC LDFLAGS=-dynamic --without-xpmem --without-verbs
make -j24 install
HPL 2.3 (https://www.netlib.org/benchmark/hpl/hpl-2.3.tar.gz), built as follows:
./configure --prefix=$PWD/Build \
    CC=${MPI_HOME}/bin/mpicc \
    LDFLAGS="-dynamic"
make install
HPL problem input: