ofi-cray / libfabric-cray

Open Fabric Interfaces
http://ofiwg.github.io/libfabric/

Running mini apps with 1k ranks and OpenMPI causes seg fault #883

Open · tenbrugg opened this issue 8 years ago

tenbrugg commented 8 years ago

This issue tracks the OpenMPI seg fault problem discussed last week. When running SNAP with OpenMPI on KNL at 1024 ranks, the application seg faults during initialization. The problem does not occur when running with MPICH.

srun -n 1024 -N 16 --cpu_bind=none --hint=nomultithread --exclusive ../../../SNAP/src/gsnap 1024tasksSTlibfab.input

Stock nightly libfabric and OpenMPI libraries are used from Sung's install directory. More details can be supplied if desired.
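The trace below shows the fault inside the MPI_Comm_dup that SNAP's pinit issues at startup, so a minimal program that only duplicates MPI_COMM_WORLD should, assuming nothing SNAP-specific is involved, exercise the same path. A hypothetical standalone reproducer (not part of SNAP) would look like:

```c
/* Hypothetical minimal reproducer for the MPI_Comm_dup path seen in the
 * backtrace below; it mirrors the call sequence, not any SNAP code. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm dup_comm;

    MPI_Init(&argc, &argv);

    /* SNAP's pinit duplicates MPI_COMM_WORLD during initialization; the
     * CID-allocation allreduce inside this call is where the seg fault
     * shows up at 1024 ranks with the OFI MTL. */
    MPI_Comm_dup(MPI_COMM_WORLD, &dup_comm);

    MPI_Comm_free(&dup_comm);
    MPI_Finalize();
    return 0;
}
```

Launching this with the same srun line at 1024 ranks would confirm whether the fault is tied to communicator duplication itself rather than anything in the mini app.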

Core was generated by `/cray/css/u19/c17581/snap/nersc/SNAPJune13/small/../../../SNAP/src/gsnap 1024ta'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007ffff67112cf in ompi_mtl_ofi_irecv (mtl=0x7ffff6a50d20, comm=0x7ffff6a59b20 <ompi_mpi_comm_world>, src=119, tag=-12, convertor=0x80f710, mtl_request=0x80f820) at mtl_ofi.h:537
537         remote_addr = endpoint->peer_fiaddr;
(gdb) where
#0  0x00007ffff67112cf in ompi_mtl_ofi_irecv (mtl=0x7ffff6a50d20, comm=0x7ffff6a59b20 <ompi_mpi_comm_world>, src=119, tag=-12, convertor=0x80f710, mtl_request=0x80f820) at mtl_ofi.h:537
#1  0x00007ffff6774cb1 in mca_pml_cm_irecv (addr=0x930530, count=1, datatype=0x7ffff6a45040 <ompi_mpi_int>, src=119, tag=-12, comm=0x7ffff6a59b20 <ompi_mpi_comm_world>, request=0x7fffffff6b50) at pml_cm.h:119
#2  0x00007ffff66635a2 in ompi_coll_base_allreduce_intra_recursivedoubling (sbuf=0x7fffffff6d24, rbuf=0x7fffffff6d28, count=1, dtype=0x7ffff6a45040 <ompi_mpi_int>, op=0x7ffff6a64720 <ompi_mpi_op_max>, comm=0x7ffff6a59b20 <ompi_mpi_comm_world>, module=0x818740) at base/coll_base_allreduce.c:221
#3  0x00007ffff666d03a in ompi_coll_tuned_allreduce_intra_dec_fixed (sbuf=0x7fffffff6d24, rbuf=0x7fffffff6d28, count=1, dtype=0x7ffff6a45040 <ompi_mpi_int>, op=0x7ffff6a64720 <ompi_mpi_op_max>, comm=0x7ffff6a59b20 <ompi_mpi_comm_world>, module=0x818740) at coll_tuned_decision_fixed.c:66
#4  0x00007ffff65a6f62 in ompi_comm_allreduce_intra (inbuf=0x7fffffff6d24, outbuf=0x7fffffff6d28, count=1, op=0x7ffff6a64720 <ompi_mpi_op_max>, comm=0x7ffff6a59b20 <ompi_mpi_comm_world>, bridgecomm=0x0, local_leader=0x0, remote_leader=0x0, send_first=-1, tag=0x7ffff6798b5e "nextcid", iter=0) at communicator/comm_cid.c:878
#5  0x00007ffff65a5963 in ompi_comm_nextcid (newcomm=0x932490, comm=0x7ffff6a59b20 <ompi_mpi_comm_world>, bridgecomm=0x0, local_leader=0x0, remote_leader=0x0, mode=32, send_first=-1) at communicator/comm_cid.c:221
#6  0x00007ffff65a2875 in ompi_comm_dup_with_info (comm=0x7ffff6a59b20 <ompi_mpi_comm_world>, info=0x0, newcomm=0x7fffffff6e48) at communicator/comm.c:1037
#7  0x00007ffff65a2760 in ompi_comm_dup (comm=0x7ffff6a59b20 <ompi_mpi_comm_world>, newcomm=0x7fffffff6e48) at communicator/comm.c:998
#8  0x00007ffff65f031c in PMPI_Comm_dup (comm=0x7ffff6a59b20 <ompi_mpi_comm_world>, newcomm=0x7fffffff6e48) at pcomm_dup.c:63
#9  0x00007ffff6ab86e0 in ompi_comm_dup_f (comm=0x43a858, newcomm=0x63f46c <__plib_module_MOD_comm_snap>, ierr=0x7fffffff6e78) at pcomm_dup_f.c:76
#10 0x0000000000404410 in __plib_module_MOD_pinit ()
#11 0x00000000004023bc in MAIN__ ()
#12 0x00000000004021fd in main ()
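The top frame faults while reading `endpoint->peer_fiaddr`, which suggests the per-peer endpoint pointer for source rank 119 was NULL or not yet set up when the CID allreduce posted its receive. That is only a guess from the trace; the sketch below uses made-up names (not the actual ompi_mtl_ofi code) just to illustrate that failure mode and what a defensive check at that spot would look like:

```c
/* Simplified sketch of the suspected failure mode (hypothetical names,
 * not the Open MPI source): if the per-peer endpoint was never resolved,
 * the unconditional dereference at mtl_ofi.h:537 seg faults. */
#include <stdint.h>
#include <stdio.h>

typedef uint64_t fi_addr_t;       /* matches libfabric's typedef */

struct peer_endpoint {            /* stand-in for the MTL's per-peer state */
    fi_addr_t peer_fiaddr;
};

static fi_addr_t lookup_remote_addr(struct peer_endpoint *endpoint)
{
    /* The crash corresponds to doing this unconditionally:
     *     return endpoint->peer_fiaddr;
     * A defensive variant turns the seg fault into a diagnosable error: */
    if (endpoint == NULL) {
        fprintf(stderr, "peer endpoint not resolved\n");
        return (fi_addr_t)-1;     /* FI_ADDR_UNSPEC-like sentinel */
    }
    return endpoint->peer_fiaddr;
}

int main(void)
{
    /* Calling with NULL demonstrates the failure mode without crashing. */
    fi_addr_t a = lookup_remote_addr(NULL);
    printf("remote_addr = %llu\n", (unsigned long long)a);
    return 0;
}
```

Whether the endpoint really is NULL at 1024 ranks (e.g. an address-vector insertion that failed or has not completed) would need to be confirmed in gdb by printing the endpoint pointer in frame #0.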

tenbrugg commented 8 years ago

Changed the title to reflect that this is a general problem for mini apps at high rank count, not just limited to SNAP.

hppritcha commented 8 years ago

@sungeunchoi what configure options are you using to build Open MPI for these mini-app builds?

sungeunchoi commented 8 years ago

@hppritcha Other than --prefix and --with-libfabric, I use --enable-mpi-thread-multiple --disable-dlopen --with-verbs=no.