Open tenbrugg opened 8 years ago
Changed the title to reflect that this is a general problem for mini apps at high rank counts, not one limited to SNAP.
@sungeunchoi what configure options are you using to build Open MPI for these mini-app builds?
@hppritcha Other than --prefix and --with-libfabric, I use --enable-mpi-thread-multiple --disable-dlopen --with-verbs=no.
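For anyone trying to reproduce the build, the options listed in this comment would combine into a configure invocation roughly like the one below. This is only a sketch: the two directory variables are hypothetical placeholders, since the actual prefix and libfabric install paths are not given in this thread.

```shell
# Hypothetical placeholder paths; the real install locations are not stated in this issue.
OMPI_PREFIX="$HOME/install/openmpi"
LIBFABRIC_DIR="$HOME/install/libfabric"

# Run from the top of an Open MPI source tree.
# The flags are the ones quoted in the comment above.
./configure \
    --prefix="$OMPI_PREFIX" \
    --with-libfabric="$LIBFABRIC_DIR" \
    --enable-mpi-thread-multiple \
    --disable-dlopen \
    --with-verbs=no
```

With --disable-dlopen the MCA components are linked into the Open MPI libraries rather than loaded as plugins at run time, which would be consistent with the OFI MTL frames in the backtrace resolving directly in gdb.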
This issue is intended to track the OpenMPI seg fault problem discussed last week. When running SNAP with OpenMPI on KNL using 1024 ranks, the application seg faults during initialization. The problem does not occur when running with MPICH instead.
srun -n 1024 -N 16 --cpu_bind=none --hint=nomultithread --exclusive ../../../SNAP/src/gsnap 1024tasksSTlibfab.input
Stock nightly libfabric and OpenMPI libraries are used from Sung's install directory. More details can be supplied if desired.
Core was generated by `/cray/css/u19/c17581/snap/nersc/SNAPJune13/small/../../../SNAP/src/gsnap 1024ta'.
Program terminated with signal 11, Segmentation fault.
#0 0x00007ffff67112cf in ompi_mtl_ofi_irecv (mtl=0x7ffff6a50d20,
537 remote_addr = endpoint->peer_fiaddr;
(gdb) where
#0 0x00007ffff67112cf in ompi_mtl_ofi_irecv (mtl=0x7ffff6a50d20, comm=
#1 0x00007ffff6774cb1 in mca_pml_cm_irecv (addr=0x930530, count=1, datatype=
#2 0x00007ffff66635a2 in ompi_coll_base_allreduce_intra_recursivedoubling (sbuf=0x7fffffff6d24, rbuf=
#3 0x00007ffff666d03a in ompi_coll_tuned_allreduce_intra_dec_fixed (sbuf=0x7fffffff6d24, rbuf=
#4 0x00007ffff65a6f62 in ompi_comm_allreduce_intra (inbuf=0x7fffffff6d24, outbuf=0x7fffffff6d28, count=
#5 0x00007ffff65a5963 in ompi_comm_nextcid (newcomm=0x932490, comm=
#6 0x00007ffff65a2875 in ompi_comm_dup_with_info (comm=0x7ffff6a59b20, info=0x0,
#7 0x00007ffff65a2760 in ompi_comm_dup (comm=0x7ffff6a59b20, newcomm=
#8 0x00007ffff65f031c in PMPI_Comm_dup (comm=0x7ffff6a59b20, newcomm=
#9 0x00007ffff6ab86e0 in ompi_comm_dup_f (comm=0x43a858, newcomm=
#10 0x0000000000404410 in __plib_module_MOD_pinit ()
#11 0x00000000004023bc in MAIN__ ()
#12 0x00000000004021fd in main ()