open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

OpenMPI 4.0.1 crashing ... #6981

Closed: miroi closed this issue 5 years ago

miroi commented 5 years ago

Hello,

My Open MPI applications are crashing on our cluster; we do not know whether this is due to an old Linux kernel. Here is the info:

Open MPI was installed as:

milias@login.grid.umb.sk:~/bin/openmpi-4.0.1_suites/openmpi-4.0.1_Intel14_GNU6.3g++/../configure --prefix=$PWD CXX=g++ CC=icc F77=ifort FC=ifort

with g++ 6.3, ifort/icc 14.01.
milias@comp04:~/.uname -a
Linux comp04 2.6.32-754.2.1.el6.x86_64 #1 SMP Tue Jul 10 13:23:59 CDT 2018 x86_64 x86_64 x86_64 GNU/Linux
milias@comp04:~/.mpirun --version
mpirun (Open MPI) 4.0.1
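
For reference, one quick sanity check is to ask the installed Open MPI which PMIx support it was built with; this is only a diagnostic sketch (ompi_info ships with Open MPI, and the grep just filters its component listing):

  # list the PMIx-related components compiled into this Open MPI install
  ompi_info | grep -i pmix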

Error when running the application:

 Running mpirun -np 8 /home/milias/Work/qch/software/lammps/lammps_stable/src/lmp_mpi -in in.melt
[comp04:20835] PMIX ERROR: OUT-OF-RESOURCE in file dstore_segment.c at line 196
[comp04:20835] PMIX ERROR: OUT-OF-RESOURCE in file dstore_base.c at line 538
[comp04:20835] PMIX ERROR: ERROR in file dstore_base.c at line 2414
[comp04:20835] PMIX ERROR: OUT-OF-RESOURCE in file dstore_segment.c at line 196
[comp04:20835] PMIX ERROR: OUT-OF-RESOURCE in file dstore_base.c at line 538
[comp04:20835] PMIX ERROR: ERROR in file dstore_base.c at line 2414
[comp04:20853] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 165
[comp04:20853] PMIX ERROR: NOT-FOUND in file gds_ds12_lock_pthread.c at line 199
[comp04:20858] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 165
[comp04:20858] PMIX ERROR: NOT-FOUND in file gds_ds12_lock_pthread.c at line 199
[comp04:20854] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 165
[comp04:20854] PMIX ERROR: NOT-FOUND in file gds_ds12_lock_pthread.c at line 199
[comp04:20852] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 165
[comp04:20852] PMIX ERROR: NOT-FOUND in file gds_ds12_lock_pthread.c at line 199
[comp04:20856] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 165
[comp04:20856] PMIX ERROR: NOT-FOUND in file gds_ds12_lock_pthread.c at line 199
[comp04:20855] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 165
[comp04:20855] PMIX ERROR: NOT-FOUND in file gds_ds12_lock_pthread.c at line 199
[comp04:20857] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 165
[comp04:20857] PMIX ERROR: NOT-FOUND in file gds_ds12_lock_pthread.c at line 199
[comp04:20851] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 165
[comp04:20851] PMIX ERROR: NOT-FOUND in file gds_ds12_lock_pthread.c at line 199
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
  orte_ess_init failed
  --> Returned value No permission (-17) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "No permission" (-17) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[comp04:20857] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[comp04:20854] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[comp04:20858] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[comp04:20852] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[comp04:20851] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[comp04:20855] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[61382,1],0]
  Exit code:    1
--------------------------------------------------------------------------
[comp04:20835] 7 more processes have sent help message help-orte-runtime.txt / orte_init:startup:internal-failure
[comp04:20835] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[comp04:20835] 7 more processes have sent help message help-orte-runtime / orte_init:startup:internal-failure
[comp04:20835] 7 more processes have sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure
[comp04:20835] PMIX ERROR: NOT-FOUND in file gds_ds12_lock_pthread.c at line 199
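
As the aggregated help message in the log itself suggests, the full per-process errors can be shown by disabling aggregation; a sketch of such a rerun (same binary and input as above):

  # print every rank's help/error message instead of aggregating them
  mpirun --mca orte_base_help_aggregate 0 -np 8 /home/milias/Work/qch/software/lammps/lammps_stable/src/lmp_mpi -in in.melt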

On the comp04 node, the g++ version is lower:

milias@comp04:~/.mpiCC --version
g++ (GCC) 4.4.7 20120313 (Red Hat 4.4.7-23)
Copyright (C) 2010 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
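
(For the record, a quick way to compare the toolchains seen on the two machines; the one-liner below is only a sketch and assumes password-less ssh from the login node to comp04:)

  # show which g++ and mpiCC are picked up on each host
  for host in login.grid.umb.sk comp04; do ssh $host 'hostname; g++ --version | head -1; mpiCC --version | head -1'; done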
miroi commented 5 years ago

Well, on the main node the application runs fine (there we have g++ 6.3 due to the installed devtoolset):

milias@login.grid.umb.sk:~/Work/open-collection/theoretical_chemistry/software_runs/lammps/runs/melt/.mpirun -np 4 /home/milias/Work/qch/software/lammps/lammps_stable/src/lmp_mpi -in in.melt
LAMMPS (7 Aug 2019)
Lattice spacing in x,y,z = 1.6796 1.6796 1.6796
Created orthogonal box = (0 0 0) to (16.796 16.796 16.796)
  1 by 2 by 2 MPI processor grid
Created 4000 atoms
  create_atoms CPU = 0.000918659 secs
Neighbor list info ...
  update every 20 steps, delay 0 steps, check no
  max neighbors/atom: 2000, page size: 100000
  master list distance cutoff = 2.8
  ghost atom cutoff = 2.8
  binsize = 1.4, bins = 12 12 12
  1 neighbor lists, perpetual/occasional/extra = 1 0 0
  (1) pair lj/cut, perpetual
      attributes: half, newton on
      pair build: half/bin/atomonly/newton
      stencil: half/bin/3d/newton
      bin: standard
Setting up Verlet run ...
  Unit style    : lj
  Current step  : 0
  Time step     : 0.005
Per MPI rank memory allocation (min/avg/max) = 2.706 | 2.706 | 2.706 Mbytes
Step Temp E_pair E_mol TotEng Press 
       0            3   -6.7733681            0   -2.2744931   -3.7033504 
      50    1.6754119   -4.7947589            0   -2.2822693    5.6615925 
     100    1.6503357    -4.756014            0   -2.2811293    5.8050524 
     150    1.6596605   -4.7699432            0   -2.2810749    5.7830138 
     200    1.6371874   -4.7365462            0   -2.2813789    5.9246674 
     250    1.6323462   -4.7292021            0   -2.2812949    5.9762238 
Loop time of 0.34736 on 4 procs for 250 steps with 4000 atoms

Performance: 310916.549 tau/day, 719.714 timesteps/s
92.4% CPU use with 4 MPI tasks x no OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 0.22404    | 0.23568    | 0.25712    |   2.6 | 67.85
Neigh   | 0.027579   | 0.028371   | 0.029572   |   0.5 |  8.17
Comm    | 0.04958    | 0.072869   | 0.084434   |   5.1 | 20.98
Output  | 0.0003319  | 0.00036758 | 0.00042612 |   0.0 |  0.11
Modify  | 0.0057379  | 0.0059653  | 0.006437   |   0.4 |  1.72
Other   |            | 0.004107   |            |       |  1.18

Nlocal:    1000 ave 1010 max 982 min
Histogram: 1 0 0 0 0 0 1 0 0 2
Nghost:    2703.75 ave 2713 max 2689 min
Histogram: 1 0 0 0 0 0 0 2 0 1
Neighs:    37915.5 ave 39239 max 36193 min
Histogram: 1 0 0 0 0 1 1 0 0 1

Total # of neighbors = 151662
Ave neighs/atom = 37.9155
Neighbor list builds = 12
Dangerous builds not checked
Total wall time: 0:00:00
milias@login.grid.umb.sk:~/Work/open-collection/theoretical_chemistry/software_runs/lammps/runs/melt/.mpirun --version
mpirun (Open MPI) 4.0.1

Report bugs to http://www.open-mpi.org/community/help/
milias@login.grid.umb.sk:~/Work/open-collection/theoretical_chemistry/software_runs/lammps/runs/melt/.mpiCC --version
g++ (GCC) 6.3.1 20170216 (Red Hat 6.3.1-3)
Copyright (C) 2016 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
rhc54 commented 5 years ago

Set PMIX_MCA_gds=hash in your environment - that should fix the problem.
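
For anyone hitting the same errors, a sketch of applying that workaround (mpirun's -x flag exports an environment variable to the launched processes; the lmp_mpi path is the one from the report above):

  # either export it for the whole session ...
  export PMIX_MCA_gds=hash
  mpirun -np 8 /home/milias/Work/qch/software/lammps/lammps_stable/src/lmp_mpi -in in.melt

  # ... or pass it just for this run
  mpirun -x PMIX_MCA_gds=hash -np 8 /home/milias/Work/qch/software/lammps/lammps_stable/src/lmp_mpi -in in.melt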

miroi commented 5 years ago

Yes, this helped! Many thanks, I am closing this issue as SOLVED.