open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

orte_init failure for parallel process #8017

Closed krl52 closed 10 months ago

krl52 commented 4 years ago

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

4.0.4

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

from source tarball

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Please describe the system on which you are running


Details of the problem

Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.

Note: If you include verbatim output (or a code block), please use a GitHub Markdown code block like below:

shell$ mpirun -np 2 ./hello_world

Hi all, I'm having issues when running a Python script in parallel. This is my input:

from ase import Atoms
from ase.optimize import BFGS
from ase.calculators.espresso import Espresso

atoms = Atoms('HOH',
              positions=[[0, 0, -1], [0, 1, 0], [0, 0, 1]])
atoms.center(vacuum=3.0)

ps = {'H': 'H.pbe-kjpaw.UPF',
      'O': 'O.pbe-kjpaw.UPF'}

calc = Espresso(pseudopotentials=ps,
                tprnfor=True,
                kpts=(3, 4, 1),
                pseudo_dir='$HOME/qe-6.5/pseudo')

atoms.calc = calc

opt = BFGS(atoms, trajectory='opt.traj')
opt.run(fmax=0.05)

It runs fine in serial. However, this is the full output returned after running the script with mpiexec -np 2 python3 script.py:

It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  getting local rank failed
  --> Returned value No permission (-17) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_init failed
  --> Returned value No permission (-17) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "No permission" (-17) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[MacBook-Pro.local:06020] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
rank=0 L00: Traceback (most recent call last):
rank=0 L01:   File "H2Ooptimization.py", line 30, in <module>
rank=0 L02:     opt.run(fmax=0.05)
rank=0 L03:   File "$HOME/Library/Python/3.8/lib/python/site-packages/ase/optimize/optimize.py", line 275, in run
rank=0 L04:     return Dynamics.run(self)
rank=0 L05:   File "$HOME/Library/Python/3.8/lib/python/site-packages/ase/optimize/optimize.py", line 162, in run
rank=0 L06:     for converged in Dynamics.irun(self):
rank=0 L07:   File "$HOME/Library/Python/3.8/lib/python/site-packages/ase/optimize/optimize.py", line 128, in irun
rank=0 L08:     self.atoms.get_forces()
rank=0 L09:   File "$HOME/Library/Python/3.8/lib/python/site-packages/ase/atoms.py", line 794, in get_forces
rank=0 L10:     forces = self._calc.get_forces(self)
rank=0 L11:   File "$HOME/Library/Python/3.8/lib/python/site-packages/ase/calculators/calculator.py", line 699, in get_forces
rank=0 L12:     return self.get_property('forces', atoms)
rank=0 L13:   File "$HOME/Library/Python/3.8/lib/python/site-packages/ase/calculators/calculator.py", line 738, in get_property
rank=0 L14:     self.calculate(atoms, [name], system_changes)
rank=0 L15:   File "$HOME/Library/Python/3.8/lib/python/site-packages/ase/calculators/calculator.py", line 939, in calculate
rank=0 L16:     raise CalculationFailed(msg)
rank=0 L17: ase.calculators.calculator.CalculationFailed: Calculator "espresso" failed with command "pw.x -in espresso.pwi > espresso.pwo" failed in $HOME/Desktop with error code 1
GPAW CLEANUP (node 0): <class 'ase.calculators.calculator.CalculationFailed'> occurred.  Calling MPI_Abort!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[MacBook-Pro.local:06022] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
rank=1 L00: Traceback (most recent call last):
rank=1 L01:   File "H2Ooptimization.py", line 30, in <module>
rank=1 L02:     opt.run(fmax=0.05)
rank=1 L03:   File "$HOME/Library/Python/3.8/lib/python/site-packages/ase/optimize/optimize.py", line 275, in run
rank=1 L04:     return Dynamics.run(self)
rank=1 L05:   File "$HOME/Library/Python/3.8/lib/python/site-packages/ase/optimize/optimize.py", line 162, in run
rank=1 L06:     for converged in Dynamics.irun(self):
rank=1 L07:   File "$HOME/Library/Python/3.8/lib/python/site-packages/ase/optimize/optimize.py", line 128, in irun
rank=1 L08:     self.atoms.get_forces()
rank=1 L09:   File "$HOME/Library/Python/3.8/lib/python/site-packages/ase/atoms.py", line 794, in get_forces
rank=1 L10:     forces = self._calc.get_forces(self)
rank=1 L11:   File "$HOME/Library/Python/3.8/lib/python/site-packages/ase/calculators/calculator.py", line 699, in get_forces
rank=1 L12:     return self.get_property('forces', atoms)
rank=1 L13:   File "$HOME/Library/Python/3.8/lib/python/site-packages/ase/calculators/calculator.py", line 738, in get_property
rank=1 L14:     self.calculate(atoms, [name], system_changes)
rank=1 L15:   File "$HOME/Library/Python/3.8/lib/python/site-packages/ase/calculators/calculator.py", line 939, in calculate
rank=1 L16:     raise CalculationFailed(msg)
rank=1 L17: ase.calculators.calculator.CalculationFailed: Calculator "espresso" failed with command "pw.x -in espresso.pwi > espresso.pwo" failed in $HOME/Desktop with error code 1
GPAW CLEANUP (node 1): <class 'ase.calculators.calculator.CalculationFailed'> occurred.  Calling MPI_Abort!
[MacBook-Pro.local:06015] 1 more process has sent help message help-orte-runtime.txt / orte_init:startup:internal-failure
[MacBook-Pro.local:06015] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[MacBook-Pro.local:06015] 1 more process has sent help message help-orte-runtime / orte_init:startup:internal-failure
[MacBook-Pro.local:06015] 1 more process has sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 42.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[MacBook-Pro:06015] *** Process received signal ***
[MacBook-Pro:06015] Signal: Segmentation fault: 11 (11)
[MacBook-Pro:06015] Signal code: Address not mapped (1)
[MacBook-Pro:06015] Failing at address: 0x30
[MacBook-Pro:06015] [ 0] 0   libsystem_platform.dylib            0x00007fff7174c5fd _sigtramp + 29
[MacBook-Pro:06015] [ 1] 0   ???                                 0x0000000000000000 0x0 + 0
[MacBook-Pro:06015] [ 2] 0   mca_pmix_pmix3x.so                  0x000000010acce1f7 OPAL_MCA_PMIX3X_pmix_ptl_base_recv_handler + 1289
[MacBook-Pro:06015] [ 3] 0   libevent_core-2.1.7.dylib           0x000000010a2acbde event_process_active_single_queue + 1074
[MacBook-Pro:06015] [ 4] 0   libevent_core-2.1.7.dylib           0x000000010a2a9e21 event_base_loop + 1012
[MacBook-Pro:06015] [ 5] 0   mca_pmix_pmix3x.so                  0x000000010ac9c428 progress_engine + 30
[MacBook-Pro:06015] [ 6] 0   libsystem_pthread.dylib             0x00007fff71758109 _pthread_start + 148
[MacBook-Pro:06015] [ 7] 0   libsystem_pthread.dylib             0x00007fff71753b8b thread_start + 15
[MacBook-Pro:06015] *** End of error message ***
Segmentation fault: 11

I tried running export TMPDIR=/tmp and export PATH=/usr/local/Cellar/open-mpi/4.0.4_1/bin:$PATH but to no avail. Any suggestions would be appreciated!
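
A quick way to sanity-check the environment before digging deeper (a sketch; the exact paths will differ per machine): Open MPI keeps its per-job session directories (and their Unix-domain sockets) under TMPDIR, so an unwritable or oddly set TMPDIR can surface as exactly this kind of orte_init failure, and an mpiexec that resolves to a different install than the libraries can do the same.

shell$ echo $TMPDIR
shell$ ls -ld "$TMPDIR"            # the directory must exist and be writable
shell$ touch "$TMPDIR/ompi-write-test" && rm "$TMPDIR/ompi-write-test"
shell$ which mpiexec               # confirm it resolves to the intended install
shell$ mpiexec --version           # e.g. the Homebrew-built 4.0.4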

GoddessLuBoYan commented 10 months ago

Hi, I'm now hitting the same bug. How did you fix it? Thanks.

jsquyres commented 10 months ago

@GoddessLuBoYan This issue is so old that it's probably quite stale (indeed, the version cited is quite old). Are you doing the same thing as the initial reporter in this issue (i.e., using Open MPI v4.0.4 to run a Python script in parallel that apparently doesn't use the MPI API at all)? If not, can you open a new issue and provide all diagnostic information that is requested?

GoddessLuBoYan commented 10 months ago

> @GoddessLuBoYan This issue is so old that it's probably quite stale (indeed, the version cited is quite old). Are you doing the same thing as the initial reporter in this issue (i.e., using Open MPI v4.0.4 to run a Python script in parallel that apparently doesn't use the MPI API at all)? If not, can you open a new issue and provide all diagnostic information that is requested?

I have solved my problem; it was NOT caused by Open MPI. Thanks.

njzjz commented 7 months ago

I hit the same problem and found this issue via Google. It turned out that the cause was an inaccessible TMPDIR.

The reason is interesting: I was using Open MPI with Slurm + Singularity. Slurm automatically set the environment variable TMPDIR for me, and Singularity passed that variable through to the container, but the directory it pointed to was not actually accessible inside the container. Adding --bind $TMPDIR to the Singularity command resolved the problem, as sketched below.
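
For anyone with the same setup, a sketch of what that looks like on the launch line (the image and script names are placeholders, and whether mpirun sits inside or outside the container depends on the installation; the key part is binding the Slurm-provided TMPDIR into the container):

shell$ # image.sif and script.py are placeholders for the real container and workload
shell$ mpirun -np 2 singularity exec --bind "$TMPDIR" image.sif python3 script.py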