Open yuyuanjingxuan opened 3 years ago
Hi @yuyuanjingxuan
subprocess.CalledProcessError: Command '['/u/zhewa/packages/pyiron-resources-cmmc/lammps/bin/run_lammps_2019.06.05_default_mpi.sh', '2', '1']' returned non-zero exit status 1.
You seem to be using your own resources and, thus, your own run scripts. Could you try to run your calculation via this run script and compare it to the global version?
Yes, I tried that. These are the contents of run_lammps_2019.06.05_default_mpi.sh:
#!/bin/bash
module purge
module load pyiron/dev
mpiexec -n $1 lmp_mpi -in control.inp
Sorry for the misunderstanding. By command line I meant running the commands in this script. For the last line, I used mpiexec -n 2 lmp_mpi -in control.inp instead, and the job ran without any issues.
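A way to see the script's real error output outside of pyiron is to reproduce the same subprocess call that pyiron makes and inspect the captured output on failure. This is a sketch with a stand-in command that fails on purpose; on the cluster, the command would be the run_lammps_*.sh resource script with the core and thread counts as arguments:

```python
import subprocess

# Stand-in for ["/path/to/run_lammps_2019.06.05_default_mpi.sh", "2", "1"]:
# a command that writes an error and exits non-zero, like the failing script.
cmd = ["/bin/sh", "-c", "echo 'mpi init failed' >&2; exit 1"]
try:
    output = subprocess.check_output(
        cmd,
        stderr=subprocess.STDOUT,   # merge stderr so the error text is captured
        universal_newlines=True,
    )
    status = 0
except subprocess.CalledProcessError as err:
    # err.output holds the combined output -- the actual error message hidden
    # behind pyiron's bare "returned non-zero exit status 1"
    output, status = err.output, err.returncode

print("exit status:", status)
print(output.strip())
```

Running the real script this way surfaces the underlying MPI/LAMMPS error instead of only the exit code.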
@yuyuanjingxuan, do you use a custom build of lammps, instead of the one installed via conda?
No, the version I use is in /u/system/SLES12/soft/pyiron/dev/anaconda3/bin/, which should be installed via conda.
In fact, I did not change the local environment at all. These jobs were submitted one week ago. I am confused that the jobs run last week produced no errors, while those run yesterday failed. None of them survived.
@yuyuanjingxuan, I don't know if this is the cause of the issue, but it seems that you have changed the path to the pyiron resources in your ~/.pyiron file. Please make sure that RESOURCE_PATHS = /u/system/SLES12/soft/pyiron/dev/pyiron-resources-cmmc
In the global resources we have an additional option to the run command. We use
mpiexec -n $1 --oversubscribe lmp_mpi -in control.inp;
I tried to add --oversubscribe in run_lammps_2019.06.05_default_mpi.sh, and it still outputs the same error messages. The job runs when I invoke run_lammps_2019.06.05_default_mpi.sh directly, while it fails when I run run_queue.sh in the local directory. run_queue.sh:
#!/bin/bash
#SBATCH --partition=s.cmfe
#SBATCH --ntasks=2
#SBATCH --constraint='[swi1|swi1|swi2|swi3|swi4|swi5|swi6|swi7|swi8|swi9|swe1|swe2|swe3|swe4|swe5|swe6|swe7]'
#SBATCH --time=5760
#SBATCH --mem-per-cpu=3GB
#SBATCH --output=time.out
#SBATCH --error=error.out
#SBATCH --job-name=pi_job-id
#SBATCH --chdir=path-of-job
#SBATCH --get-user-env=L
python -m pyiron_base.cli wrapper -p path-of-job -j job-id
Is it a problem with run_queue.sh?
I also tried to run my job via the global version. Before running, I updated my resources: module load pyiron/dev has been added to the .bashrc. But I still got a similar error message:
subprocess.CalledProcessError: Command '['/u/system/SLES12/soft/pyiron/dev/anaconda3/share/pyiron/lammps/bin/run_lammps_2020.03.03_mpi.sh', '2', '1']' returned non-zero exit status 255.
We found where the issue comes from. It is still a module environment problem. I unloaded all local modules and kept the submission scripts the same as the global version. Then everything works fine.
I am still curious how the other modules killed the workflow, but I am glad that things work again!
Agree with @niklassiemer: if the issue was that the submission or resource scripts didn't purge the module environment thoroughly enough, we should change them to be more robust.
@yuyuanjingxuan Can you detail the exact steps you took to fix the issue?
Yes, sure @pmrv. Attached are the steps I took to fix the issue.
P.S. The issue is not fully resolved. I recently found that there is no issue for the serial version of lammps. For the parallel version (with only two cores), it works if I do not set queue, while it still produces the same errors if I set queue='cm'.
I sent the examples to the lammps users in the group, and I am not the only one who has met this issue. Jobs submitted with the parallel version of lammps on the cm queue always crash. Attached are the error messages.
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
getting local rank failed
--> Returned value No permission (-17) instead of ORTE_SUCCESS
...
...
2021-10-07 09:48:40,257 - pyiron_log - WARNING - Job aborted
2021-10-07 09:48:40,257 - pyiron_log - WARNING - Job aborted
2021-10-07 09:48:40,257 - pyiron_log - WARNING - Job aborted
2021-10-07 09:48:40,258 - pyiron_log - WARNING - *** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[cmti148:39772] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[cmti148:39771] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
...
...
Traceback (most recent call last):
File "/u/system/SLES12/soft/pyiron/dev/anaconda3/lib/python3.7/site-packages/pyiron_base/job/generic.py", line 735, in run_static
universal_newlines=True,
File "/u/system/SLES12/soft/pyiron/dev/anaconda3/lib/python3.7/subprocess.py", line 411, in check_output
**kwargs).stdout
File "/u/system/SLES12/soft/pyiron/dev/anaconda3/lib/python3.7/subprocess.py", line 512, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/u/system/SLES12/soft/pyiron/dev/pyiron-resources-cmmc/lammps/bin/run_lammps_2019.06.05_default_mpi.sh', '2', '1']' returned non-zero exit status 1.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/u/system/SLES12/soft/pyiron/dev/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/u/system/SLES12/soft/pyiron/dev/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/u/system/SLES12/soft/pyiron/dev/anaconda3/lib/python3.7/site-packages/pyiron_base/cli/__main__.py", line 2, in <module>
main()
File "/u/system/SLES12/soft/pyiron/dev/anaconda3/lib/python3.7/site-packages/pyiron_base/cli/__init__.py", line 61, in main
args.cli(args)
File "/u/system/SLES12/soft/pyiron/dev/anaconda3/lib/python3.7/site-packages/pyiron_base/cli/wrapper.py", line 39, in main
submit_on_remote=args.submit
File "/u/system/SLES12/soft/pyiron/dev/anaconda3/lib/python3.7/site-packages/pyiron_base/job/wrapper.py", line 158, in job_wrapper_function
job.run()
File "/u/system/SLES12/soft/pyiron/dev/anaconda3/lib/python3.7/site-packages/pyiron_base/job/wrapper.py", line 117, in run
self.job.run_static()
File "/u/system/SLES12/soft/pyiron/dev/anaconda3/lib/python3.7/site-packages/pyiron_base/job/generic.py", line 756, in run_static
raise RuntimeError("Job aborted")
RuntimeError: Job aborted
[cmti148:39742] 1 more process has sent help message help-orte-runtime.txt / orte_init:startup:internal-failure
[cmti148:39742] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[cmti148:39742] 1 more process has sent help message help-orte-runtime / orte_init:startup:internal-failure
[cmti148:39742] 1 more process has sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure
slurmstepd: error: *** JOB 2455220 ON cmti148 CANCELLED AT 2021-10-07T09:50:44 DUE TO TIME LIMIT ***
The current versions of lammps were updated on Aug 15. We speculate that some libraries lammps relies on might have changed recently. Could that be the reason?
It could be related to e.g. the MPI library (the executable dies in MPI init); however, the jobs do run without pyiron? Could you quickly check which of the following cases are working?
- The job runs with the pyiron/dev module and the provided lmp_mpi executable.
- The job runs with lmp_mpi from pyiron/dev, but without loading the pyiron/dev module.
- The job runs without the pyiron/dev module, using the provided lmp_mpi executable.
I think I'm seeing the same problem now and will try to investigate.
In my test case it even fails when run with run_mode='manual' and then running the job from the CLI. Copying the global lammps resource script into the very same folder and running it works though...
Serial jobs run both on the queue and on the login node, so it must be something with the MPI libraries.
Short summary of today's findings:
I drilled a little into how pyiron launches the script:
out = subprocess.check_output(
[
self.executable.executable_path,
str(self.server.cores),
str(self.server.threads),
],
cwd=self.project_hdf5.working_directory,
shell=False,
stderr=subprocess.STDOUT,
universal_newlines=True,
)
Interestingly when I run this exact snippet in a normal python shell in the working directory of the job (with the variables substituted) it runs without problems. Yet when I run the pyiron job wrapper under a debugger, i.e.
python -m pdb -m pyiron_base.cli wrapper -p /cmmc/u/zora/scratch/lmp_mpi/test -j 15953741
set a breakpoint on GenericJob.run_static
and then try to run the above snippet from within the debugger it fails with exit code 1.
I also tried subprocess.run(['bash']) and then typed in the pyiron resource script, with the same results as above, i.e. it works from a normal python shell in the job directory, but fails from within the debugger at run_static.
I guess this means pyiron at some points messes with the environment, which then confuses OpenMPI.
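One way to test the environment hypothesis is to dump the environment a child process actually sees, once from a plain shell in the job directory and once from inside the pyiron wrapper, and diff the two. This is a sketch; child_env is a hypothetical helper, and the variable prefixes are the usual Open MPI/Slurm ones that are read during MPI_Init:

```python
import subprocess

def child_env():
    # Capture the environment as seen by a subprocess (like the run script).
    out = subprocess.check_output(["env"], universal_newlines=True)
    return dict(line.split("=", 1) for line in out.splitlines() if "=" in line)

env = child_env()
# Variables Open MPI / the resource manager care about; a stray or leftover
# one of these is a plausible cause of the orte_init / MPI_Init failure.
suspects = sorted(k for k in env
                  if k.startswith(("OMPI_", "ORTE_", "PMIX_", "SLURM_")))
print("MPI/Slurm-related variables:", suspects or "none")
```

Saving this output from both contexts and diffing them would pinpoint which variable the wrapper adds or drops.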
I also tested that this problem does not occur with the Intel MPI provided on the cluster. The problem there is just that the cluster-built lammps binary does not have all the plugins that the conda version has.
So @yuyuanjingxuan and I did some more debugging, and in particular we found that (manually) submitting this run script to the queue works:
#!/bin/bash
#SBATCH --partition=s.cmfe
#SBATCH --ntasks=2
#SBATCH --constraint='[swi1|swi1|swi2|swi3|swi4|swi5|swi6|swi7|swi8|swi9|swe1|swe2|swe3|swe4|swe5|swe6|swe7]'
#SBATCH --time=5760
#SBATCH --mem-per-cpu=3GB
#SBATCH --output=time.out
#SBATCH --error=error.out
#SBATCH --job-name=pi_15954124
#SBATCH --chdir=/u/zora/scratch/lmp_mpi/test_manual
#SBATCH --get-user-env=L
pwd;
echo Hostname: `hostname`
echo Date: `date`
echo JobID: $SLURM_JOB_ID
/u/system/SLES12/soft/pyiron/dev/pyiron-resources-cmmc/lammps/bin/run_lammps_2019.06.05_mpi.sh 2 1
the only change to the normal run script is the last line, which normally reads
python -m pyiron_base.cli wrapper -p /cmmc/u/zora/scratch/lmp_mpi/test_hdf5/test -j 15954132
Now the only thing the job wrapper is supposed to do is call GenericJob.run_static, which in turn mostly does the subprocess.check_output call shown above.
Yet the former works, but the latter does not. I therefore think that somewhere pyiron must be tinkering with the environment, which in turn makes OpenMPI choke.
@jan-janssen Any idea where such environment modification could take place? (Note that it can't be an issue of mpiexec vs. srun, like we suspected on Monday, since it also happens on the login node)
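A counter-test of the same hypothesis: if the wrapper's inherited environment is what breaks MPI_Init, then launching the run script from inside the wrapper with a scrubbed environment should succeed. This is a sketch, not pyiron's actual code; the command is a placeholder for the resource script, and the prefixes to strip are assumptions based on the Open MPI error above:

```python
import os
import subprocess

# Drop the MPI/Slurm bookkeeping variables that a parent launcher may have
# left behind; everything else (PATH, HOME, ...) is passed through unchanged.
clean = {k: v for k, v in os.environ.items()
         if not k.startswith(("OMPI_", "ORTE_", "PMIX_", "SLURM_"))}
out = subprocess.check_output(
    ["/bin/sh", "-c", "echo ok"],   # placeholder for run_lammps_*.sh 2 1
    env=clean,
    universal_newlines=True,
)
print(out.strip())
```

If the real script succeeds with env=clean but fails without it, the offending variable is one of the stripped ones.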
I built a small hello-world MPI program with the mpicc coming from the openmpi conda package and get the same result, i.e. a crash when run under pyiron, but not otherwise. That means it can't be lammps.
Summary
Hi all, I find that all my lammps jobs crash for unknown reasons. These jobs are submitted via pyiron. I have tested them on the command line, and there is no issue there. I also resubmitted a finished job as a test, but it produces the same error. Has anyone met the same issue?
pyiron Version and Platform
python 3.7.9, pyiron 0.4.4
Actual Behavior
Attached are the error messages:
Steps to Reproduce
Attached is the script of the test job: