pyiron / pyiron_atomistics

pyiron_atomistics - an integrated development environment (IDE) for atomistic simulation in computational materials science.
https://pyiron-atomistics.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Jobs fail when submitted via pyiron but run without errors on the command line #377

Open - yuyuanjingxuan opened this issue 3 years ago

yuyuanjingxuan commented 3 years ago

Summary

Hi all,

I find that all my LAMMPS jobs crash for unknown reasons. These jobs are submitted via pyiron. I have tested them on the command line, and there is no issue there. I also resubmitted an already finished job as a test, but it produces the same error. Has anyone else run into this issue?

pyiron Version and Platform

python 3.7.9, pyiron 0.4.4

Actual Behavior

Attached are the error messages:

2021-09-30 13:13:32,172 - pyiron_log - INFO - job: water_slow id: 15939411, status: submitted, run job (modal)
2021-09-30 13:13:32,285 - pyiron_log - WARNING - Job aborted
2021-09-30 13:13:32,285 - pyiron_log - WARNING - Job aborted
2021-09-30 13:13:32,285 - pyiron_log - WARNING - Job aborted
2021-09-30 13:13:32,285 - pyiron_log - WARNING - 
2021-09-30 13:13:32,285 - pyiron_log - WARNING - 
2021-09-30 13:13:32,285 - pyiron_log - WARNING - 
Traceback (most recent call last):
  File "/u/system/SLES12/soft/pyiron/dev/anaconda3/lib/python3.7/site-packages/pyiron_base/job/generic.py", line 735, in run_static
    universal_newlines=True,
  File "/u/system/SLES12/soft/pyiron/dev/anaconda3/lib/python3.7/subprocess.py", line 411, in check_output
    **kwargs).stdout
  File "/u/system/SLES12/soft/pyiron/dev/anaconda3/lib/python3.7/subprocess.py", line 512, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/u/zhewa/packages/pyiron-resources-cmmc/lammps/bin/run_lammps_2019.06.05_default_mpi.sh', '2', '1']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/u/system/SLES12/soft/pyiron/dev/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/u/system/SLES12/soft/pyiron/dev/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/u/system/SLES12/soft/pyiron/dev/anaconda3/lib/python3.7/site-packages/pyiron_base/cli/__main__.py", line 2, in <module>
    main()
  File "/u/system/SLES12/soft/pyiron/dev/anaconda3/lib/python3.7/site-packages/pyiron_base/cli/__init__.py", line 61, in main
    args.cli(args)
  File "/u/system/SLES12/soft/pyiron/dev/anaconda3/lib/python3.7/site-packages/pyiron_base/cli/wrapper.py", line 39, in main
    submit_on_remote=args.submit
  File "/u/system/SLES12/soft/pyiron/dev/anaconda3/lib/python3.7/site-packages/pyiron_base/job/wrapper.py", line 158, in job_wrapper_function
    job.run()
  File "/u/system/SLES12/soft/pyiron/dev/anaconda3/lib/python3.7/site-packages/pyiron_base/job/wrapper.py", line 117, in run
    self.job.run_static()
  File "/u/system/SLES12/soft/pyiron/dev/anaconda3/lib/python3.7/site-packages/pyiron_base/job/generic.py", line 756, in run_static
    raise RuntimeError("Job aborted")
RuntimeError: Job aborted

Steps to Reproduce

Attached is the script of the test job:

import numpy as np
from ase import units  # units.mol (Avogadro's number) is used below
from pyiron.project import Project

pr = Project("tip3p_water")

density = 1.0e-24  # g/A^3
n_mols = 27
mol_mass_water = 18.015 # g/mol
# Determining the supercell size
mass = mol_mass_water * n_mols / units.mol  # g
vol_h2o = mass / density # in A^3
a = vol_h2o ** (1./3.) # A
# Constructing the unitcell
n = int(round(n_mols ** (1. / 3.)))

dx = 0.7
r_O = [0, 0, 0]
r_H1 = [dx, dx, 0]
r_H2 = [-dx, dx, 0]
unit_cell = (a / n) * np.eye(3)
water = pr.create_atoms(elements=['H', 'H', 'O'], 
                        positions=[r_H1, r_H2, r_O], 
                        cell=unit_cell)
water.set_repeat([n, n, n])

job_name = "water_slow"
ham = pr.create_job("Lammps", job_name)
ham.structure = water
ham.potential = 'H2O_tip3p'
ham.calc_md(temperature=300, 
            n_ionic_steps=1e4, 
            time_step=0.01)
ham.executable.version = '2019.06.05_default_mpi'
ham.server.cores = 2
ham.server.queue = 'cm'
ham.run()
niklassiemer commented 3 years ago

Hi @yuyuanjingxuan

subprocess.CalledProcessError: Command '['/u/zhewa/packages/pyiron-resources-cmmc/lammps/bin/run_lammps_2019.06.05_default_mpi.sh', '2', '1']' returned non-zero exit status 1.

You seem to be using your own resources and, thus, your own run scripts. Could you try to run your calculation via this run script directly, or compare it to the global version?
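
In case it helps, a minimal sketch (using the project and job names from the reproduction script above) of how to check which run script pyiron actually resolved; executable_path is the same attribute that appears in the traceback:

from pyiron.project import Project

pr = Project("tip3p_water")
ham = pr.load("water_slow")            # reload the failed job by name
print(ham.executable.version)          # e.g. '2019.06.05_default_mpi'
print(ham.executable.executable_path)  # the run_lammps_*.sh that gets executed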

yuyuanjingxuan commented 3 years ago

Hi @yuyuanjingxuan

subprocess.CalledProcessError: Command '['/u/zhewa/packages/pyiron-resources-cmmc/lammps/bin/run_lammps_2019.06.05_default_mpi.sh', '2', '1']' returned non-zero exit status 1.

You seem to be using your own resources and, thus, your own run scripts. Could you try to run your calculation via this run script directly, or compare it to the global version?

Yes, I tried that. Here is the content of run_lammps_2019.06.05_default_mpi.sh:

#!/bin/bash                                                                                                                                                     
module purge                                                                                                                                                    
module load pyiron/dev                                                                                                                                          
mpiexec -n $1 lmp_mpi -in control.inp 

Sorry for the misunderstanding. By "command line" I meant running the commands in this script. For the last line, I used mpiexec -n 2 lmp_mpi -in control.inp instead, and the job ran without any issues.

max-hassani commented 3 years ago

@yuyuanjingxuan, do you use a custom build of lammps, instead of the one installed via conda?

yuyuanjingxuan commented 3 years ago

@yuyuanjingxuan, do you use a custom build of lammps, instead of the one installed via conda?

No, the version I use is in /u/system/SLES12/soft/pyiron/dev/anaconda3/bin/, which should be the conda-installed one.

yuyuanjingxuan commented 3 years ago

In reality, I did not change my local environment at all. These jobs were submitted one week ago. I am confused that the jobs run last week produced no errors, while those run yesterday failed. None of them survived.

max-hassani commented 3 years ago

@yuyuanjingxuan, I don't know if this is the cause of the issue, but it seems that you have changed the path to the pyiron resources in your ~/.pyiron file. Please make sure that RESOURCE_PATHS = /u/system/SLES12/soft/pyiron/dev/pyiron-resources-cmmc
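
For reference, a ~/.pyiron along these lines points back to the global resources (the [DEFAULT] layout follows the pyiron documentation; only the RESOURCE_PATHS value is taken from this thread):

[DEFAULT]
RESOURCE_PATHS = /u/system/SLES12/soft/pyiron/dev/pyiron-resources-cmmc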

niklassiemer commented 3 years ago

Hi @yuyuanjingxuan

subprocess.CalledProcessError: Command '['/u/zhewa/packages/pyiron-resources-cmmc/lammps/bin/run_lammps_2019.06.05_default_mpi.sh', '2', '1']' returned non-zero exit status 1.

You seem to be using your own resources and, thus, your own run scripts. Could you try to run your calculation via this run script directly, or compare it to the global version?

Yes, I tried that. Here is the content of run_lammps_2019.06.05_default_mpi.sh:

#!/bin/bash                                                                                                                                                     
module purge                                                                                                                                                    
module load pyiron/dev                                                                                                                                          
mpiexec -n $1 lmp_mpi -in control.inp 

Sorry for the misunderstanding. By "command line" I meant running the commands in this script. For the last line, I used mpiexec -n 2 lmp_mpi -in control.inp instead, and the job ran without any issues.

In the global resources we have an additional option to the run command. We use

mpiexec -n $1 --oversubscribe lmp_mpi -in control.inp;
yuyuanjingxuan commented 3 years ago

Hi @yuyuanjingxuan

subprocess.CalledProcessError: Command '['/u/zhewa/packages/pyiron-resources-cmmc/lammps/bin/run_lammps_2019.06.05_default_mpi.sh', '2', '1']' returned non-zero exit status 1.

You seem to be using your own resources and, thus, your own run scripts. Could you try to run your calculation via this run script directly, or compare it to the global version?

Yes, I tried that. Here is the content of run_lammps_2019.06.05_default_mpi.sh:

#!/bin/bash                                                                                                                                                     
module purge                                                                                                                                                    
module load pyiron/dev                                                                                                                                          
mpiexec -n $1 lmp_mpi -in control.inp 

Sorry for the misunderstanding. By "command line" I meant running the commands in this script. For the last line, I used mpiexec -n 2 lmp_mpi -in control.inp instead, and the job ran without any issues.

In the global resources we have an additional option to the run command. We use

mpiexec -n $1 --oversubscribe lmp_mpi -in control.inp;

I tried adding --oversubscribe to run_lammps_2019.06.05_default_mpi.sh, and it still produces the same error messages. The job runs when I execute run_lammps_2019.06.05_default_mpi.sh directly, but it fails when I run run_queue.sh in the local directory. Here is run_queue.sh:

#!/bin/bash
#SBATCH --partition=s.cmfe                                        
#SBATCH --ntasks=2                                                
#SBATCH --constraint='[swi1|swi1|swi2|swi3|swi4|swi5|swi6|swi7|swi8|swi9|swe1|swe2|swe3|swe4|swe5|swe6|swe7]'
#SBATCH --time=5760                                               
#SBATCH --mem-per-cpu=3GB                                         
#SBATCH --output=time.out                                         
#SBATCH --error=error.out                                         
#SBATCH --job-name=pi_job-id                                 
#SBATCH --chdir=path-of-job
#SBATCH --get-user-env=L              
python -m pyiron_base.cli wrapper -p path-of-job -j job-id

Could the problem be in run_queue.sh?

yuyuanjingxuan commented 3 years ago

I also tried to run my job with the global version. Before running, I updated my resources:

yuyuanjingxuan commented 3 years ago

We found where the issue comes from. It is still a module environment problem. I unloaded all local modules and kept the submission scripts the same as the global version. Then everything worked fine.

niklassiemer commented 3 years ago

I am still curious how the other modules killed the workflow, but I am glad that things work again!

pmrv commented 3 years ago

Agree with @niklassiemer: if the issue was that the submission or resource scripts didn't purge the module environment thoroughly enough, we should change them to be more robust.

@yuyuanjingxuan Can you detail the exact steps you took to fix the issue?

yuyuanjingxuan commented 3 years ago

Yes, sure, @pmrv. Attached are the steps I took to fix the issue.

P.S. The issue is not fully resolved. I recently found that there is no issue with the serial version of LAMMPS. For the parallel version (with only two cores), the job works if I do not set a queue, while it still produces the same errors if I set queue='cm'.

yuyuanjingxuan commented 3 years ago

I sent the examples to the LAMMPS users in our group, and I am not the only one who has hit this issue. Jobs run with the parallel version of LAMMPS and submitted to the cm queue always crash. Attached are the error messages.

It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  getting local rank failed
  --> Returned value No permission (-17) instead of ORTE_SUCCESS
...
...

2021-10-07 09:48:40,257 - pyiron_log - WARNING - Job aborted
2021-10-07 09:48:40,257 - pyiron_log - WARNING - Job aborted
2021-10-07 09:48:40,257 - pyiron_log - WARNING - Job aborted
2021-10-07 09:48:40,258 - pyiron_log - WARNING - *** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[cmti148:39772] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[cmti148:39771] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
...
...

Traceback (most recent call last):
  File "/u/system/SLES12/soft/pyiron/dev/anaconda3/lib/python3.7/site-packages/pyiron_base/job/generic.py", line 735, in run_static
    universal_newlines=True,
  File "/u/system/SLES12/soft/pyiron/dev/anaconda3/lib/python3.7/subprocess.py", line 411, in check_output
    **kwargs).stdout
  File "/u/system/SLES12/soft/pyiron/dev/anaconda3/lib/python3.7/subprocess.py", line 512, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/u/system/SLES12/soft/pyiron/dev/pyiron-resources-cmmc/lammps/bin/run_lammps_2019.06.05_default_mpi.sh', '2', '1']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/u/system/SLES12/soft/pyiron/dev/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/u/system/SLES12/soft/pyiron/dev/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/u/system/SLES12/soft/pyiron/dev/anaconda3/lib/python3.7/site-packages/pyiron_base/cli/__main__.py", line 2, in <module>
    main()
  File "/u/system/SLES12/soft/pyiron/dev/anaconda3/lib/python3.7/site-packages/pyiron_base/cli/__init__.py", line 61, in main
    args.cli(args)
  File "/u/system/SLES12/soft/pyiron/dev/anaconda3/lib/python3.7/site-packages/pyiron_base/cli/wrapper.py", line 39, in main
    submit_on_remote=args.submit
  File "/u/system/SLES12/soft/pyiron/dev/anaconda3/lib/python3.7/site-packages/pyiron_base/job/wrapper.py", line 158, in job_wrapper_function
    job.run()
  File "/u/system/SLES12/soft/pyiron/dev/anaconda3/lib/python3.7/site-packages/pyiron_base/job/wrapper.py", line 117, in run
    self.job.run_static()
  File "/u/system/SLES12/soft/pyiron/dev/anaconda3/lib/python3.7/site-packages/pyiron_base/job/generic.py", line 756, in run_static
    raise RuntimeError("Job aborted")
RuntimeError: Job aborted
[cmti148:39742] 1 more process has sent help message help-orte-runtime.txt / orte_init:startup:internal-failure
[cmti148:39742] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[cmti148:39742] 1 more process has sent help message help-orte-runtime / orte_init:startup:internal-failure
[cmti148:39742] 1 more process has sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure
slurmstepd: error: *** JOB 2455220 ON cmti148 CANCELLED AT 2021-10-07T09:50:44 DUE TO TIME LIMIT ***

The current LAMMPS versions were last updated on Aug 15. We suspect that some of the libraries LAMMPS relies on might have changed recently. Could that be the reason?

niklassiemer commented 3 years ago

It could be related to, e.g., the MPI library (the executable dies in MPI_Init); however, the jobs do run without pyiron? Could you quickly check which of the following cases are working?

The job runs

pmrv commented 3 years ago

I think I'm seeing the same problem now and will try to investigate.

pmrv commented 3 years ago

In my test case it even fails when run with run_mode='manual' and then trying to run the job from the CLI. Copying the global lammps resource script into the very same folder and running it works, though...

pmrv commented 3 years ago

Serial jobs run both on the queue and on the login node, so it must be something with the MPI libraries.

pmrv commented 2 years ago

A short summary of today's findings:

I drilled a little bit into how pyiron launches the script:

out = subprocess.check_output(
    [
        self.executable.executable_path,
        str(self.server.cores),
        str(self.server.threads),
    ],
    cwd=self.project_hdf5.working_directory,
    shell=False,
    stderr=subprocess.STDOUT,
    universal_newlines=True,
)

Interestingly, when I run this exact snippet in a normal Python shell in the working directory of the job (with the variables substituted), it runs without problems. Yet when I run the pyiron job wrapper under a debugger, i.e.

python -m pdb -m pyiron_base.cli wrapper -p /cmmc/u/zora/scratch/lmp_mpi/test -j 15953741

set a breakpoint on GenericJob.run_static, and then try to run the above snippet from within the debugger, it fails with exit code 1.

I also tried subprocess.run(['bash']) and then typed in the pyiron resource script, with the same result as above, i.e. it works from a normal Python shell in the job directory but fails from within the debugger at run_static.

I guess this means pyiron at some point messes with the environment, which then confuses OpenMPI.
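
A small sketch of how one could test that hypothesis (a suggestion only, nothing that has been run here): dump the environment once from the debugger session stopped at run_static and once from a plain python shell in the job directory, then diff the two files. OpenMPI is sensitive to the ORTE_*/OMPI_*/PMI_* variables, so differences there would be the first suspects.

import json
import os

# Write the current environment to a file named after the process id, so the
# dump from the wrapper and the dump from the plain shell do not overwrite each other.
with open("env_dump_%d.json" % os.getpid(), "w") as f:
    json.dump(dict(os.environ), f, indent=2, sort_keys=True)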

pmrv commented 2 years ago

I also tested that this problem does not occur with the Intel MPI provided on the cluster. The problem there is just that the cluster-built LAMMPS binary does not have all the plugins that the conda version has.

pmrv commented 2 years ago

So @yuyuanjingxuan and I did some more debugging, and in particular we found that (manually) submitting this run script to the queue works:

#!/bin/bash
#SBATCH --partition=s.cmfe
#SBATCH --ntasks=2
#SBATCH --constraint='[swi1|swi1|swi2|swi3|swi4|swi5|swi6|swi7|swi8|swi9|swe1|swe2|swe3|swe4|swe5|swe6|swe7]'
#SBATCH --time=5760
#SBATCH --mem-per-cpu=3GB
#SBATCH --output=time.out
#SBATCH --error=error.out
#SBATCH --job-name=pi_15954124
#SBATCH --chdir=/u/zora/scratch/lmp_mpi/test_manual
#SBATCH --get-user-env=L

pwd;
echo Hostname: `hostname`
echo Date: `date`
echo JobID: $SLURM_JOB_ID

/u/system/SLES12/soft/pyiron/dev/pyiron-resources-cmmc/lammps/bin/run_lammps_2019.06.05_mpi.sh 2 1

The only change compared to the normal run script is the last line, which normally reads:

python -m pyiron_base.cli wrapper -p /cmmc/u/zora/scratch/lmp_mpi/test_hdf5/test -j 15954132

Now the only thing the job wrapper is supposed to be doing is calling GenericJob.run_static, which in turn mostly does

https://github.com/pyiron/pyiron_base/blob/4e70a2d0b82ad9c2630cf7b459efd19a8416596b/pyiron_base/job/generic.py#L726

Yet the former works, but the latter does not. I therefore think that somewhere pyiron must be tinkering with the environment, which in turn makes OpenMPI choke.

@jan-janssen Any idea where such environment modification could take place? (Note that it can't be an issue of mpiexec vs. srun, like we suspected on Monday, since it also happens on the login node)

pmrv commented 2 years ago

I built a small hello-world MPI program with the mpicc from the openmpi conda package and got the same result, i.e. it crashes when run under pyiron, but not otherwise. That means it can't be LAMMPS.
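
For completeness, an equivalent check in Python via mpi4py (a sketch only, not the mpicc-built C program referred to above, but importing mpi4py triggers MPI_Init in the same way):

# hello_mpi.py -- run with: mpiexec -n 2 python hello_mpi.py
from mpi4py import MPI  # importing mpi4py.MPI initializes MPI

comm = MPI.COMM_WORLD
print("Hello from rank %d of %d" % (comm.Get_rank(), comm.Get_size()))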