radical-cybertools / ExTASY

RP workflow on ARCHER gromacs/lsdmap for (0.3.14-27-g65bc062) #240

Closed ebreitmo closed 8 years ago

ebreitmo commented 8 years ago

Hi,

I started with a clean virtualenv and ran:

pip install --upgrade git+https://github.com/radical-cybertools/radical.ensemblemd.git@master#egg=radical.ensemblemd 

then got the latest grls-on-archer.tar.gz and ran:

python extasy_gromacs_lsdmap.py --RPconfig archer.rcfg --Kconfig gromacslsdmap.wcfg

================================================================================
 EnsembleMD (0.3.14-27-g65bc062)                                                
================================================================================

Starting Allocation                                                           ok
Verifying pattern                                                             ok
Starting pattern execution                                                    ok
--------------------------------------------------------------------------------
Executing simulation-analysis loop with 1 iterations on 24 allocated core(s) on 'epsrc.archer'

Job waiting on queue...
Job is now running !
Waiting for pre_loop step to complete.                                      done
Iteration 1: Waiting for 8 simulation tasks: custom.gromacs to complete     done
Iteration 1: Waiting for analysis tasks: custom.pre_lsdmap to complete2016-02-04 09:26:25,515: radical.enmd.simulation_analysis_loop.static.default: MainProcess                     : Thread-4       : ERROR   : ComputeUnit error: STDERR: [... CONTENT SHORTENED ...]
ute of Technology, Sweden.
check out http://www.gromacs.org for more information.

GROMACS is free software; you can redistribute it and/or modify it
under the terms of the GNU Lesser General Public License
as published by the Free Software Foundation; either version 2.1
of the License, or (at your option) any later version.

GROMACS:      gmx trjconv, VERSION 5.1
Executable:   /work/y07/y07/gmx/5.1-phase2/bin/gmx
Data prefix:  /work/y07/y07/gmx/5.1-phase2
Command line:
  gmx trjconv -f tmp.gro -s tmp.gro -o tmpha.gro

Will write gro: Coordinate file in Gromos-87 format

-------------------------------------------------------
Program gmx trjconv, VERSION 5.1
Source code file: /work/y07/y07/gmx/5.1-phase2/source/src/gromacs/fileio/confio.c, line: 907

Fatal error:
gro file does not have the number of atoms on the second line
For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------

, STDOUT: [... CONTENT SHORTENED ...]
ute of Technology, Sweden.
check out http://www.gromacs.org for more information.

GROMACS is free software; you can redistribute it and/or modify it
under the terms of the GNU Lesser General Public License
as published by the Free Software Foundation; either version 2.1
of the License, or (at your option) any later version.

GROMACS:      gmx trjconv, VERSION 5.1
Executable:   /work/y07/y07/gmx/5.1-phase2/bin/gmx
Data prefix:  /work/y07/y07/gmx/5.1-phase2
Command line:
  gmx trjconv -f tmp.gro -s tmp.gro -o tmpha.gro

Will write gro: Coordinate file in Gromos-87 format

-------------------------------------------------------
Program gmx trjconv, VERSION 5.1
Source code file: /work/y07/y07/gmx/5.1-phase2/source/src/gromacs/fileio/confio.c, line: 907

Fatal error:
gro file does not have the number of atoms on the second line
For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------

2016-02-04 09:26:25,515: radical.enmd.simulation_analysis_loop.static.default: MainProcess                     : Thread-4       : ERROR   : Pattern execution FAILED.
2016-02-04 09:26:25,515: radical.pilot       : MainProcess                     : Thread-4       : ERROR   : unit manager controller thread caught system exit -- forcing application shutdown
Traceback (most recent call last):
  File "/Users/elenabreitmoser/04Feb/lib/python2.7/site-packages/radical/pilot/controller/unit_manager_controller.py", line 262, in run
    self.call_unit_state_callbacks(unit_id, new_state)
  File "/Users/elenabreitmoser/04Feb/lib/python2.7/site-packages/radical/pilot/controller/unit_manager_controller.py", line 199, in call_unit_state_callbacks
    cb(self._shared_data[unit_id]['facade_object'], new_state)
  File "/Users/elenabreitmoser/04Feb/lib/python2.7/site-packages/radical/ensemblemd/exec_plugins/simulation_analysis_loop/static.py", line 318, in unit_state_cb
    sys.exit(1)
SystemExit: 1
Execution interuptedTraceback (most recent call last):
  File "/Users/elenabreitmoser/04Feb/lib/python2.7/site-packages/radical/ensemblemd/exec_plugins/simulation_analysis_loop/static.py", line 603, in execute_pattern
    resource._umgr.wait_units(uids)
  File "/Users/elenabreitmoser/04Feb/lib/python2.7/site-packages/radical/pilot/unit_manager.py", line 698, in wait_units
    time.sleep (0.5)
KeyboardInterrupt

Starting Deallocation..
2016-02-04 09:26:36,007: radical.enmd.SingleClusterEnvironment: MainProcess                     : Thread-1       : ERROR   : Resource error: 
2016-02-04 09:26:36,007: radical.enmd.SingleClusterEnvironment: MainProcess                     : Thread-1       : ERROR   : Pattern execution FAILED.
2016-02-04 09:26:36,007: radical.pilot       : MainProcess                     : Thread-1       : ERROR   : sys.exit from callback
Traceback (most recent call last):
  File "/Users/elenabreitmoser/04Feb/lib/python2.7/site-packages/radical/pilot/controller/pilot_manager_controller.py", line 258, in call_callbacks
    cb(self._shared_data[pilot_id]['facade_object'](), new_state)
  File "/Users/elenabreitmoser/04Feb/lib/python2.7/site-packages/radical/ensemblemd/single_cluster_environment.py", line 168, in pilot_state_cb
    sys.exit(1)
SystemExit: 1
Traceback (most recent call last):
  File "extasy_gromacs_lsdmap.py", line 271, in <module>
    cluster.deallocate()
  File "/Users/elenabreitmoser/04Feb/lib/python2.7/site-packages/radical/ensemblemd/single_cluster_environment.py", line 117, in deallocate
    self._session.close(cleanup=self._cleanup)
  File "/Users/elenabreitmoser/04Feb/lib/python2.7/site-packages/radical/pilot/session.py", line 304, in close
    pmgr.close (terminate=terminate)
  File "/Users/elenabreitmoser/04Feb/lib/python2.7/site-packages/radical/pilot/pilot_manager.py", line 175, in close
    self.cancel_pilots()
  File "/Users/elenabreitmoser/04Feb/lib/python2.7/site-packages/radical/pilot/pilot_manager.py", line 579, in cancel_pilots
    self._worker.register_cancel_pilots_request(pilot_ids=pilot_ids)
  File "/Users/elenabreitmoser/04Feb/lib/python2.7/site-packages/radical/pilot/controller/pilot_manager_controller.py", line 608, in register_cancel_pilots_request
    time.sleep(0.3)
KeyboardInterrupt

On ARCHER

more unit.000009/STDERR

                   :-) GROMACS - gmx trjconv, VERSION 5.1 (-:

                            GROMACS is written by:
     Emile Apol      Rossen Apostolov  Herman J.C. Berendsen    Par Bjelkmar   
 Aldert van Buuren   Rudi van Drunen     Anton Feenstra   Sebastian Fritsch 
  Gerrit Groenhof   Christoph Junghans   Anca Hamuraru    Vincent Hindriksen
 Dimitrios Karkoulis    Peter Kasson        Jiri Kraus      Carsten Kutzner  
    Per Larsson      Justin A. Lemkul   Magnus Lundborg   Pieter Meulenhoff 
   Erik Marklund      Teemu Murtola       Szilard Pall       Sander Pronk   
   Roland Schulz     Alexey Shvetsov     Michael Shirts     Alfons Sijbers  
   Peter Tieleman    Teemu Virolainen  Christian Wennberg    Maarten Wolf   
                           and the project leaders:
        Mark Abraham, Berk Hess, Erik Lindahl, and David van der Spoel

Copyright (c) 1991-2000, University of Groningen, The Netherlands.
Copyright (c) 2001-2015, The GROMACS development team at
Uppsala University, Stockholm University and
the Royal Institute of Technology, Sweden.
check out http://www.gromacs.org for more information.

GROMACS is free software; you can redistribute it and/or modify it
under the terms of the GNU Lesser General Public License
as published by the Free Software Foundation; either version 2.1
of the License, or (at your option) any later version.

GROMACS:      gmx trjconv, VERSION 5.1
Executable:   /work/y07/y07/gmx/5.1-phase2/bin/gmx
Data prefix:  /work/y07/y07/gmx/5.1-phase2
Command line:
  gmx trjconv -f tmp.gro -s tmp.gro -o tmpha.gro

Will write gro: Coordinate file in Gromos-87 format

-------------------------------------------------------
Program gmx trjconv, VERSION 5.1
Source code file: /work/y07/y07/gmx/5.1-phase2/source/src/gromacs/fileio/confio.c, line: 907

Fatal error:
gro file does not have the number of atoms on the second line
For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------
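
The fatal error above is about the header of the input file: in the .gro format the first line is a title and the second line holds the number of atoms. A minimal sketch of a check for that header follows; the helper name `gro_atom_count` and the demo files are hypothetical and are not part of ExTASY or GROMACS. An empty file, like the zero-byte out.gro files seen later in this thread, has no second line and so fails the check.

```python
import os
import tempfile


def gro_atom_count(path):
    """Return the atom count from line 2 of a .gro file, or None if it is missing or invalid."""
    with open(path) as handle:
        handle.readline()                   # line 1: free-form title
        second = handle.readline().strip()  # line 2: number of atoms
    try:
        return int(second)
    except ValueError:
        # Covers an empty file (no second line) and a non-numeric second line.
        return None


# Demo on two throwaway files: one with a valid header, one empty.
workdir = tempfile.mkdtemp()
good = os.path.join(workdir, "good.gro")
empty = os.path.join(workdir, "empty.gro")
with open(good, "w") as handle:
    handle.write("demo title\n    3\n")
open(empty, "w").close()

print(gro_atom_count(good))
print(gro_atom_count(empty))
```

Running a check like this on tmp.gro before calling gmx trjconv would point at the bad input file rather than at trjconv itself.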
vivek-bala commented 8 years ago

Could you check the output of the simulation units (1-8) and post the contents of the shell script in any of those folders?

ebreitmo commented 8 years ago
ls -lrt unit.000001
total 1048
lrwxrwxrwx 1 ebreitmo e290     139 Feb  4 09:25 topol.top -> /fs4/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016835.0000-pilot.0000/staging_area/topol.top
lrwxrwxrwx 1 ebreitmo e290     145 Feb  4 09:25 start.gro -> /work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016835.0000-pilot.0000/unit.000000/temp/start0.gro
lrwxrwxrwx 1 ebreitmo e290     136 Feb  4 09:25 run.py -> /fs4/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016835.0000-pilot.0000/staging_area/run.py
-rwx------ 1 ebreitmo e290     763 Feb  4 09:25 radical_pilot_cu_launch_script.sh
lrwxrwxrwx 1 ebreitmo e290     140 Feb  4 09:25 grompp.mdp -> /fs4/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016835.0000-pilot.0000/staging_area/grompp.mdp
-rw------- 1 ebreitmo e290     796 Feb  4 09:25 run.sh
-rw------- 1 ebreitmo e290       0 Feb  4 09:25 out.gro
-rw------- 1 ebreitmo e290 1470464 Feb  4 09:25 core
-rw------- 1 ebreitmo e290      95 Feb  4 09:25 STDOUT
-rw------- 1 ebreitmo e290   11636 Feb  4 09:25 STDERR
more unit.000001/radical_pilot_cu_launch_script.sh 
#!/bin/sh

# Change to working directory for unit
cd /fs4/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016835.0000-pilot.0000/unit.000001
# Pre-exec commands
module load packages-archer
module load gromacs
module load python-compute/2.7.6
# Environment variables
export RP_SESSION_ID=rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016835.0000 RP_PILOT_ID=pilot.0000 RP_AGENT_ID=agent_0 RP_SPAWNER_ID=agent_0.AgentExecutingComponent.0.child RP_UNIT_ID=unit.000001
# The command to run
/opt/cray/alps/5.2.3-2.0502.9295.14.14.ari/bin/aprun -n 1 python "run.py" "--mdp" "grompp.mdp" "--gro" "start.gro" "--top" "topol.top" "--out" "out.gro" 
RETVAL=$?
# Exit the script with the return code from the command
exit $RETVAL

ls -lrt unit.000002
total 1260
lrwxrwxrwx 1 ebreitmo e290     139 Feb  4 09:25 topol.top -> /fs4/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016835.0000-pilot.0000/staging_area/topol.top
lrwxrwxrwx 1 ebreitmo e290     145 Feb  4 09:25 start.gro -> /work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016835.0000-pilot.0000/unit.000000/temp/start1.gro
lrwxrwxrwx 1 ebreitmo e290     136 Feb  4 09:25 run.py -> /fs4/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016835.0000-pilot.0000/staging_area/run.py
-rwx------ 1 ebreitmo e290     763 Feb  4 09:25 radical_pilot_cu_launch_script.sh
lrwxrwxrwx 1 ebreitmo e290     140 Feb  4 09:25 grompp.mdp -> /fs4/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016835.0000-pilot.0000/staging_area/grompp.mdp
-rw------- 1 ebreitmo e290     796 Feb  4 09:25 run.sh
-rw------- 1 ebreitmo e290       0 Feb  4 09:25 out.gro
-rw------- 1 ebreitmo e290 1470464 Feb  4 09:25 core
-rw------- 1 ebreitmo e290   11658 Feb  4 09:25 #mdout.mdp.9#
-rw------- 1 ebreitmo e290   11658 Feb  4 09:25 #mdout.mdp.8#
-rw------- 1 ebreitmo e290   11658 Feb  4 09:25 #mdout.mdp.7#
-rw------- 1 ebreitmo e290   11658 Feb  4 09:25 #mdout.mdp.6#
-rw------- 1 ebreitmo e290   11658 Feb  4 09:25 #mdout.mdp.5#
-rw------- 1 ebreitmo e290   11658 Feb  4 09:25 #mdout.mdp.4#
-rw------- 1 ebreitmo e290   11658 Feb  4 09:25 #mdout.mdp.3#
-rw------- 1 ebreitmo e290   11658 Feb  4 09:25 #mdout.mdp.2#
-rw------- 1 ebreitmo e290   11658 Feb  4 09:25 #mdout.mdp.10#
-rw------- 1 ebreitmo e290   11658 Feb  4 09:25 #mdout.mdp.1#
-rw------- 1 ebreitmo e290     563 Feb  4 09:25 STDOUT
-rw------- 1 ebreitmo e290   66111 Feb  4 09:25 STDERR
-rw------- 1 ebreitmo e290   11658 Feb  4 09:25 #mdout.mdp.12#
-rw------- 1 ebreitmo e290   11658 Feb  4 09:25 #mdout.mdp.11#
-rw------- 1 ebreitmo e290   11658 Feb  4 09:25 mdout.mdp

more unit.000002/radical_pilot_cu_launch_script.sh 
#!/bin/sh

# Change to working directory for unit
cd /fs4/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016835.0000-pilot.0000/unit.000002
# Pre-exec commands
module load packages-archer
module load gromacs
module load python-compute/2.7.6
# Environment variables
export RP_SESSION_ID=rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016835.0000 RP_PILOT_ID=pilot.0000 RP_AGENT_ID=agent_0 RP_SPAWNER_ID=agent_0.AgentExecutingComponent.0.child RP_UNIT_ID=unit.000002
# The command to run
/opt/cray/alps/5.2.3-2.0502.9295.14.14.ari/bin/aprun -n 1 python "run.py" "--mdp" "grompp.mdp" "--gro" "start.gro" "--top" "topol.top" "--out" "out.gro" 
RETVAL=$?
# Exit the script with the return code from the command
exit $RETVAL

ls -lrt unit.000003
total 1268
lrwxrwxrwx 1 ebreitmo e290     139 Feb  4 09:25 topol.top -> /fs4/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016835.0000-pilot.0000/staging_area/topol.top
lrwxrwxrwx 1 ebreitmo e290     145 Feb  4 09:25 start.gro -> /work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016835.0000-pilot.0000/unit.000000/temp/start2.gro
lrwxrwxrwx 1 ebreitmo e290     136 Feb  4 09:25 run.py -> /fs4/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016835.0000-pilot.0000/staging_area/run.py
lrwxrwxrwx 1 ebreitmo e290     140 Feb  4 09:25 grompp.mdp -> /fs4/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016835.0000-pilot.0000/staging_area/grompp.mdp
-rwx------ 1 ebreitmo e290     763 Feb  4 09:25 radical_pilot_cu_launch_script.sh
-rw------- 1 ebreitmo e290     796 Feb  4 09:26 run.sh
-rw------- 1 ebreitmo e290       0 Feb  4 09:26 out.gro
-rw------- 1 ebreitmo e290 1470464 Feb  4 09:26 core
-rw------- 1 ebreitmo e290   11658 Feb  4 09:26 #mdout.mdp.9#
-rw------- 1 ebreitmo e290   11658 Feb  4 09:26 #mdout.mdp.8#
-rw------- 1 ebreitmo e290   11658 Feb  4 09:26 #mdout.mdp.7#
-rw------- 1 ebreitmo e290   11658 Feb  4 09:26 #mdout.mdp.6#
-rw------- 1 ebreitmo e290   11658 Feb  4 09:26 #mdout.mdp.5#
-rw------- 1 ebreitmo e290   11658 Feb  4 09:26 #mdout.mdp.4#
-rw------- 1 ebreitmo e290   11658 Feb  4 09:26 #mdout.mdp.3#
-rw------- 1 ebreitmo e290   11658 Feb  4 09:26 #mdout.mdp.2#
-rw------- 1 ebreitmo e290   11658 Feb  4 09:26 #mdout.mdp.1#
-rw------- 1 ebreitmo e290       0 Feb  4 09:26 #topol.tpr.4#
-rw------- 1 ebreitmo e290       0 Feb  4 09:26 #topol.tpr.3#
-rw------- 1 ebreitmo e290       0 Feb  4 09:26 #topol.tpr.2#
-rw------- 1 ebreitmo e290       0 Feb  4 09:26 #topol.tpr.1#
-rw------- 1 ebreitmo e290       0 Feb  4 09:26 topol.tpr
-rw------- 1 ebreitmo e290    2663 Feb  4 09:26 STDOUT
-rw------- 1 ebreitmo e290   70396 Feb  4 09:26 STDERR
-rw------- 1 ebreitmo e290   11658 Feb  4 09:26 #mdout.mdp.12#
-rw------- 1 ebreitmo e290   11658 Feb  4 09:26 #mdout.mdp.11#
-rw------- 1 ebreitmo e290   11658 Feb  4 09:26 #mdout.mdp.10#
-rw------- 1 ebreitmo e290   11658 Feb  4 09:26 mdout.mdp
-rw------- 1 ebreitmo e290    3312 Feb  4 09:26 md.log
more unit.000003/radical_pilot_cu_launch_script.sh 
#!/bin/sh

# Change to working directory for unit
cd /fs4/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016835.0000-pilot.0000/unit.000003
# Pre-exec commands
module load packages-archer
module load gromacs
module load python-compute/2.7.6
# Environment variables
export RP_SESSION_ID=rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016835.0000 RP_PILOT_ID=pilot.0000 RP_AGENT_ID=agent_0 RP_SPAWNER_ID=agent_0.AgentExecutingComponent.0.child RP_UNIT_ID=unit.000003
# The command to run
/opt/cray/alps/5.2.3-2.0502.9295.14.14.ari/bin/aprun -n 1 python "run.py" "--mdp" "grompp.mdp" "--gro" "start.gro" "--top" "topol.top" "--out" "out.gro" 
RETVAL=$?
# Exit the script with the return code from the command
exit $RETVAL

ls -lrt unit.000004
total 1096
lrwxrwxrwx 1 ebreitmo e290     139 Feb  4 09:25 topol.top -> /fs4/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016835.0000-pilot.0000/staging_area/topol.top
lrwxrwxrwx 1 ebreitmo e290     145 Feb  4 09:25 start.gro -> /work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016835.0000-pilot.0000/unit.000000/temp/start3.gro
lrwxrwxrwx 1 ebreitmo e290     136 Feb  4 09:25 run.py -> /fs4/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016835.0000-pilot.0000/staging_area/run.py
-rwx------ 1 ebreitmo e290     763 Feb  4 09:25 radical_pilot_cu_launch_script.sh
lrwxrwxrwx 1 ebreitmo e290     140 Feb  4 09:25 grompp.mdp -> /fs4/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016835.0000-pilot.0000/staging_area/grompp.mdp
-rw------- 1 ebreitmo e290     796 Feb  4 09:25 run.sh
-rw------- 1 ebreitmo e290       0 Feb  4 09:25 out.gro
-rw------- 1 ebreitmo e290 1470464 Feb  4 09:25 core
-rw------- 1 ebreitmo e290      95 Feb  4 09:25 STDOUT
-rw------- 1 ebreitmo e290   61366 Feb  4 09:25 STDERR
-rw------- 1 ebreitmo e290       0 Feb  4 09:25 #mdout.mdp.1#
-rw------- 1 ebreitmo e290       0 Feb  4 09:25 mdout.mdp
more unit.000004/radical_pilot_cu_launch_script.sh 
#!/bin/sh

# Change to working directory for unit
cd /fs4/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016835.0000-pilot.0000/unit.000004
# Pre-exec commands
module load packages-archer
module load gromacs
module load python-compute/2.7.6
# Environment variables
export RP_SESSION_ID=rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016835.0000 RP_PILOT_ID=pilot.0000 RP_AGENT_ID=agent_0 RP_SPAWNER_ID=agent_0.AgentExecutingComponent.0.child RP_UNIT_ID=unit.000004
# The command to run
/opt/cray/alps/5.2.3-2.0502.9295.14.14.ari/bin/aprun -n 1 python "run.py" "--mdp" "grompp.mdp" "--gro" "start.gro" "--top" "topol.top" "--out" "out.gro" 
RETVAL=$?
# Exit the script with the return code from the command
exit $RETVAL

ls -lrt unit.000005/
total 1476
lrwxrwxrwx 1 ebreitmo e290     139 Feb  4 09:25 topol.top -> /fs4/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016835.0000-pilot.0000/staging_area/topol.top
lrwxrwxrwx 1 ebreitmo e290     145 Feb  4 09:25 start.gro -> /work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016835.0000-pilot.0000/unit.000000/temp/start4.gro
lrwxrwxrwx 1 ebreitmo e290     136 Feb  4 09:25 run.py -> /fs4/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016835.0000-pilot.0000/staging_area/run.py
lrwxrwxrwx 1 ebreitmo e290     140 Feb  4 09:25 grompp.mdp -> /fs4/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016835.0000-pilot.0000/staging_area/grompp.mdp
-rwx------ 1 ebreitmo e290     763 Feb  4 09:25 radical_pilot_cu_launch_script.sh
-rw------- 1 ebreitmo e290     796 Feb  4 09:26 run.sh
-rw------- 1 ebreitmo e290       0 Feb  4 09:26 out.gro
-rw------- 1 ebreitmo e290   11658 Feb  4 09:26 #mdout.mdp.1#
-rw------- 1 ebreitmo e290 1470464 Feb  4 09:26 core
-rw------- 1 ebreitmo e290       0 Feb  4 09:26 #traj.trr.5#
-rw------- 1 ebreitmo e290       0 Feb  4 09:26 #traj.trr.4#
-rw------- 1 ebreitmo e290       0 Feb  4 09:26 #traj.trr.3#
-rw------- 1 ebreitmo e290       0 Feb  4 09:26 #traj.trr.2#
-rw------- 1 ebreitmo e290       0 Feb  4 09:26 #traj.trr.1#
-rw------- 1 ebreitmo e290       0 Feb  4 09:26 traj.trr
-rw------- 1 ebreitmo e290    9600 Feb  4 09:26 #topol.tpr.7#
-rw------- 1 ebreitmo e290    9600 Feb  4 09:26 #topol.tpr.6#
-rw------- 1 ebreitmo e290    9600 Feb  4 09:26 #topol.tpr.5#
-rw------- 1 ebreitmo e290    9600 Feb  4 09:26 #topol.tpr.4#
-rw------- 1 ebreitmo e290    9600 Feb  4 09:26 #topol.tpr.3#
-rw------- 1 ebreitmo e290    9600 Feb  4 09:26 #topol.tpr.2#
-rw------- 1 ebreitmo e290    9600 Feb  4 09:26 #topol.tpr.1#
-rw------- 1 ebreitmo e290   11658 Feb  4 09:26 #mdout.mdp.9#
-rw------- 1 ebreitmo e290   11658 Feb  4 09:26 #mdout.mdp.8#
-rw------- 1 ebreitmo e290   11658 Feb  4 09:26 #mdout.mdp.7#
-rw------- 1 ebreitmo e290   11658 Feb  4 09:26 #mdout.mdp.6#
-rw------- 1 ebreitmo e290   11658 Feb  4 09:26 #mdout.mdp.5#
-rw------- 1 ebreitmo e290   11658 Feb  4 09:26 #mdout.mdp.4#
-rw------- 1 ebreitmo e290   11658 Feb  4 09:26 #mdout.mdp.3#
-rw------- 1 ebreitmo e290   11658 Feb  4 09:26 #mdout.mdp.2#
-rw------- 1 ebreitmo e290   11658 Feb  4 09:26 #mdout.mdp.11#
-rw------- 1 ebreitmo e290   11658 Feb  4 09:26 #mdout.mdp.10#
-rw------- 1 ebreitmo e290   13463 Feb  4 09:26 #md.log.8#
-rw------- 1 ebreitmo e290   13463 Feb  4 09:26 #md.log.7#
-rw------- 1 ebreitmo e290   13463 Feb  4 09:26 #md.log.6#
-rw------- 1 ebreitmo e290   13463 Feb  4 09:26 #md.log.5#
-rw------- 1 ebreitmo e290   13463 Feb  4 09:26 #md.log.4#
-rw------- 1 ebreitmo e290    6509 Feb  4 09:26 #md.log.3#
-rw------- 1 ebreitmo e290   13463 Feb  4 09:26 #md.log.2#
-rw------- 1 ebreitmo e290    6509 Feb  4 09:26 #md.log.1#
-rw------- 1 ebreitmo e290       0 Feb  4 09:26 #ener.edr.5#
-rw------- 1 ebreitmo e290       0 Feb  4 09:26 #ener.edr.4#
-rw------- 1 ebreitmo e290       0 Feb  4 09:26 #ener.edr.3#
-rw------- 1 ebreitmo e290       0 Feb  4 09:26 #ener.edr.2#
-rw------- 1 ebreitmo e290       0 Feb  4 09:26 #ener.edr.1#
-rw------- 1 ebreitmo e290       0 Feb  4 09:26 ener.edr
-rw------- 1 ebreitmo e290    9600 Feb  4 09:26 #topol.tpr.8#
-rw------- 1 ebreitmo e290       0 Feb  4 09:26 topol.tpr
-rw------- 1 ebreitmo e290    4307 Feb  4 09:26 STDOUT
-rw------- 1 ebreitmo e290   70059 Feb  4 09:26 STDERR
-rw------- 1 ebreitmo e290   11658 Feb  4 09:26 mdout.mdp
-rw------- 1 ebreitmo e290    6509 Feb  4 09:26 #md.log.9#
-rw------- 1 ebreitmo e290    3312 Feb  4 09:26 md.log

more unit.000005/radical_pilot_cu_launch_script.sh 
#!/bin/sh

# Change to working directory for unit
cd /fs4/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016835.0000-pilot.0000/unit.000005
# Pre-exec commands
module load packages-archer
module load gromacs
module load python-compute/2.7.6
# Environment variables
export RP_SESSION_ID=rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016835.0000 RP_PILOT_ID=pilot.0000 RP_AGENT_ID=agent_0 RP_SPAWNER_ID=agent_0.AgentExecutingComponent.0.child RP_UNIT_ID=unit.000005
# The command to run
/opt/cray/alps/5.2.3-2.0502.9295.14.14.ari/bin/aprun -n 1 python "run.py" "--mdp" "grompp.mdp" "--gro" "start.gro" "--top" "topol.top" "--out" "out.gro" 
RETVAL=$?
# Exit the script with the return code from the command
exit $RETVAL

ls -lrt unit.000006/
total 1048
lrwxrwxrwx 1 ebreitmo e290     139 Feb  4 09:25 topol.top -> /fs4/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016835.0000-pilot.0000/staging_area/topol.top
lrwxrwxrwx 1 ebreitmo e290     145 Feb  4 09:25 start.gro -> /work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016835.0000-pilot.0000/unit.000000/temp/start5.gro
lrwxrwxrwx 1 ebreitmo e290     136 Feb  4 09:25 run.py -> /fs4/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016835.0000-pilot.0000/staging_area/run.py
-rwx------ 1 ebreitmo e290     763 Feb  4 09:25 radical_pilot_cu_launch_script.sh
lrwxrwxrwx 1 ebreitmo e290     140 Feb  4 09:25 grompp.mdp -> /fs4/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016835.0000-pilot.0000/staging_area/grompp.mdp
-rw------- 1 ebreitmo e290     796 Feb  4 09:25 run.sh
-rw------- 1 ebreitmo e290       0 Feb  4 09:25 out.gro
-rw------- 1 ebreitmo e290 1470464 Feb  4 09:25 core
-rw------- 1 ebreitmo e290   10742 Feb  4 09:25 STDERR
-rw------- 1 ebreitmo e290      95 Feb  4 09:25 STDOUT
more unit.000006/radical_pilot_cu_launch_script.sh 
#!/bin/sh

# Change to working directory for unit
cd /fs4/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016835.0000-pilot.0000/unit.000006
# Pre-exec commands
module load packages-archer
module load gromacs
module load python-compute/2.7.6
# Environment variables
export RP_SESSION_ID=rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016835.0000 RP_PILOT_ID=pilot.0000 RP_AGENT_ID=agent_0 RP_SPAWNER_ID=agent_0.AgentExecutingComponent.0.child RP_UNIT_ID=unit.000006
# The command to run
/opt/cray/alps/5.2.3-2.0502.9295.14.14.ari/bin/aprun -n 1 python "run.py" "--mdp" "grompp.mdp" "--gro" "start.gro" "--top" "topol.top" "--out" "out.gro" 
RETVAL=$?
# Exit the script with the return code from the command
exit $RETVAL

ls -lrt unit.000007/
total 1240
lrwxrwxrwx 1 ebreitmo e290     139 Feb  4 09:25 topol.top -> /fs4/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016835.0000-pilot.0000/staging_area/topol.top
lrwxrwxrwx 1 ebreitmo e290     145 Feb  4 09:25 start.gro -> /work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016835.0000-pilot.0000/unit.000000/temp/start6.gro
lrwxrwxrwx 1 ebreitmo e290     136 Feb  4 09:25 run.py -> /fs4/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016835.0000-pilot.0000/staging_area/run.py
-rwx------ 1 ebreitmo e290     763 Feb  4 09:25 radical_pilot_cu_launch_script.sh
lrwxrwxrwx 1 ebreitmo e290     140 Feb  4 09:25 grompp.mdp -> /fs4/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016835.0000-pilot.0000/staging_area/grompp.mdp
-rw------- 1 ebreitmo e290     796 Feb  4 09:25 run.sh
-rw------- 1 ebreitmo e290       0 Feb  4 09:25 out.gro
-rw------- 1 ebreitmo e290 1470464 Feb  4 09:25 core
-rw------- 1 ebreitmo e290   11658 Feb  4 09:25 #mdout.mdp.9#
-rw------- 1 ebreitmo e290   11658 Feb  4 09:25 #mdout.mdp.8#
-rw------- 1 ebreitmo e290   11658 Feb  4 09:25 #mdout.mdp.7#
-rw------- 1 ebreitmo e290   11658 Feb  4 09:25 #mdout.mdp.6#
-rw------- 1 ebreitmo e290   11658 Feb  4 09:25 #mdout.mdp.5#
-rw------- 1 ebreitmo e290   11658 Feb  4 09:25 #mdout.mdp.4#
-rw------- 1 ebreitmo e290   11658 Feb  4 09:25 #mdout.mdp.3#
-rw------- 1 ebreitmo e290   11658 Feb  4 09:25 #mdout.mdp.2#
-rw------- 1 ebreitmo e290   11658 Feb  4 09:25 #mdout.mdp.10#
-rw------- 1 ebreitmo e290   11658 Feb  4 09:25 #mdout.mdp.1#
-rw------- 1 ebreitmo e290     527 Feb  4 09:25 STDOUT
-rw------- 1 ebreitmo e290   60642 Feb  4 09:25 STDERR
-rw------- 1 ebreitmo e290   11658 Feb  4 09:25 #mdout.mdp.11#
-rw------- 1 ebreitmo e290   11658 Feb  4 09:25 mdout.mdp

more unit.000007/radical_pilot_cu_launch_script.sh 
#!/bin/sh

# Change to working directory for unit
cd /fs4/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016835.0000-pilot.0000/unit.000007
# Pre-exec commands
module load packages-archer
module load gromacs
module load python-compute/2.7.6
# Environment variables
export RP_SESSION_ID=rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016835.0000 RP_PILOT_ID=pilot.0000 RP_AGENT_ID=agent_0 RP_SPAWNER_ID=agent_0.AgentExecutingComponent.0.child RP_UNIT_ID=unit.000007
# The command to run
/opt/cray/alps/5.2.3-2.0502.9295.14.14.ari/bin/aprun -n 1 python "run.py" "--mdp" "grompp.mdp" "--gro" "start.gro" "--top" "topol.top" "--out" "out.gro" 
RETVAL=$?
# Exit the script with the return code from the command
exit $RETVAL

ls -lrt unit.000008
total 1244
lrwxrwxrwx 1 ebreitmo e290     139 Feb  4 09:25 topol.top -> /fs4/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016835.0000-pilot.0000/staging_area/topol.top
lrwxrwxrwx 1 ebreitmo e290     145 Feb  4 09:25 start.gro -> /work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016835.0000-pilot.0000/unit.000000/temp/start7.gro
lrwxrwxrwx 1 ebreitmo e290     136 Feb  4 09:25 run.py -> /fs4/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016835.0000-pilot.0000/staging_area/run.py
lrwxrwxrwx 1 ebreitmo e290     140 Feb  4 09:25 grompp.mdp -> /fs4/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016835.0000-pilot.0000/staging_area/grompp.mdp
-rwx------ 1 ebreitmo e290     763 Feb  4 09:25 radical_pilot_cu_launch_script.sh
-rw------- 1 ebreitmo e290     796 Feb  4 09:26 run.sh
-rw------- 1 ebreitmo e290       0 Feb  4 09:26 out.gro
-rw------- 1 ebreitmo e290   11658 Feb  4 09:26 #mdout.mdp.7#
-rw------- 1 ebreitmo e290   11658 Feb  4 09:26 #mdout.mdp.6#
-rw------- 1 ebreitmo e290   11658 Feb  4 09:26 #mdout.mdp.5#
-rw------- 1 ebreitmo e290   11658 Feb  4 09:26 #mdout.mdp.4#
-rw------- 1 ebreitmo e290   11658 Feb  4 09:26 #mdout.mdp.3#
-rw------- 1 ebreitmo e290   11658 Feb  4 09:26 #mdout.mdp.2#
-rw------- 1 ebreitmo e290   11658 Feb  4 09:26 #mdout.mdp.1#
-rw------- 1 ebreitmo e290 1470464 Feb  4 09:26 core
-rw------- 1 ebreitmo e290   63445 Feb  4 09:26 STDERR
-rw------- 1 ebreitmo e290   11658 Feb  4 09:26 #mdout.mdp.9#
-rw------- 1 ebreitmo e290   11658 Feb  4 09:26 #mdout.mdp.8#
-rw------- 1 ebreitmo e290   11658 Feb  4 09:26 #mdout.mdp.11#
-rw------- 1 ebreitmo e290   11658 Feb  4 09:26 #mdout.mdp.10#
-rw------- 1 ebreitmo e290   11658 Feb  4 09:26 mdout.mdp
-rw------- 1 ebreitmo e290     527 Feb  4 09:26 STDOUT

more unit.000008/radical_pilot_cu_launch_script.sh 
#!/bin/sh

# Change to working directory for unit
cd /fs4/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016835.0000-pilot.0000/unit.000008
# Pre-exec commands
module load packages-archer
module load gromacs
module load python-compute/2.7.6
# Environment variables
export RP_SESSION_ID=rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016835.0000 RP_PILOT_ID=pilot.0000 RP_AGENT_ID=agent_0 RP_SPAWNER_ID=agent_0.AgentExecutingComponent.0.child RP_UNIT_ID=unit.000008
# The command to run
/opt/cray/alps/5.2.3-2.0502.9295.14.14.ari/bin/aprun -n 1 python "run.py" "--mdp" "grompp.mdp" "--gro" "start.gro" "--top" "topol.top" "--out" "out.gro" 
RETVAL=$?
# Exit the script with the return code from the command
exit $RETVAL
vivek-bala commented 8 years ago

Are the out.gro files in unit.000004 and unit.000006 empty?

ebreitmo commented 8 years ago

They are both empty!

vivek-bala commented 8 years ago

OK, I believe this is the same as #226.

ibethune commented 8 years ago

OK, I did some digging into this (since I can also recreate it, sometimes).

The root cause of the failures is not in the pre_lsdmap CU, but in the earlier gromacs CUs. pre_lsdmap only fails if the out*.gro files linked from the gromacs CUs are all empty. In my testing I saw various units failing, e.g.

e290ib@eslogin008:/work/e290/e290/e290ib/radical.pilot.sandbox/rp.session.mbp-ib.epcc.ed.ac.uk.ibethune.016841.0001-pilot.0000/unit.000009> wc -l out*gro
    0 out0.gro
  200 out1.gro
    0 out2.gro
    0 out3.gro
   75 out4.gro
    0 out5.gro
    0 out6.gro
  150 out7.gro
  425 total

So the first thing is that there is an error-detection issue here. The CUs that produce no output are failing, and we should fail those CUs directly rather than waiting for downstream CUs to fail. This is a bug in the run.py and run.sh scripts, which do not capture the return codes from gromacs. Please fix!
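A minimal sketch of the kind of check being asked for, in Python since run.py is a Python script. The contents of run.py are not shown in this thread, so the helper names `run_step`, `require_nonempty` and `main` are hypothetical; the only idea illustrated is to stop at the first non-zero return code and to treat a missing or empty output file as a failure.

```python
import os
import subprocess
import sys


def run_step(cmd):
    """Run one command (an argv list) and raise if it exits non-zero."""
    ret = subprocess.call(cmd)
    # A negative value means the process was killed by a signal (e.g. Aborted).
    if ret != 0:
        raise RuntimeError("command failed with exit code %d: %s"
                           % (ret, " ".join(cmd)))


def require_nonempty(path):
    """Raise if an expected output file is missing or empty."""
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        raise RuntimeError("expected output %s is missing or empty" % path)


def main(steps, outputs):
    """Hypothetical driver: run each step in order, then check the outputs.

    Returns 0 on success and 1 on the first failure, so the caller can pass
    the value to sys.exit() and let the CU itself be marked as failed.
    """
    try:
        for cmd in steps:
            run_step(cmd)
        for path in outputs:
            require_nonempty(path)
    except (RuntimeError, OSError) as exc:
        sys.stderr.write("%s\n" % exc)
        return 1
    return 0
```

In a script like run.py this could be called as `sys.exit(main(steps, ["out.gro"]))`, where `steps` holds the gromacs command lines the script already runs. With that in place the launch script's `exit $RETVAL` would carry a non-zero code whenever gromacs fails.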

Second thing is what is causing gromacs to fail in the first place?

I have pasted the STDERR from one of the failing CUs here: https://gist.github.com/ibethune/5a1ee869e0e0356ac3ff

This appears to be the same thing that was raised in issue #238 - so we can either re-open that one, or track it here, I don't mind.

Not sure of the root cause of that, but I note that importing MPI with the python-compute module won't work in an interactive session. MPI can only be initialised inside a call to aprun, i.e. once the parallel environment has been created:

$ module load python-compute/2.7.6
$ python
Python 2.7.6 (default, Mar 10 2014, 14:13:45) 
[GCC 4.8.1 20130531 (Cray Inc.)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from mpi4py import MPI
[Wed Feb  3 20:35:39 2016] [unknown] Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(506): 
MPID_Init(192).......: channel initialization failed
MPID_Init(569).......:  PMI2 init failed: 1
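
Because the failure above is a fatal error inside MPI initialisation rather than a Python exception, a `try`/`except` around the import would not help. One way to test for it, sketched here as an assumption and not something used in ExTASY, is to attempt the import in a child interpreter and look at its return code. The helper name `import_works` is hypothetical.

```python
import os
import subprocess
import sys


def import_works(module, python=sys.executable):
    """Return True if 'import <module>' exits cleanly in a child interpreter.

    The import runs in a separate process, so a fatal abort during the
    import (such as a failed MPI initialisation) only kills the child.
    """
    with open(os.devnull, "w") as devnull:
        ret = subprocess.call([python, "-c", "import %s" % module],
                              stdout=devnull, stderr=devnull)
    return ret == 0
```

For example, `import_works("mpi4py.MPI")` would be expected to return False on a login node in the situation shown above, since the child process aborts during the import.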

vivek-bala commented 8 years ago

run.sh: https://gist.github.com/vivek-bala/450e248aac1bb1672a30
pbs script: https://gist.github.com/vivek-bala/b20e3b93548e7fccb2b9

I can recreate the problem with just the pbs script. Running "run.sh" from the pbs script produces that error at the mdrun command. However, "/bin/bash run.sh" succeeds when run from the command line. The non-MPI mdrun executable is being used in all cases.

vivek-bala commented 8 years ago

Running the gromacs commands directly using aprun works: https://gist.github.com/vivek-bala/1d6ac234883f120ce33c.

vivek-bala commented 8 years ago

Error from the first method:

                            GROMACS is written by:
     Emile Apol      Rossen Apostolov  Herman J.C. Berendsen    Par Bjelkmar
 Aldert van Buuren   Rudi van Drunen     Anton Feenstra   Sebastian Fritsch
  Gerrit Groenhof   Christoph Junghans   Anca Hamuraru    Vincent Hindriksen
 Dimitrios Karkoulis    Peter Kasson        Jiri Kraus      Carsten Kutzner
    Per Larsson      Justin A. Lemkul   Magnus Lundborg   Pieter Meulenhoff
   Erik Marklund      Teemu Murtola       Szilard Pall       Sander Pronk
   Roland Schulz     Alexey Shvetsov     Michael Shirts     Alfons Sijbers
   Peter Tieleman    Teemu Virolainen  Christian Wennberg    Maarten Wolf
                           and the project leaders:
        Mark Abraham, Berk Hess, Erik Lindahl, and David van der Spoel

Copyright (c) 1991-2000, University of Groningen, The Netherlands.
Copyright (c) 2001-2015, The GROMACS development team at
Uppsala University, Stockholm University and
the Royal Institute of Technology, Sweden.
check out http://www.gromacs.org for more information.

GROMACS is free software; you can redistribute it and/or modify it
under the terms of the GNU Lesser General Public License
as published by the Free Software Foundation; either version 2.1
of the License, or (at your option) any later version.

GROMACS:      gmx mdrun, VERSION 5.1
Executable:   /work/y07/y07/gmx/5.1-phase2/bin/gmx
Data prefix:  /work/y07/y07/gmx/5.1-phase2
Command line:
  gmx mdrun -nt 1 -s topol.tpr -o traj.trr -e ener.edr

-------------------------------------------------------
Program:     gmx mdrun, VERSION 5.1
Source file: src/gromacs/commandline/cmdlineparser.cpp (line 234)
Function:    void gmx::CommandLineParser::parse(int*, char**)

Error in user input:
Invalid command-line options
  In command-line option -s
    File 'topol.tpr' does not exist or is not accessible.

For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------
cat: confout.gro: No such file or directory
Replacing old mdp entry 'nstxtcout' by 'nstxout-compressed'
Replacing old mdp entry 'xtc_grps' by 'compressed-x-grps'
Setting the LD random seed to 1379599184
Wed Feb 10 16:46:54 2016: [unset]:_pmi_alps_sync:alps response not OKAY
Wed Feb 10 16:46:54 2016: [unset]:_pmiu_daemon:_pmi_alps_sync failed
run.sh: line 15: 20418 Aborted                 gmx grompp -f grompp.mdp -c $tmpstartgro -p topol.top -o topol.tpr
Wed Feb 10 16:46:54 2016: [PE_0]:_pmi_daemon_barrier:PE pipe read failed from daemon errno = Success
Wed Feb 10 16:46:54 2016: [PE_0]:_pmi_init:_pmi_daemon_barrier returned -1

ibethune commented 8 years ago

FYI, I am in touch with the Cray/ARCHER team, and we're looking into this.

ibethune commented 8 years ago

Root cause is some stray PMI libraries linked into the 'serial' gmx binary. I have installed a fixed build, but I am still waiting for the central install to be updated. For now, in kernel_defs/gromacs.py you can replace

module load gromacs

with

export PATH=$PATH:/work/z01/shared/gromacs-5.1.2/bin

Will update the ticket when a final solution is in place.
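
As an illustration of the temporary workaround, here is a small Python sketch. It assumes that the pre-exec commands for ARCHER are held as a list of strings; the layout of kernel_defs/gromacs.py is not shown in this thread, so `ORIGINAL_PRE_EXEC` and `apply_workaround` are hypothetical names. The three commands are the ones that appear in the pre-exec section of the launch script earlier in this issue.

```python
# Pre-exec commands as they appear in radical_pilot_cu_launch_script.sh.
ORIGINAL_PRE_EXEC = [
    "module load packages-archer",
    "module load gromacs",
    "module load python-compute/2.7.6",
]

# Replacement line quoted from the workaround described above.
FIXED_GMX_PATH = "export PATH=$PATH:/work/z01/shared/gromacs-5.1.2/bin"


def apply_workaround(pre_exec):
    """Return a copy of pre_exec with 'module load gromacs' swapped out."""
    return [FIXED_GMX_PATH if cmd.strip() == "module load gromacs" else cmd
            for cmd in pre_exec]
```

Only the exact `module load gromacs` entry is replaced; the other module loads are left untouched and in the same order.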

vivek-bala commented 8 years ago

Outdated.