Closed ebreitmo closed 8 years ago
The title says RP 0.3.14, but that sounds like an EnsembleMD version. What version of RP is this? (i.e. what's the output of radicalpilot-version?)
The content of the pilot sandbox on ARCHER would be helpful (/work/e290/e290/$USER/radical.pilot.sandbox/$RP_SESSION)
Hypothesizing about possible failure modes, you could try to re-run with 48 pilot cores.
a) radicalpilot-version 0.37.10
b) On ARCHER:
ls -lrt
total 1316
-rwxr-xr-x 1 ebreitmo e290 47400 Nov 24 12:01 bootstrap_1.sh
-rw-r--r-- 1 ebreitmo e290 809859 Nov 24 12:01 saga-python-0.38.1.tar.gz
-rw-r--r-- 1 ebreitmo e290 104469 Nov 24 12:01 radical.utils-0.38.tar.gz
-rw-r--r-- 1 ebreitmo e290 230829 Nov 24 12:01 radical.pilot-0.37.10.tar.gz
-rw------- 1 ebreitmo e290 2439 Nov 24 12:01 agent_0.cfg
drwx--S--- 6 ebreitmo e290 4096 Nov 24 12:03 radical.utils-0.38
drwx--S--- 5 ebreitmo e290 4096 Nov 24 12:04 saga-python-0.38.1
drwx--S--- 5 ebreitmo e290 4096 Nov 24 12:04 radical.pilot-0.37.10
drwx--S--- 5 ebreitmo e290 4096 Nov 24 12:07 rp_install
-rwxr-xr-x 1 ebreitmo e290 1692 Nov 24 12:09 bootstrap_2.sh
-rw------- 1 ebreitmo e290 5 Nov 24 12:09 agent_0.bootstrap_2.out
-rw------- 1 ebreitmo e290 0 Nov 24 12:09 agent_0.bootstrap_2.err
-rw------- 1 ebreitmo e290 3633 Nov 24 12:09 agent_0.bootstrap_3.log
-rw------- 1 ebreitmo e290 821 Nov 24 12:23 bootstrap_1.err
-rw------- 1 ebreitmo e290 1401 Nov 24 12:23 agent_0.err
-rw------- 1 ebreitmo e290 11196 Nov 24 12:23 agent_0.out
-rw------- 1 ebreitmo e290 94134 Nov 24 12:23 bootstrap_1.out
c) python extasy_gromacs_lsdmap.py --RPconfig archer.rcfg --Kconfig gromacslsdmap.wcfg
Starting Allocation ok
Verifying pattern ok
Executing simulation-analysis loop with 1 iterations on 48 allocated core(s) on 'epsrc.archer'
Job waiting on queue...
-rw------- 1 ebreitmo e290 11196 Nov 24 12:23 agent_0.out
Can you post the contents of this file?
PYTHONPATH: ['/fs4/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016763.0001-pilot
.0000/rp_install/bin', '/work/y07/y07/cse/python/modules/cython/0.21.1/lib/python2.7/site-packages/Cython-0.21.1-py2.7-linux-
x86_64.egg', '/work/e290/shared/shared_pilot_ve_20150924/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg', '/work/e29
0/shared/shared_pilot_ve_20150924/lib/python2.7/site-packages/pip-1.3-py2.7.egg', '/work/e290/e290/ebreitmo/radical.pilot.san
dbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016763.0001-pilot.0000/rp_install/lib/python2.7/site-packages', '/fs4/e2
90/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016763.0001-pilot.0000', '/work/y07/y0
7/cse/pycairo/1.10.0/lib/python2.7/site-packages', '/work/y07/y07/cse/pygobject/2.21.3/lib/python2.7/site-packages', '/work/y
07/y07/cse/pygtk/2.24.0/lib/python2.7/site-packages/gtk-2.0', '/work/y07/y07/cse/yaml/pyyaml/3.11/lib/python2.7/site-packages
', '/work/y07/y07/cse/python/modules/cython/0.21.1/lib/python2.7/site-packages', '/work/y07/y07/cse/mpi4py/1.3.1/lib/python2.
7/site-packages', '/opt/cray/sdb/1.0-1.0502.58450.3.27.ari/lib64/py', '/usr/local/packages/cse/bolt/0.6/modules', '/work/y07/
y07/cse/pygobject/2.21.3/lib/python2.7/site-packages/gtk-2.0', '/work/e290/shared/shared_pilot_ve_20150924/lib/python27.zip',
'/work/e290/shared/shared_pilot_ve_20150924/lib/python2.7', '/work/e290/shared/shared_pilot_ve_20150924/lib/python2.7/plat-l
inux2', '/work/e290/shared/shared_pilot_ve_20150924/lib/python2.7/lib-tk', '/work/e290/shared/shared_pilot_ve_20150924/lib/py
thon2.7/lib-old', '/work/e290/shared/shared_pilot_ve_20150924/lib/python2.7/lib-dynload', '/work/y07/y07/cse/python/2.7.6/lib
/python2.7', '/work/y07/y07/cse/python/2.7.6/lib/python2.7/plat-linux2', '/work/y07/y07/cse/python/2.7.6/lib/python2.7/lib-tk
', '/work/e290/shared/shared_pilot_ve_20150924/lib/python2.7/site-packages', '/opt/cray/sdb/1.0-1.0502.58450.3.27.ari/lib64/p
y']
python: 2.7.6 (default, Mar 10 2014, 14:13:45)
[GCC 4.8.1 20130531 (Cray Inc.)]
utils : 0.38 : /work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016763.0001-pi
lot.0000/rp_install/lib/python2.7/site-packages/radical/utils/__init__.pyc
saga : 0.38.1 : /work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016763.0001-p
ilot.0000/rp_install/lib/python2.7/site-packages/saga/__init__.pyc
pilot : 0.37.10 : /work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016763.0001-
pilot.0000/rp_install/lib/python2.7/site-packages/radical/pilot/__init__.pyc
type : multicore
gitid : $Id$
---------------------------------------------------------------------
startup agent agent_0 : /fs4/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016763.
0001-pilot.0000/agent_0.cfg
Agent config (/fs4/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016763.0001-pilot
.0000/agent_0.cfg):
{'agent_launch_method': 'APRUN',
'agent_layout': {'agent_0': {'bridges': ['agent_staging_input_queue',
'agent_scheduling_queue',
'agent_executing_queue',
'agent_staging_output_queue',
'agent_unschedule_pubsub',
'agent_reschedule_pubsub',
'agent_command_pubsub',
'agent_state_pubsub'],
'components': {'AgentExecutingComponent': 1,
'AgentSchedulingComponent': 1,
'AgentStagingInputComponent': 1,
'AgentStagingOutputComponent': 1},
'pull_units': True,
'sub_agents': [],
'target': 'local'}},
'agent_name': 'agent_0',
'bulk_collection_time': 1.0,
'clone': {'AgentExecutingComponent': {'input': 1, 'output': 1},
'AgentSchedulingComponent': {'input': 1, 'output': 1},
'AgentStagingInputComponent': {'input': 1, 'output': 1},
'AgentStagingOutputComponent': {'input': 1, 'output': 1},
'AgentWorker': {'input': 1, 'output': 1}},
'cores': 24,
'db_poll_sleeptime': 0.1,
'debug': 40,
'drop': {'AgentExecutingComponent': {'input': 1, 'output': 1},
'AgentSchedulingComponent': {'input': 1, 'output': 1},
'AgentStagingInputComponent': {'input': 1, 'output': 1},
'AgentStagingOutputComponent': {'input': 1, 'output': 1},
'AgentWorker': {'input': 1, 'output': 1}},
'heartbeat_interval': 10,
'lrms': 'PBSPRO',
'max_io_loglength': 1024,
'mongodb_url': 'mongodb://extasy:extasyproject@extasy-db.epcc.ed.ac.uk/radicalpilot',
'mpi_launch_method': 'APRUN',
'pilot_id': 'pilot.0000',
'runtime': 20,
'scheduler': 'CONTINUOUS',
'session_id': 'rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016763.0001',
'spawner': 'POPEN',
'staging_area': 'staging_area',
'staging_scheme': 'staging',
'task_launch_method': 'APRUN'}
FAILED startup
File "/work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016763.0001-pilot.0000
/rp_install/bin/radical-pilot-agent-multicore.py", line 6125, in bootstrap_3
bridges = start_bridges(cfg, log)
File "/work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016763.0001-pilot.0000
/rp_install/bin/radical-pilot-agent-multicore.py", line 5934, in start_bridges
bridge = _create_bridge(b)
File "/work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016763.0001-pilot.0000
/rp_install/bin/radical-pilot-agent-multicore.py", line 5921, in _create_bridge
return rpu.Queue.create(rpu.QUEUE_ZMQ, name, rpu.QUEUE_BRIDGE)
File "/work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016763.0001-pilot.0000
/rp_install/lib/python2.7/site-packages/radical/pilot/utils/queue.py", line 201, in create
return impl(flavor, name, role, address)
File "/work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016763.0001-pilot.0000
/rp_install/lib/python2.7/site-packages/radical/pilot/utils/queue.py", line 434, in __init__
raise RuntimeError ("bridge did not come up! (%s)" % e)
bootstrap_3 done
atexit
Caught SIGTERM. EXITING (<frame object at 0x14d99d0>)
File "/work/y07/y07/cse/python/2.7.6/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
func(*targs, **kargs)
File "/work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016763.0001-pilot.0000
/rp_install/bin/radical-pilot-agent-multicore.py", line 6065, in exit_handler
sys.exit(1)
cannot log error state in database!
sigterm
Caught SIGTERM. EXITING (<frame object at 0x144edb0>)
File "/work/y07/y07/cse/python/2.7.6/lib/python2.7/threading.py", line 1100, in _exitfunc
def _exitfunc(self):
File "/work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016763.0001-pilot.0000
/rp_install/bin/radical-pilot-agent-multicore.py", line 6075, in sigterm_handler
pilot_FAILED(msg='Caught SIGTERM. EXITING (%s)' % frame)
File "/work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016763.0001-pilot.0000
/rp_install/bin/radical-pilot-agent-multicore.py", line 291, in pilot_FAILED
print ru.get_trace()
cannot log error state in database!
sigterm
Caught SIGTERM. EXITING (<frame object at 0x1500840>)
File "/work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016763.0001-pilot.0000
/rp_install/bin/radical-pilot-agent-multicore.py", line 6226, in <module>
bootstrap_3()
File "/work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016763.0001-pilot.0000
/rp_install/bin/radical-pilot-agent-multicore.py", line 6125, in bootstrap_3
bridges = start_bridges(cfg, log)
File "/work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016763.0001-pilot.0000
/rp_install/bin/radical-pilot-agent-multicore.py", line 5934, in start_bridges
bridge = _create_bridge(b)
File "/work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016763.0001-pilot.0000
/rp_install/bin/radical-pilot-agent-multicore.py", line 5921, in _create_bridge
return rpu.Queue.create(rpu.QUEUE_ZMQ, name, rpu.QUEUE_BRIDGE)
File "/work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016763.0001-pilot.0000
/rp_install/lib/python2.7/site-packages/radical/pilot/utils/queue.py", line 201, in create
return impl(flavor, name, role, address)
File "/work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016763.0001-pilot.0000
/rp_install/lib/python2.7/site-packages/radical/pilot/utils/queue.py", line 429, in __init__
self._p.start()
File "/work/y07/y07/cse/python/2.7.6/lib/python2.7/multiprocessing/process.py", line 130, in start
self._popen = Popen(self)
File "/work/y07/y07/cse/python/2.7.6/lib/python2.7/multiprocessing/forking.py", line 126, in __init__
code = process_obj._bootstrap()
File "/work/y07/y07/cse/python/2.7.6/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/work/y07/y07/cse/python/2.7.6/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016763.0001-pilot.0000
/rp_install/lib/python2.7/site-packages/radical/pilot/utils/queue.py", line 417, in _bridge
events = dict(_uninterruptible(_poll.poll, 1000)) # timeout in ms
File "/work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016763.0001-pilot.0000
/rp_install/lib/python2.7/site-packages/radical/pilot/utils/queue.py", line 42, in _uninterruptible
return f(*args, **kwargs)
File "/work/e290/shared/shared_pilot_ve_20150924/lib/python2.7/site-packages/zmq/sugar/poll.py", line 101, in poll
return zmq_poll(self.sockets, timeout=timeout)
File "/work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016763.0001-pilot.0000
/rp_install/bin/radical-pilot-agent-multicore.py", line 6075, in sigterm_handler
pilot_FAILED(msg='Caught SIGTERM. EXITING (%s)' % frame)
File "/work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016763.0001-pilot.0000
/rp_install/bin/radical-pilot-agent-multicore.py", line 291, in pilot_FAILED
print ru.get_trace()
cannot log error state in database!
sigterm
raise RuntimeError ("bridge did not come up! (%s)" % e)
This is the (semi) root cause. Let's see what the new run does, to exclude a temporary issue.
python extasy_gromacs_lsdmap.py --RPconfig archer.rcfg --Kconfig gromacslsdmap.wcfg
================================================================================
EnsembleMD (0.3.14)
================================================================================
Starting Allocation ok
Verifying pattern ok
Starting pattern execution ok
--------------------------------------------------------------------------------
Executing simulation-analysis loop with 1 iterations on 48 allocated core(s) on 'epsrc.archer'
Job waiting on queue...
Job is now running !
Waiting for pre_loop step to complete. done
Iteration 1: Waiting for simulation tasks: md.gromacs to complete done
Iteration 1: Waiting for analysis tasks: md.pre_lsdmap to complete2015-11-26 18:12:37,547: radical.enmd.simulation_analysis_loop.static.default: MainProcess : Thread-3 : ERROR : ComputeUnit error: STDERR: gromacs(3):ERROR:105: Unable to locate a modulefile for 'gromacs/5.0.0'
sh: trjconv: command not found
, STDOUT: gromacs(3):ERROR:105: Unable to locate a modulefile for 'gromacs/5.0.0'
sh: trjconv: command not found
2015-11-26 18:12:37,548: radical.enmd.simulation_analysis_loop.static.default: MainProcess : Thread-3 : ERROR : Pattern execution FAILED.
2015-11-26 18:12:37,548: radical.pilot : MainProcess : Thread-3 : ERROR : unit manager controller thread caught system exit -- forcing application shutdown
Traceback (most recent call last):
File "/Users/elenabreitmoser/24Nov/lib/python2.7/site-packages/radical/pilot/controller/unit_manager_controller.py", line 261, in run
self.call_unit_state_callbacks(unit_id, new_state)
File "/Users/elenabreitmoser/24Nov/lib/python2.7/site-packages/radical/pilot/controller/unit_manager_controller.py", line 198, in call_unit_state_callbacks
cb(self._shared_data[unit_id]['facade_object'], new_state)
File "/Users/elenabreitmoser/24Nov/lib/python2.7/site-packages/radical/ensemblemd/exec_plugins/simulation_analysis_loop/static.py", line 141, in unit_state_cb
sys.exit(1)
SystemExit: 1
Execution interuptedTraceback (most recent call last):
File "/Users/elenabreitmoser/24Nov/lib/python2.7/site-packages/radical/ensemblemd/exec_plugins/simulation_analysis_loop/static.py", line 741, in execute_pattern
resource._umgr.wait_units()
File "/Users/elenabreitmoser/24Nov/lib/python2.7/site-packages/radical/pilot/unit_manager.py", line 697, in wait_units
time.sleep (0.5)
KeyboardInterrupt
Starting Deallocation
========================================
total 1408
-rw-r--r-- 1 ebreitmo e290 809859 Nov 26 14:35 saga-python-0.38.1.tar.gz
-rw-r--r-- 1 ebreitmo e290 104469 Nov 26 14:35 radical.utils-0.38.tar.gz
-rwxr-xr-x 1 ebreitmo e290 47400 Nov 26 14:35 bootstrap_1.sh
-rw-r--r-- 1 ebreitmo e290 230829 Nov 26 14:35 radical.pilot-0.37.10.tar.gz
-rw------- 1 ebreitmo e290 2439 Nov 26 14:35 agent_0.cfg
drwx--S--- 6 ebreitmo e290 4096 Nov 26 18:09 radical.utils-0.38
drwx--S--- 5 ebreitmo e290 4096 Nov 26 18:10 saga-python-0.38.1
drwx--S--- 5 ebreitmo e290 4096 Nov 26 18:10 radical.pilot-0.37.10
drwx--S--- 5 ebreitmo e290 4096 Nov 26 18:10 rp_install
-rw------- 1 ebreitmo e290 737 Nov 26 18:10 bootstrap_1.err
-rwxr-xr-x 1 ebreitmo e290 1692 Nov 26 18:11 bootstrap_2.sh
-rw------- 1 ebreitmo e290 5 Nov 26 18:11 agent_0.bootstrap_2.out
-rw------- 1 ebreitmo e290 0 Nov 26 18:11 agent_0.bootstrap_2.err
-rw------- 1 ebreitmo e290 608 Nov 26 18:11 agent_0.bootstrap_3.log
-rw------- 1 ebreitmo e290 0 Nov 26 18:11 agent_0.AgentWorker.0.log
-rw------- 1 ebreitmo e290 0 Nov 26 18:11 agent_0.SchedulerContinuous.0.log
-rw------- 1 ebreitmo e290 0 Nov 26 18:11 agent_0.SchedulerContinuous.0.child.log
-rw------- 1 ebreitmo e290 0 Nov 26 18:11 agent_0.AgentUpdateWorker.0.log
-rw------- 1 ebreitmo e290 0 Nov 26 18:11 agent_0.AgentUpdateWorker.0.child.log
-rw------- 1 ebreitmo e290 0 Nov 26 18:11 agent_0.AgentStagingOutputComponent.0.log
-rw------- 1 ebreitmo e290 0 Nov 26 18:11 agent_0.AgentStagingInputComponent.0.log
-rw------- 1 ebreitmo e290 0 Nov 26 18:11 agent_0.AgentStagingInputComponent.0.child.log
-rw------- 1 ebreitmo e290 0 Nov 26 18:11 agent_0.AgentHeartbeatWorker.0.log
-rw------- 1 ebreitmo e290 0 Nov 26 18:11 agent_0.AgentHeartbeatWorker.0.child.log
-rw------- 1 ebreitmo e290 0 Nov 26 18:11 agent_0.AgentExecutingComponent_POPEN.0.log
-rw------- 1 ebreitmo e290 0 Nov 26 18:11 agent_0.AgentExecutingComponent_POPEN.0.child.log
drwxr-sr-x 3 ebreitmo e290 4096 Nov 26 18:11 unit.000000
drwx--S--- 2 ebreitmo e290 4096 Nov 26 18:11 unit.000003
drwx--S--- 2 ebreitmo e290 4096 Nov 26 18:11 unit.000001
drwx--S--- 2 ebreitmo e290 4096 Nov 26 18:11 unit.000008
drwx--S--- 2 ebreitmo e290 4096 Nov 26 18:11 unit.000004
drwx--S--- 2 ebreitmo e290 4096 Nov 26 18:12 unit.000007
drwx--S--- 2 ebreitmo e290 4096 Nov 26 18:12 unit.000002
drwx--S--- 2 ebreitmo e290 4096 Nov 26 18:12 unit.000006
drwx--S--- 2 ebreitmo e290 4096 Nov 26 18:12 unit.000005
drwx--S--- 2 ebreitmo e290 4096 Nov 26 18:12 unit.000009
-rw------- 1 ebreitmo e290 5281 Nov 26 18:12 pilot.0000.log.tgz
-rw------- 1 ebreitmo e290 94075 Nov 26 18:12 bootstrap_1.out
-rw------- 1 ebreitmo e290 21987 Nov 26 18:12 agent_0.out
-rw------- 1 ebreitmo e290 1424 Nov 26 18:12 agent_0.err
-rw------- 1 ebreitmo e290 2863 Nov 26 18:12 agent_0.AgentWorker.0.child.log
-rw------- 1 ebreitmo e290 2603 Nov 26 18:12 agent_0.AgentStagingOutputComponent.0.child.log
-rw------- 1 ebreitmo e290 20703 Nov 26 18:12 agent_0.AgentExecutingWatcher_POPEN.0.log
================================================
more agent_0.out
...
File "/work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016765.0000-pilot
.0000/rp_install/bin/radical-pilot-agent-multicore.py", line 291, in pilot_FAILED
print ru.get_trace()
cannot log error state in database!
sigterm
Caught SIGTERM. EXITING (<frame object at 0x1522360>)
...
BTW, re: gromacs(3):ERROR:105: Unable to locate a modulefile for 'gromacs/5.0.0': there is no gromacs-5.0.0 on ARCHER any more. I thought this had already been taken care of?!
Hi, any news on this one?
As it's now down to a GROMACS issue, that's not really something I can comment on.
Elena, I think that to get the fix for the gromacs module load you need to install ensemblemd from the master branch, i.e.:
(extasy-test)mbp-ib:extasy-test ibethune$ pip install --upgrade git+https://github.com/radical-cybertools/radical.ensemblemd.git@master#egg=radical.ensemblemd
(extasy-test)mbp-ib:extasy-test ibethune$ ensemblemd-version
0.3.6-78-g621a019
Hm, I did this, still the same:
python extasy_gromacs_lsdmap.py --RPconfig archer.rcfg --Kconfig gromacslsdmap.wcfg
================================================================================
EnsembleMD (0.3.6-78-g621a019)
================================================================================
Starting Allocation ok
Verifying pattern ok
Starting pattern execution ok
--------------------------------------------------------------------------------
Executing simulation-analysis loop with 1 iterations on 48 allocated core(s) on 'epsrc.archer'
Job waiting on queue...
Job is now running !
Waiting for pre_loop step to complete. done
Iteration 1: Waiting for simulation tasks: md.gromacs to complete done
Iteration 1: Waiting for analysis tasks: md.pre_lsdmap to complete2015-12-02 10:02:46,231: radical.enmd.simulation_analysis_loop.static.default: MainProcess : Thread-3 : ERROR : ComputeUnit error: STDERR: gromacs(3):ERROR:105: Unable to locate a modulefile for 'gromacs/5.0.0'
sh: trjconv: command not found
, STDOUT: gromacs(3):ERROR:105: Unable to locate a modulefile for 'gromacs/5.0.0'
sh: trjconv: command not found
2015-12-02 10:02:46,231: radical.enmd.simulation_analysis_loop.static.default: MainProcess : Thread-3 : ERROR : Pattern execution FAILED.
2015-12-02 10:02:46,231: radical.pilot : MainProcess : Thread-3 : ERROR : unit manager controller thread caught system exit -- forcing application shutdown
Traceback (most recent call last):
File "/Users/elenabreitmoser/24Nov/lib/python2.7/site-packages/radical/pilot/controller/unit_manager_controller.py", line 261, in run
self.call_unit_state_callbacks(unit_id, new_state)
File "/Users/elenabreitmoser/24Nov/lib/python2.7/site-packages/radical/pilot/controller/unit_manager_controller.py", line 198, in call_unit_state_callbacks
cb(self._shared_data[unit_id]['facade_object'], new_state)
File "/Users/elenabreitmoser/24Nov/lib/python2.7/site-packages/radical/ensemblemd/exec_plugins/simulation_analysis_loop/static.py", line 141, in unit_state_cb
sys.exit(1)
SystemExit: 1
Execution interuptedTraceback (most recent call last):
File "/Users/elenabreitmoser/24Nov/lib/python2.7/site-packages/radical/ensemblemd/exec_plugins/simulation_analysis_loop/static.py", line 743, in execute_pattern
resource._umgr.wait_units(uids)
File "/Users/elenabreitmoser/24Nov/lib/python2.7/site-packages/radical/pilot/unit_manager.py", line 697, in wait_units
time.sleep (0.5)
KeyboardInterrupt
Starting DeallocationResource allocation cancelled. You probably ran out of walltime
done
In master, it shouldn't load gromacs/5.0.0 (https://github.com/radical-cybertools/radical.ensemblemd/blob/master/src/radical/ensemblemd/kernel_plugins/md/gromacs.py#L52). Could you try it with a fresh installation (and not "pip install --upgrade ... ") ?
Another thing to be fixed is the version number: master should print 0.3.14 (and not 0.3.6-78-g621a019). I will check that.
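For reference, a string like 0.3.6-78-g621a019 has the shape of `git describe` output (tag, number of commits since that tag, abbreviated commit hash), which would explain why a master install does not print a plain release number. A minimal sketch for telling the two apart; `parse_describe` is a hypothetical helper for illustration, not part of ensemblemd:

```python
import re

# Shape of `git describe` output: <tag>-<commits since tag>-g<abbreviated hash>
DESCRIBE_RE = re.compile(r"^(?P<tag>.+)-(?P<commits>\d+)-g(?P<sha>[0-9a-f]+)$")

def parse_describe(version):
    """Split a git-describe style version; return None for a plain release string."""
    match = DESCRIBE_RE.match(version)
    if match is None:
        return None
    return {
        "tag": match.group("tag"),
        "commits_since_tag": int(match.group("commits")),
        "sha": match.group("sha"),
    }

print(parse_describe("0.3.6-78-g621a019"))  # tag, commit count and hash
print(parse_describe("0.3.14"))             # plain release string
```

If the installed version parses as tag 0.3.6 plus 78 commits, the install came from a git checkout whose most recent reachable tag is 0.3.6, rather than from a tagged release.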
On ARCHER, after I load the default gromacs module:
vb224@eslogin006:~> env | grep GMX
GMX_DIR=/work/y07/y07/gmx/5.1-phase2
GMXLIB=/work/y07/y07/gmx/5.1-phase2/share/gromacs/top
GMX_INCLUDE_OPTS=/work/y07/y07/gmx/5.1-phase2/include
vb224@eslogin006:~> cd /work/y07/y07/gmx/5.1-phase2
vb224@eslogin006:/work/y07/y07/gmx/5.1-phase2> cd bin/
vb224@eslogin006:/work/y07/y07/gmx/5.1-phase2/bin> ls
demux.pl gmx-completion.bash gmx-completion-gmx_d.bash gmx-completion-mdrun_mpi_d.bash GMXRC GMXRC.csh mdrun_mpi xplor2gmx.pl
gmx gmx-completion-gmx.bash gmx-completion-mdrun_mpi.bash gmx_d GMXRC.bash GMXRC.zsh mdrun_mpi_d
I am not sure what the analogous commands are for the "grompp" and "mdrun" in gromacs/5.1.1.
Also, I would have expected the simulations to fail (no data would be produced). I'll see why no error was reported.
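One way such a silent failure could be surfaced is to check, after a task finishes, that its expected output files exist and are non-empty. This is only a sketch of that idea under my own assumptions; `missing_or_empty` and the file names in the demo are illustrative and not part of ensemblemd or RP:

```python
import os
import tempfile

def missing_or_empty(paths):
    """Return the paths that do not exist as files or have zero size."""
    return [p for p in paths
            if not os.path.isfile(p) or os.path.getsize(p) == 0]

# Demo in a throwaway directory standing in for a unit sandbox.
unit_dir = tempfile.mkdtemp()
out_gro = os.path.join(unit_dir, "out.gro")
open(out_gro, "w").close()               # zero-byte output file
stdout = os.path.join(unit_dir, "STDOUT")
with open(stdout, "w") as fh:
    fh.write("some output\n")

# Only the empty out.gro is reported.
print(missing_or_empty([out_gro, stdout]))
```

A task whose check returns a non-empty list could then be treated as failed even if its exit code was zero.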
pip install radical.ensemblemd
python extasy_gromacs_lsdmap.py --RPconfig archer.rcfg --Kconfig gromacslsdmap.wcfg
================================================================================
EnsembleMD (0.3.14)
================================================================================
Starting Allocation ok
Verifying pattern ok
Starting pattern execution ok
--------------------------------------------------------------------------------
Executing simulation-analysis loop with 1 iterations on 24 allocated core(s) on 'epsrc.archer'
Job waiting on queue...
Job is now running !
Waiting for pre_loop step to complete. done
Iteration 1: Waiting for simulation tasks: md.gromacs to complete done
Iteration 1: Waiting for analysis tasks: md.pre_lsdmap to complete2015-12-10 16:03:16,474: radical.enmd.simulation_analysis_loop.static.default: MainProcess : Thread-3 : ERROR : ComputeUnit error: STDERR: gromacs(3):ERROR:105: Unable to locate a modulefile for 'gromacs/5.0.0'
sh: trjconv: command not found
, STDOUT: gromacs(3):ERROR:105: Unable to locate a modulefile for 'gromacs/5.0.0'
sh: trjconv: command not found
2015-12-10 16:03:16,474: radical.enmd.simulation_analysis_loop.static.default: MainProcess : Thread-3 : ERROR : Pattern execution FAILED.
2015-12-10 16:03:16,474: radical.pilot : MainProcess : Thread-3 : ERROR : unit manager controller thread caught system exit -- forcing application shutdown
Traceback (most recent call last):
File "/Users/elenabreitmoser/10Dec/lib/python2.7/site-packages/radical/pilot/controller/unit_manager_controller.py", line 261, in run
self.call_unit_state_callbacks(unit_id, new_state)
File "/Users/elenabreitmoser/10Dec/lib/python2.7/site-packages/radical/pilot/controller/unit_manager_controller.py", line 198, in call_unit_state_callbacks
cb(self._shared_data[unit_id]['facade_object'], new_state)
File "/Users/elenabreitmoser/10Dec/lib/python2.7/site-packages/radical/ensemblemd/exec_plugins/simulation_analysis_loop/static.py", line 141, in unit_state_cb
sys.exit(1)
SystemExit: 1
Execution interuptedTraceback (most recent call last):
File "/Users/elenabreitmoser/10Dec/lib/python2.7/site-packages/radical/ensemblemd/exec_plugins/simulation_analysis_loop/static.py", line 741, in execute_pattern
resource._umgr.wait_units()
File "/Users/elenabreitmoser/10Dec/lib/python2.7/site-packages/radical/pilot/unit_manager.py", line 697, in wait_units
time.sleep (0.5)
KeyboardInterrupt
Starting Deallocation done
ls -lrt unit.000007
total 32
lrwxrwxrwx 1 ebreitmo e290 139 Dec 10 16:02 topol.top -> /work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016779.0000-pilot.0000/unit.000000/topol.top
lrwxrwxrwx 1 ebreitmo e290 145 Dec 10 16:02 start.gro -> /work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016779.0000-pilot.0000/unit.000000/temp/start6.gro
lrwxrwxrwx 1 ebreitmo e290 136 Dec 10 16:02 run.py -> /work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016779.0000-pilot.0000/unit.000000/run.py
-rwx------ 1 ebreitmo e290 780 Dec 10 16:02 radical_pilot_cu_launch_script.sh
lrwxrwxrwx 1 ebreitmo e290 140 Dec 10 16:02 grompp.mdp -> /work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016779.0000-pilot.0000/unit.000000/grompp.mdp
-rw------- 1 ebreitmo e290 788 Dec 10 16:02 run.sh
-rw------- 1 ebreitmo e290 95 Dec 10 16:02 STDOUT
-rw------- 1 ebreitmo e290 1620 Dec 10 16:02 STDERR
-rw------- 1 ebreitmo e290 0 Dec 10 16:02 out.gro
Is the out.gro file empty, or does it have contents?
empty
OK. The executables need to be changed to gmx grompp and gmx mdrun (GROMACS 5.1.*).
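To illustrate the kind of change meant here: with GROMACS 5.1 the tools are invoked as subcommands of the gmx wrapper. A rough sketch of rewriting a legacy command line accordingly; `to_gmx_command` and `LEGACY_TOOLS` are hypothetical names, and this is not the actual change made in the helper scripts:

```python
# Legacy standalone tool names mentioned in this thread.
LEGACY_TOOLS = {"grompp", "mdrun", "trjconv"}

def to_gmx_command(argv):
    """Rewrite ['grompp', '-f', ...] as ['gmx', 'grompp', '-f', ...].

    Commands that do not start with a known legacy tool are returned unchanged.
    """
    if argv and argv[0] in LEGACY_TOOLS:
        return ["gmx"] + list(argv)
    return list(argv)

print(to_gmx_command(["grompp", "-f", "grompp.mdp", "-c", "start.gro"]))
print(to_gmx_command(["gmx", "mdrun"]))
```

Note that the bin/ listing above also shows a separate mdrun_mpi binary on ARCHER, so which mdrun invocation is appropriate there may depend on the installation.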
Did you update it? Today I re-installed from scratch and get something new...
python extasy_gromacs_lsdmap.py --RPconfig archer.rcfg --Kconfig gromacslsdmap.wcfg
================================================================================
EnsembleMD (0.3.14)
================================================================================
Starting Allocation2015-12-14 10:59:13,913: radical.pilot : MainProcess : Thread-1 : ERROR : Couldn't call manager callback (no pilot instance)
2015-12-14 10:59:14,019: radical.enmd.SingleClusterEnvironment: MainProcess : MainThread : ERROR : Fatal error during resource allocation: too many namespaces/collections.
Traceback (most recent call last):
File "/Users/elenabreitmoser/14Dec/lib/python2.7/site-packages/radical/ensemblemd/single_cluster_environment.py", line 188, in allocate
scheduler=radical.pilot.SCHED_DIRECT_SUBMISSION)
File "/Users/elenabreitmoser/14Dec/lib/python2.7/site-packages/radical/pilot/unit_manager.py", line 121, in __init__
session=self._session)
File "/Users/elenabreitmoser/14Dec/lib/python2.7/site-packages/radical/pilot/controller/unit_manager_controller.py", line 80, in __init__
output_transfer_workers=output_transfer_workers)
File "/Users/elenabreitmoser/14Dec/lib/python2.7/site-packages/radical/pilot/db/database.py", line 541, in insert_unit_manager
"output_transfer_workers": output_transfer_workers }
File "/Users/elenabreitmoser/14Dec/lib/python2.7/site-packages/pymongo/collection.py", line 410, in insert
_check_write_command_response(results)
File "/Users/elenabreitmoser/14Dec/lib/python2.7/site-packages/pymongo/helpers.py", line 209, in _check_write_command_response
raise OperationFailure(error.get("errmsg"), error.get("code"), error)
OperationFailure: too many namespaces/collections
Allocation failed: too many namespaces/collections
Allocation failed: too many namespaces/collections
This means that the database you are using is full.
Is the database set by the user, or is that an ExTASY-wide configuration?
The solution is either to clean up, or use a different namespace.
In archer.rcfg
DBURL = 'mongodb://extasy:extasyproject@extasy-db.epcc.ed.ac.uk/radicalpilot' #MongoDB link to be used for coordination purposes
Thanks.
So this looks like a shared db. Either somebody needs to clean up that db, or individual accounts need to be created. In principle, if you had an account on that db, you could use the following:
DBURL = 'mongodb://user:password@extasy-db.epcc.ed.ac.uk/elena'
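As a small illustration of "use a different namespace": the last path component of the URL is the MongoDB database name, so pointing DBURL at another database only changes that part. A sketch, assuming the target database exists and the account may write to it; `with_database` is a hypothetical helper and the credentials are the placeholders from the line above:

```python
try:
    from urllib.parse import urlsplit, urlunsplit   # Python 3
except ImportError:
    from urlparse import urlsplit, urlunsplit       # Python 2

def with_database(dburl, dbname):
    """Return dburl with its path (the MongoDB database name) replaced by dbname."""
    parts = urlsplit(dburl)
    return urlunsplit((parts.scheme, parts.netloc, "/" + dbname,
                       parts.query, parts.fragment))

shared = "mongodb://user:password@extasy-db.epcc.ed.ac.uk/radicalpilot"
print(with_database(shared, "elena"))
```

The resulting string is what would go into DBURL in archer.rcfg.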
Thanks, it's cleared now and I am back to the original issue.
I have made the change in the helper scripts. The executable is now gmx. Please note that if you are getting "Unable to locate a modulefile for 'gromacs/5.0.0'", you might not have installed enmd properly or from the correct branch. Please use the master branch, which loads the default version on ARCHER.
Thanks Vivek, I will re-install. I try to follow the documentation and do pip install radical.ensemblemd
more agent_0.bootstrap_3.log
...
Traceback (most recent call last):
File "/work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016784.0000-pilot.0000/rp
_install/bin/radical-pilot-agent-multicore.py", line 6125, in bootstrap_3
bridges = start_bridges(cfg, log)
File "/work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016784.0000-pilot.0000/rp
_install/bin/radical-pilot-agent-multicore.py", line 5934, in start_bridges
bridge = _create_bridge(b)
File "/work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016784.0000-pilot.0000/rp
_install/bin/radical-pilot-agent-multicore.py", line 5921, in _create_bridge
return rpu.Queue.create(rpu.QUEUE_ZMQ, name, rpu.QUEUE_BRIDGE)
File "/work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016784.0000-pilot.0000/rp
_install/lib/python2.7/site-packages/radical/pilot/utils/queue.py", line 201, in create
return impl(flavor, name, role, address)
File "/work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016784.0000-pilot.0000/rp
_install/lib/python2.7/site-packages/radical/pilot/utils/queue.py", line 434, in __init__
raise RuntimeError ("bridge did not come up! (%s)" % e)
RuntimeError: bridge did not come up! ()
python extasy_gromacs_lsdmap.py --RPconfig archer.rcfg --Kconfig gromacslsdmap.wcfg
================================================================================
EnsembleMD (0.3.14)
================================================================================
Starting Allocation ok
Verifying pattern ok
Starting pattern execution ok
--------------------------------------------------------------------------------
Executing simulation-analysis loop with 1 iterations on 24 allocated core(s) on 'epsrc.archer'
Job waiting on queue...2015-12-15 10:38:19,218: radical.enmd.SingleClusterEnvironment: MainProcess : Thread-1 : ERROR : Resource error: real 1450175889.884460 sec | user 0.196 sec | system 0.616 sec | mem 40044.00 kB
2015-12-15 10:38:19,218: radical.enmd.SingleClusterEnvironment: MainProcess : Thread-1 : ERROR : Pattern execution FAILED.
2015-12-15 10:38:19,218: radical.pilot : MainProcess : Thread-1 : ERROR : pilot manager controller thread caught system exit -- forcing application shutdown
Traceback (most recent call last):
File "/Users/elenabreitmoser/15Dec/lib/python2.7/site-packages/radical/pilot/controller/pilot_manager_controller.py", line 335, in run
self.call_callbacks(pilot_id, new_state)
File "/Users/elenabreitmoser/15Dec/lib/python2.7/site-packages/radical/pilot/controller/pilot_manager_controller.py", line 258, in call_callbacks
cb(self._shared_data[pilot_id]['facade_object'](), new_state)
File "/Users/elenabreitmoser/15Dec/lib/python2.7/site-packages/radical/ensemblemd/single_cluster_environment.py", line 127, in pilot_state_cb
sys.exit(2)
SystemExit: 2
2015-12-15 10:38:19,527: radical.enmd.SingleClusterEnvironment: MainProcess : MainThread : ERROR : Fatal error during execution: .
Fatal error during execution: .Starting Deallocation
raise RuntimeError ("bridge did not come up! (%s)" % e)
RuntimeError: bridge did not come up! ()
The pilot seems to be failing while the job was pending.
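The "caught system exit" message in the log above is consistent with how Python treats sys.exit() outside the main thread: SystemExit is raised only in the calling thread, so the loop around the callback has to catch it and shut the application down itself. A minimal standalone sketch of that behaviour follows; state_callback and controller_loop are hypothetical stand-ins, not RP or EnsembleMD code.

```python
import sys
import threading

caught = []  # exit codes seen by the controller thread


def state_callback():
    # Stand-in for a pilot state callback that gives up on failure.
    sys.exit(2)


def controller_loop():
    # Stand-in for the controller thread that invokes callbacks.
    try:
        state_callback()
    except SystemExit as exc:
        # SystemExit surfaces here, in the worker thread only.
        caught.append(exc.code)


worker = threading.Thread(target=controller_loop)
worker.start()
worker.join()

print("controller thread caught exit code: %s" % caught[0])
print("main thread is still running")
```

The main thread keeps running after the worker has caught the exit, which is why a forced shutdown is needed to stop the whole application.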
Elena,
could you please make the log files etc. readable in the sandbox, or copy them somewhere? I would like to have a look at those. Indeed, the pilot seems to be failing, as it could not bring up its communication infrastructure...
I copied everything to /epsrc/e290/e290/ebreitmo on ARCHER.
Could you also set read permissions on all the files, please? I get a "permission denied" error when trying to read them.
Oops, sorry - hopefully ok now.
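For reference, one way to open up a copied sandbox for reading is to add read bits on every file and read plus search bits on every directory. The sketch below is a generic illustration, not part of RP; the helper names add_bits and make_readable are hypothetical.

```python
import os
import stat

READ_ALL = stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH
SEARCH_ALL = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH


def add_bits(path, bits):
    """OR the given permission bits into the current mode of path."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode | bits)


def make_readable(root):
    """Grant read access to everyone on root and everything below it."""
    add_bits(root, READ_ALL | SEARCH_ALL)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):  # leave symlinks alone
                # Directories need the search bit to be entered.
                add_bits(path, READ_ALL | SEARCH_ALL)
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                add_bits(path, READ_ALL)
```

Calling make_readable on the directory holding the copy would cover files such as agent_0.out, which the listing shows as -rw-------. On the command line, chmod -R a+rX on the same directory has the same effect. The parent directories must be searchable as well.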
Thanks Elena! Alas, I could not really find anything helpful. The RP tests seem to run on ARCHER just fine (with the RP devel branch). Vivek, would it be possible to isolate the issue, or to create a step-by-step guide to reproduce it?
Yup, I'll try to run the same on ARCHER and see if I can reproduce this.
I have attempted this multiple times, but I am not able to reproduce this specific error ("bridge did not come up"). I am not sure how to debug this.
@ebreitmo are you able to run any simple RP scripts?
Also, could you check whether you get this on Stampede? That might help isolate the issue.
I have no account on Stampede.
Is there any script in particular I should run for testing?
Changes to the script:
c = radical.pilot.Context('ssh')
c.user_id = "<username>"
session.add_context(c)
pdesc.resource = "epsrc.archer"
pdesc.cores = 24
pdesc.runtime = 20 # minutes
python simple_bot.py
File "simple_bot.py", line 151
finally:
^
SyntaxError: invalid syntax
Hmm, I do not see any syntax error there. Can you run the following:
python examples/03_multiple_pilots.py epsrc.archer
No settings in the code are needed, assuming you have passwordless ssh set up on command line...
Thanks!
python examples/03_multiple_pilots.py
================================================================================
Getting Started (RP version 0.37.10)
================================================================================
create session rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016787.0000 ok
read config ok
--------------------------------------------------------------------------------
submit pilots
create pilot manager ok
create pilot descriptions \
create pilot description [local.localhost:64] ok
ok
submit 1 pilot(s) . ok
--------------------------------------------------------------------------------
submit units
create unit manager ok
add 1 pilot(s) ok
create 128 unit description(s)
........................................................................
........................................................ ok
submit 128 unit(s)
........................................................................
........................................................ ok
--------------------------------------------------------------------------------
gather results
wait for 128 unit(s)
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++|
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++ ok
* unit.000000: Done, exit: 0, out: pilot.0000
* unit.000001: Done, exit: 0, out: pilot.0000
* unit.000002: Done, exit: 0, out: pilot.0000
* unit.000003: Done, exit: 0, out: pilot.0000
* unit.000004: Done, exit: 0, out: pilot.0000
* unit.000005: Done, exit: 0, out: pilot.0000
* unit.000006: Done, exit: 0, out: pilot.0000
* unit.000007: Done, exit: 0, out: pilot.0000
* unit.000008: Done, exit: 0, out: pilot.0000
* unit.000009: Done, exit: 0, out: pilot.0000
* unit.000010: Done, exit: 0, out: pilot.0000
* unit.000011: Done, exit: 0, out: pilot.0000
* unit.000012: Done, exit: 0, out: pilot.0000
* unit.000013: Done, exit: 0, out: pilot.0000
* unit.000014: Done, exit: 0, out: pilot.0000
* unit.000015: Done, exit: 0, out: pilot.0000
* unit.000016: Done, exit: 0, out: pilot.0000
* unit.000017: Done, exit: 0, out: pilot.0000
* unit.000018: Done, exit: 0, out: pilot.0000
* unit.000019: Done, exit: 0, out: pilot.0000
* unit.000020: Done, exit: 0, out: pilot.0000
* unit.000021: Done, exit: 0, out: pilot.0000
* unit.000022: Done, exit: 0, out: pilot.0000
* unit.000023: Done, exit: 0, out: pilot.0000
* unit.000024: Done, exit: 0, out: pilot.0000
* unit.000025: Done, exit: 0, out: pilot.0000
* unit.000026: Done, exit: 0, out: pilot.0000
* unit.000027: Done, exit: 0, out: pilot.0000
* unit.000028: Done, exit: 0, out: pilot.0000
* unit.000029: Done, exit: 0, out: pilot.0000
* unit.000030: Done, exit: 0, out: pilot.0000
* unit.000031: Done, exit: 0, out: pilot.0000
* unit.000032: Done, exit: 0, out: pilot.0000
* unit.000033: Done, exit: 0, out: pilot.0000
* unit.000034: Done, exit: 0, out: pilot.0000
* unit.000035: Done, exit: 0, out: pilot.0000
* unit.000036: Done, exit: 0, out: pilot.0000
* unit.000037: Done, exit: 0, out: pilot.0000
* unit.000038: Done, exit: 0, out: pilot.0000
* unit.000039: Done, exit: 0, out: pilot.0000
* unit.000040: Done, exit: 0, out: pilot.0000
* unit.000041: Done, exit: 0, out: pilot.0000
* unit.000042: Done, exit: 0, out: pilot.0000
* unit.000043: Done, exit: 0, out: pilot.0000
* unit.000044: Done, exit: 0, out: pilot.0000
* unit.000045: Done, exit: 0, out: pilot.0000
* unit.000046: Done, exit: 0, out: pilot.0000
* unit.000047: Done, exit: 0, out: pilot.0000
* unit.000048: Done, exit: 0, out: pilot.0000
* unit.000049: Done, exit: 0, out: pilot.0000
* unit.000050: Done, exit: 0, out: pilot.0000
* unit.000051: Done, exit: 0, out: pilot.0000
* unit.000052: Done, exit: 0, out: pilot.0000
* unit.000053: Done, exit: 0, out: pilot.0000
* unit.000054: Done, exit: 0, out: pilot.0000
* unit.000055: Done, exit: 0, out: pilot.0000
* unit.000056: Done, exit: 0, out: pilot.0000
* unit.000057: Done, exit: 0, out: pilot.0000
* unit.000058: Done, exit: 0, out: pilot.0000
* unit.000059: Done, exit: 0, out: pilot.0000
* unit.000060: Done, exit: 0, out: pilot.0000
* unit.000061: Done, exit: 0, out: pilot.0000
* unit.000062: Done, exit: 0, out: pilot.0000
* unit.000063: Done, exit: 0, out: pilot.0000
* unit.000064: Done, exit: 0, out: pilot.0000
* unit.000065: Done, exit: 0, out: pilot.0000
* unit.000066: Done, exit: 0, out: pilot.0000
* unit.000067: Done, exit: 0, out: pilot.0000
* unit.000068: Done, exit: 0, out: pilot.0000
* unit.000069: Done, exit: 0, out: pilot.0000
* unit.000070: Done, exit: 0, out: pilot.0000
* unit.000071: Done, exit: 0, out: pilot.0000
* unit.000072: Done, exit: 0, out: pilot.0000
* unit.000073: Done, exit: 0, out: pilot.0000
* unit.000074: Done, exit: 0, out: pilot.0000
* unit.000075: Done, exit: 0, out: pilot.0000
* unit.000076: Done, exit: 0, out: pilot.0000
* unit.000077: Done, exit: 0, out: pilot.0000
* unit.000078: Done, exit: 0, out: pilot.0000
* unit.000079: Done, exit: 0, out: pilot.0000
* unit.000080: Done, exit: 0, out: pilot.0000
* unit.000081: Done, exit: 0, out: pilot.0000
* unit.000082: Done, exit: 0, out: pilot.0000
* unit.000083: Done, exit: 0, out: pilot.0000
* unit.000084: Done, exit: 0, out: pilot.0000
* unit.000085: Done, exit: 0, out: pilot.0000
* unit.000086: Done, exit: 0, out: pilot.0000
* unit.000087: Done, exit: 0, out: pilot.0000
* unit.000088: Done, exit: 0, out: pilot.0000
* unit.000089: Done, exit: 0, out: pilot.0000
* unit.000090: Done, exit: 0, out: pilot.0000
* unit.000091: Done, exit: 0, out: pilot.0000
* unit.000092: Done, exit: 0, out: pilot.0000
* unit.000093: Done, exit: 0, out: pilot.0000
* unit.000094: Done, exit: 0, out: pilot.0000
* unit.000095: Done, exit: 0, out: pilot.0000
* unit.000096: Done, exit: 0, out: pilot.0000
* unit.000097: Done, exit: 0, out: pilot.0000
* unit.000098: Done, exit: 0, out: pilot.0000
* unit.000099: Done, exit: 0, out: pilot.0000
* unit.000100: Done, exit: 0, out: pilot.0000
* unit.000101: Done, exit: 0, out: pilot.0000
* unit.000102: Done, exit: 0, out: pilot.0000
* unit.000103: Done, exit: 0, out: pilot.0000
* unit.000104: Done, exit: 0, out: pilot.0000
* unit.000105: Done, exit: 0, out: pilot.0000
* unit.000106: Done, exit: 0, out: pilot.0000
* unit.000107: Done, exit: 0, out: pilot.0000
* unit.000108: Done, exit: 0, out: pilot.0000
* unit.000109: Done, exit: 0, out: pilot.0000
* unit.000110: Done, exit: 0, out: pilot.0000
* unit.000111: Done, exit: 0, out: pilot.0000
* unit.000112: Done, exit: 0, out: pilot.0000
* unit.000113: Done, exit: 0, out: pilot.0000
* unit.000114: Done, exit: 0, out: pilot.0000
* unit.000115: Done, exit: 0, out: pilot.0000
* unit.000116: Done, exit: 0, out: pilot.0000
* unit.000117: Done, exit: 0, out: pilot.0000
* unit.000118: Done, exit: 0, out: pilot.0000
* unit.000119: Done, exit: 0, out: pilot.0000
* unit.000120: Done, exit: 0, out: pilot.0000
* unit.000121: Done, exit: 0, out: pilot.0000
* unit.000122: Done, exit: 0, out: pilot.0000
* unit.000123: Done, exit: 0, out: pilot.0000
* unit.000124: Done, exit: 0, out: pilot.0000
* unit.000125: Done, exit: 0, out: pilot.0000
* unit.000126: Done, exit: 0, out: pilot.0000
* unit.000127: Done, exit: 0, out: pilot.0000
* pilot.0000 : 128
* total : 128
--------------------------------------------------------------------------------
finalize
closing session rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016787.0000 \
close pilot manager \
wait for 1 pilot(s) * ok
ok
close unit manager ok
session lifetime: 84.6s ok
--------------------------------------------------------------------------------
Hi there, any news?
Alas, I am unable to reproduce the problem in any way. Vivek, anything new from your end?
I tried again and can't reproduce this. Is there any chance of getting access to this machine, or to any machine where you can reproduce this?
I also tried again and can't reproduce it. Both Iain and I now fail at a different stage. I will open a new ticket.
I installed RP from scratch. Then I exported RADICAL_ENMD_VERBOSE=REPORT (which is not the default, although as I understand section 6.2 of the documentation it is expected to be).
python extasy_gromacs_lsdmap.py --RPconfig archer.rcfg --Kconfig gromacslsdmap.wcfg
EnsembleMD (0.3.14)
Starting Allocation ok
Verifying pattern ok
Starting pattern execution ok
Executing simulation-analysis loop with 1 iterations on 24 allocated core(s) on 'epsrc.archer'
Job waiting on queue...2015-11-24 12:09:40,014: radical.enmd.SingleClusterEnvironment: MainProcess : Thread-1 : ERROR : Resource error: real 1448366970.858693 sec | user 0.216 sec | system 0.332 sec | mem 40020.00 kB
2015-11-24 12:09:40,014: radical.enmd.SingleClusterEnvironment: MainProcess : Thread-1 : ERROR : Pattern execution FAILED.
2015-11-24 12:09:40,014: radical.pilot : MainProcess : Thread-1 : ERROR : pilot manager controller thread caught system exit -- forcing application shutdown
Traceback (most recent call last):
File "/Users/elenabreitmoser/24Nov/lib/python2.7/site-packages/radical/pilot/controller/pilot_manager_controller.py", line 335, in run
self.call_callbacks(pilot_id, new_state)
File "/Users/elenabreitmoser/24Nov/lib/python2.7/site-packages/radical/pilot/controller/pilot_manager_controller.py", line 258, in call_callbacks
cb(self._shared_data[pilot_id]['facade_object'](), new_state)
File "/Users/elenabreitmoser/24Nov/lib/python2.7/site-packages/radical/ensemblemd/single_cluster_environment.py", line 127, in pilot_state_cb
sys.exit(2)
SystemExit: 2
2015-11-24 12:09:40,496: radical.enmd.SingleClusterEnvironment: MainProcess : MainThread : ERROR : Fatal error during execution: .
Fatal error during execution: .Starting Deallocation
2015-11-24 12:09:40,496: radical.enmd.SingleClusterEnvironment: MainProcess : MainThread : ERROR : Fatal error during execution: .
Fatal error: .
File "/Users/elenabreitmoser/24Nov/lib/python2.7/site-packages/radical/ensemblemd/single_cluster_environment.py", line 244, in run
plugin.execute_pattern(pattern, self)
File "/Users/elenabreitmoser/24Nov/lib/python2.7/site-packages/radical/ensemblemd/exec_plugins/simulation_analysis_loop/static.py", line 156, in execute_pattern
resource._pmgr.wait_pilots(resource._pilot.uid,'Active')
File "/Users/elenabreitmoser/24Nov/lib/python2.7/site-packages/radical/pilot/pilot_manager.py", line 519, in wait_pilots
time.sleep(0.5)