radical-cybertools / ExTASY

MDEnsemble

failure of rp 0.3.14 gromacs/lsdmap on archer #219

Closed ebreitmo closed 8 years ago

ebreitmo commented 8 years ago

I installed RP from scratch, then ran export RADICAL_ENMD_VERBOSE=REPORT (this is not the default, although as I understand section 6.2 of the docs it is expected to be).

python extasy_gromacs_lsdmap.py --RPconfig archer.rcfg --Kconfig gromacslsdmap.wcfg

EnsembleMD (0.3.14)

Starting Allocation ok
Verifying pattern ok

Starting pattern execution ok

Executing simulation-analysis loop with 1 iterations on 24 allocated core(s) on 'epsrc.archer'

Job waiting on queue...2015-11-24 12:09:40,014: radical.enmd.SingleClusterEnvironment: MainProcess : Thread-1 : ERROR : Resource error: real 1448366970.858693 sec | user 0.216 sec | system 0.332 sec | mem 40020.00 kB
2015-11-24 12:09:40,014: radical.enmd.SingleClusterEnvironment: MainProcess : Thread-1 : ERROR : Pattern execution FAILED.
2015-11-24 12:09:40,014: radical.pilot : MainProcess : Thread-1 : ERROR : pilot manager controller thread caught system exit -- forcing application shutdown
Traceback (most recent call last):
  File "/Users/elenabreitmoser/24Nov/lib/python2.7/site-packages/radical/pilot/controller/pilot_manager_controller.py", line 335, in run
    self.call_callbacks(pilot_id, new_state)
  File "/Users/elenabreitmoser/24Nov/lib/python2.7/site-packages/radical/pilot/controller/pilot_manager_controller.py", line 258, in call_callbacks
    cb(self._shared_data[pilot_id]['facade_object'](), new_state)
  File "/Users/elenabreitmoser/24Nov/lib/python2.7/site-packages/radical/ensemblemd/single_cluster_environment.py", line 127, in pilot_state_cb
    sys.exit(2)
SystemExit: 2
2015-11-24 12:09:40,496: radical.enmd.SingleClusterEnvironment: MainProcess : MainThread : ERROR : Fatal error during execution: .
Fatal error during execution: .Starting Deallocation2015-11-24 12:09:40,496: radical.enmd.SingleClusterEnvironment: MainProcess : MainThread : ERROR : Fatal error during execution: .
Fatal error: .
  File "/Users/elenabreitmoser/24Nov/lib/python2.7/site-packages/radical/ensemblemd/single_cluster_environment.py", line 244, in run
    plugin.execute_pattern(pattern, self)
  File "/Users/elenabreitmoser/24Nov/lib/python2.7/site-packages/radical/ensemblemd/exec_plugins/simulation_analysis_loop/static.py", line 156, in execute_pattern
    resource._pmgr.wait_pilots(resource._pilot.uid,'Active')
  File "/Users/elenabreitmoser/24Nov/lib/python2.7/site-packages/radical/pilot/pilot_manager.py", line 519, in wait_pilots
    time.sleep(0.5)

    done 
marksantcroos commented 8 years ago

The title says RP 0.3.14, but that looks like an EnsembleMD version. Which version of RP is this (i.e. what's the output of radicalpilot-version)?
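To keep the RP and EnsembleMD version numbers apart, the versions of the whole stack can be collected in one go. This is only a minimal sketch: it assumes Python 3.8+ (for importlib.metadata, whereas this thread runs Python 2.7), the distribution names are assumptions based on the package names used in this thread, and stack_versions is a hypothetical helper that is not part of RP.

```python
from importlib import metadata

# Distribution names are assumptions based on the package names in this thread.
PACKAGES = ("radical.pilot", "radical.ensemblemd", "radical.utils", "saga-python")


def stack_versions(packages=PACKAGES):
    """Return {distribution name: installed version or 'not installed'}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    for name, version in stack_versions().items():
        print("%-20s %s" % (name, version))
```

Pasting that output into a report alongside radicalpilot-version and ensemblemd-version should make it clear which number belongs to which package.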

marksantcroos commented 8 years ago

The content of the pilot sandbox on ARCHER would be helpful (/work/e290/e290/$USER/radical.pilot.sandbox/$RP_SESSION)

marksantcroos commented 8 years ago

Hypothesizing about possible failure modes, you could try to re-run with 48 pilot cores.

ebreitmo commented 8 years ago

a) radicalpilot-version 0.37.10

b) On ARCHER:

ls -lrt
total 1316
-rwxr-xr-x 1 ebreitmo e290  47400 Nov 24 12:01 bootstrap_1.sh
-rw-r--r-- 1 ebreitmo e290 809859 Nov 24 12:01 saga-python-0.38.1.tar.gz
-rw-r--r-- 1 ebreitmo e290 104469 Nov 24 12:01 radical.utils-0.38.tar.gz
-rw-r--r-- 1 ebreitmo e290 230829 Nov 24 12:01 radical.pilot-0.37.10.tar.gz
-rw------- 1 ebreitmo e290   2439 Nov 24 12:01 agent_0.cfg
drwx--S--- 6 ebreitmo e290   4096 Nov 24 12:03 radical.utils-0.38
drwx--S--- 5 ebreitmo e290   4096 Nov 24 12:04 saga-python-0.38.1
drwx--S--- 5 ebreitmo e290   4096 Nov 24 12:04 radical.pilot-0.37.10
drwx--S--- 5 ebreitmo e290   4096 Nov 24 12:07 rp_install
-rwxr-xr-x 1 ebreitmo e290   1692 Nov 24 12:09 bootstrap_2.sh
-rw------- 1 ebreitmo e290      5 Nov 24 12:09 agent_0.bootstrap_2.out
-rw------- 1 ebreitmo e290      0 Nov 24 12:09 agent_0.bootstrap_2.err
-rw------- 1 ebreitmo e290   3633 Nov 24 12:09 agent_0.bootstrap_3.log
-rw------- 1 ebreitmo e290    821 Nov 24 12:23 bootstrap_1.err
-rw------- 1 ebreitmo e290   1401 Nov 24 12:23 agent_0.err
-rw------- 1 ebreitmo e290  11196 Nov 24 12:23 agent_0.out
-rw------- 1 ebreitmo e290  94134 Nov 24 12:23 bootstrap_1.out

c) python extasy_gromacs_lsdmap.py --RPconfig archer.rcfg --Kconfig gromacslsdmap.wcfg

EnsembleMD (0.3.14)

Starting Allocation ok
Verifying pattern ok

Starting pattern execution ok

Executing simulation-analysis loop with 1 iterations on 48 allocated core(s) on 'epsrc.archer'

Job waiting on queue...

marksantcroos commented 8 years ago

-rw------- 1 ebreitmo e290 11196 Nov 24 12:23 agent_0.out

Can you post the contents of this file?

ebreitmo commented 8 years ago

more agent_0.out

PYTHONPATH: ['/fs4/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016763.0001-pilot
.0000/rp_install/bin', '/work/y07/y07/cse/python/modules/cython/0.21.1/lib/python2.7/site-packages/Cython-0.21.1-py2.7-linux-
x86_64.egg', '/work/e290/shared/shared_pilot_ve_20150924/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg', '/work/e29
0/shared/shared_pilot_ve_20150924/lib/python2.7/site-packages/pip-1.3-py2.7.egg', '/work/e290/e290/ebreitmo/radical.pilot.san
dbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016763.0001-pilot.0000/rp_install/lib/python2.7/site-packages', '/fs4/e2
90/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016763.0001-pilot.0000', '/work/y07/y0
7/cse/pycairo/1.10.0/lib/python2.7/site-packages', '/work/y07/y07/cse/pygobject/2.21.3/lib/python2.7/site-packages', '/work/y
07/y07/cse/pygtk/2.24.0/lib/python2.7/site-packages/gtk-2.0', '/work/y07/y07/cse/yaml/pyyaml/3.11/lib/python2.7/site-packages
', '/work/y07/y07/cse/python/modules/cython/0.21.1/lib/python2.7/site-packages', '/work/y07/y07/cse/mpi4py/1.3.1/lib/python2.
7/site-packages', '/opt/cray/sdb/1.0-1.0502.58450.3.27.ari/lib64/py', '/usr/local/packages/cse/bolt/0.6/modules', '/work/y07/
y07/cse/pygobject/2.21.3/lib/python2.7/site-packages/gtk-2.0', '/work/e290/shared/shared_pilot_ve_20150924/lib/python27.zip',
 '/work/e290/shared/shared_pilot_ve_20150924/lib/python2.7', '/work/e290/shared/shared_pilot_ve_20150924/lib/python2.7/plat-l
inux2', '/work/e290/shared/shared_pilot_ve_20150924/lib/python2.7/lib-tk', '/work/e290/shared/shared_pilot_ve_20150924/lib/py
thon2.7/lib-old', '/work/e290/shared/shared_pilot_ve_20150924/lib/python2.7/lib-dynload', '/work/y07/y07/cse/python/2.7.6/lib
/python2.7', '/work/y07/y07/cse/python/2.7.6/lib/python2.7/plat-linux2', '/work/y07/y07/cse/python/2.7.6/lib/python2.7/lib-tk
', '/work/e290/shared/shared_pilot_ve_20150924/lib/python2.7/site-packages', '/opt/cray/sdb/1.0-1.0502.58450.3.27.ari/lib64/p
y']
python: 2.7.6 (default, Mar 10 2014, 14:13:45) 
[GCC 4.8.1 20130531 (Cray Inc.)]
utils : 0.38  : /work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016763.0001-pi
lot.0000/rp_install/lib/python2.7/site-packages/radical/utils/__init__.pyc
saga  : 0.38.1 : /work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016763.0001-p
ilot.0000/rp_install/lib/python2.7/site-packages/saga/__init__.pyc
pilot : 0.37.10 : /work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016763.0001-
pilot.0000/rp_install/lib/python2.7/site-packages/radical/pilot/__init__.pyc
        type  : multicore
        gitid : $Id$

---------------------------------------------------------------------

startup agent agent_0 : /fs4/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016763.
0001-pilot.0000/agent_0.cfg
Agent config (/fs4/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016763.0001-pilot
.0000/agent_0.cfg):
{'agent_launch_method': 'APRUN',
 'agent_layout': {'agent_0': {'bridges': ['agent_staging_input_queue',
                                          'agent_scheduling_queue',
                                          'agent_executing_queue',
                                          'agent_staging_output_queue',
                                          'agent_unschedule_pubsub',
                                          'agent_reschedule_pubsub',
                                          'agent_command_pubsub',
                                          'agent_state_pubsub'],
                              'components': {'AgentExecutingComponent': 1,
                                             'AgentSchedulingComponent': 1,
                                             'AgentStagingInputComponent': 1,
                                             'AgentStagingOutputComponent': 1},
                              'pull_units': True,
                              'sub_agents': [],
                              'target': 'local'}},
 'agent_name': 'agent_0',
 'bulk_collection_time': 1.0,
 'clone': {'AgentExecutingComponent': {'input': 1, 'output': 1},
           'AgentSchedulingComponent': {'input': 1, 'output': 1},
           'AgentStagingInputComponent': {'input': 1, 'output': 1},
           'AgentStagingOutputComponent': {'input': 1, 'output': 1},
           'AgentWorker': {'input': 1, 'output': 1}},
 'cores': 24,
 'db_poll_sleeptime': 0.1,
 'debug': 40,
 'drop': {'AgentExecutingComponent': {'input': 1, 'output': 1},
          'AgentSchedulingComponent': {'input': 1, 'output': 1},
          'AgentStagingInputComponent': {'input': 1, 'output': 1},
          'AgentStagingOutputComponent': {'input': 1, 'output': 1},
          'AgentWorker': {'input': 1, 'output': 1}},
 'heartbeat_interval': 10,
 'lrms': 'PBSPRO',
 'max_io_loglength': 1024,
 'mongodb_url': 'mongodb://extasy:extasyproject@extasy-db.epcc.ed.ac.uk/radicalpilot',
 'mpi_launch_method': 'APRUN',
 'pilot_id': 'pilot.0000',
 'runtime': 20,
 'scheduler': 'CONTINUOUS',
 'session_id': 'rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016763.0001',
 'spawner': 'POPEN',
 'staging_area': 'staging_area',
 'staging_scheme': 'staging',
 'task_launch_method': 'APRUN'}

FAILED startup
  File "/work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016763.0001-pilot.0000
/rp_install/bin/radical-pilot-agent-multicore.py", line 6125, in bootstrap_3
    bridges = start_bridges(cfg, log)
  File "/work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016763.0001-pilot.0000
/rp_install/bin/radical-pilot-agent-multicore.py", line 5934, in start_bridges
    bridge     = _create_bridge(b)
  File "/work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016763.0001-pilot.0000
/rp_install/bin/radical-pilot-agent-multicore.py", line 5921, in _create_bridge
    return rpu.Queue.create(rpu.QUEUE_ZMQ, name, rpu.QUEUE_BRIDGE)
  File "/work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016763.0001-pilot.0000
/rp_install/lib/python2.7/site-packages/radical/pilot/utils/queue.py", line 201, in create
    return impl(flavor, name, role, address)
  File "/work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016763.0001-pilot.0000
/rp_install/lib/python2.7/site-packages/radical/pilot/utils/queue.py", line 434, in __init__
    raise RuntimeError ("bridge did not come up! (%s)" % e)

bootstrap_3 done
atexit
Caught SIGTERM. EXITING (<frame object at 0x14d99d0>)
  File "/work/y07/y07/cse/python/2.7.6/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016763.0001-pilot.0000
/rp_install/bin/radical-pilot-agent-multicore.py", line 6065, in exit_handler
    sys.exit(1)

cannot log error state in database!
sigterm
Caught SIGTERM. EXITING (<frame object at 0x144edb0>)
  File "/work/y07/y07/cse/python/2.7.6/lib/python2.7/threading.py", line 1100, in _exitfunc
    def _exitfunc(self):
  File "/work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016763.0001-pilot.0000
/rp_install/bin/radical-pilot-agent-multicore.py", line 6075, in sigterm_handler
    pilot_FAILED(msg='Caught SIGTERM. EXITING (%s)' % frame)
  File "/work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016763.0001-pilot.0000
/rp_install/bin/radical-pilot-agent-multicore.py", line 291, in pilot_FAILED
    print ru.get_trace()

cannot log error state in database!
sigterm
Caught SIGTERM. EXITING (<frame object at 0x1500840>)
  File "/work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016763.0001-pilot.0000
/rp_install/bin/radical-pilot-agent-multicore.py", line 6226, in <module>
    bootstrap_3()
  File "/work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016763.0001-pilot.0000
/rp_install/bin/radical-pilot-agent-multicore.py", line 6125, in bootstrap_3
    bridges = start_bridges(cfg, log)
  File "/work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016763.0001-pilot.0000
/rp_install/bin/radical-pilot-agent-multicore.py", line 5934, in start_bridges
    bridge     = _create_bridge(b)
  File "/work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016763.0001-pilot.0000
/rp_install/bin/radical-pilot-agent-multicore.py", line 5921, in _create_bridge
    return rpu.Queue.create(rpu.QUEUE_ZMQ, name, rpu.QUEUE_BRIDGE)
  File "/work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016763.0001-pilot.0000
/rp_install/lib/python2.7/site-packages/radical/pilot/utils/queue.py", line 201, in create
    return impl(flavor, name, role, address)
  File "/work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016763.0001-pilot.0000
/rp_install/lib/python2.7/site-packages/radical/pilot/utils/queue.py", line 429, in __init__
    self._p.start()
  File "/work/y07/y07/cse/python/2.7.6/lib/python2.7/multiprocessing/process.py", line 130, in start
    self._popen = Popen(self)
  File "/work/y07/y07/cse/python/2.7.6/lib/python2.7/multiprocessing/forking.py", line 126, in __init__
    code = process_obj._bootstrap()
  File "/work/y07/y07/cse/python/2.7.6/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/work/y07/y07/cse/python/2.7.6/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016763.0001-pilot.0000
/rp_install/lib/python2.7/site-packages/radical/pilot/utils/queue.py", line 417, in _bridge
    events = dict(_uninterruptible(_poll.poll, 1000)) # timeout in ms
  File "/work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016763.0001-pilot.0000
/rp_install/lib/python2.7/site-packages/radical/pilot/utils/queue.py", line 42, in _uninterruptible
    return f(*args, **kwargs)
  File "/work/e290/shared/shared_pilot_ve_20150924/lib/python2.7/site-packages/zmq/sugar/poll.py", line 101, in poll
    return zmq_poll(self.sockets, timeout=timeout)
  File "/work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016763.0001-pilot.0000
/rp_install/bin/radical-pilot-agent-multicore.py", line 6075, in sigterm_handler
    pilot_FAILED(msg='Caught SIGTERM. EXITING (%s)' % frame)
  File "/work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016763.0001-pilot.0000
/rp_install/bin/radical-pilot-agent-multicore.py", line 291, in pilot_FAILED
    print ru.get_trace()

cannot log error state in database!
sigterm

update: formatted by marksantcroos

marksantcroos commented 8 years ago

raise RuntimeError ("bridge did not come up! (%s)" % e)

This is the (semi) root cause. Let's see what the new run does, to exclude a temporary issue.
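To find this kind of line quickly in a long agent_0.out, a small scan like the following can help. It is only a sketch: raise_lines is a hypothetical helper, and it relies on the agent printing the source line of each traceback frame, as it does in the output quoted in this thread.

```python
def raise_lines(log_text):
    """Return (line number, text) for every line that is a raise statement."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("raise "):
            hits.append((number, stripped))
    return hits


if __name__ == "__main__":
    # Assumes the file has been copied from the pilot sandbox to the current directory.
    with open("agent_0.out") as handle:
        for number, text in raise_lines(handle.read()):
            print("%d: %s" % (number, text))
```

The first hit is usually the most interesting one, since the later SIGTERM traces are a consequence of the shutdown rather than its cause.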

ebreitmo commented 8 years ago
  python extasy_gromacs_lsdmap.py --RPconfig archer.rcfg --Kconfig gromacslsdmap.wcfg

================================================================================
 EnsembleMD (0.3.14)                                                            
================================================================================

Starting Allocation                                                           ok
        Verifying pattern                                                     ok
        Starting pattern execution                                            ok
--------------------------------------------------------------------------------
Executing simulation-analysis loop with 1 iterations on 48 allocated core(s) on 'epsrc.archer'

Job waiting on queue...
Job is now running !
Waiting for pre_loop step to complete.                                      done
Iteration 1: Waiting for simulation tasks: md.gromacs to complete           done
Iteration 1: Waiting for analysis tasks: md.pre_lsdmap to complete2015-11-26 18:12:37,547: radical.enmd.simulation_analysis_loop.static.default: MainProcess                     : Thread-3       : ERROR   : ComputeUnit error: STDERR: gromacs(3):ERROR:105: Unable to locate a modulefile for 'gromacs/5.0.0'
sh: trjconv: command not found
, STDOUT: gromacs(3):ERROR:105: Unable to locate a modulefile for 'gromacs/5.0.0'
sh: trjconv: command not found

2015-11-26 18:12:37,548: radical.enmd.simulation_analysis_loop.static.default: MainProcess                     : Thread-3       : ERROR   : Pattern execution FAILED.
2015-11-26 18:12:37,548: radical.pilot       : MainProcess                     : Thread-3       : ERROR   : unit manager controller thread caught system exit -- forcing application shutdown
Traceback (most recent call last):
  File "/Users/elenabreitmoser/24Nov/lib/python2.7/site-packages/radical/pilot/controller/unit_manager_controller.py", line 261, in run
    self.call_unit_state_callbacks(unit_id, new_state)
  File "/Users/elenabreitmoser/24Nov/lib/python2.7/site-packages/radical/pilot/controller/unit_manager_controller.py", line 198, in call_unit_state_callbacks
    cb(self._shared_data[unit_id]['facade_object'], new_state)
  File "/Users/elenabreitmoser/24Nov/lib/python2.7/site-packages/radical/ensemblemd/exec_plugins/simulation_analysis_loop/static.py", line 141, in unit_state_cb
    sys.exit(1)
SystemExit: 1
Execution interuptedTraceback (most recent call last):
  File "/Users/elenabreitmoser/24Nov/lib/python2.7/site-packages/radical/ensemblemd/exec_plugins/simulation_analysis_loop/static.py", line 741, in execute_pattern
    resource._umgr.wait_units()
  File "/Users/elenabreitmoser/24Nov/lib/python2.7/site-packages/radical/pilot/unit_manager.py", line 697, in wait_units
    time.sleep (0.5)
KeyboardInterrupt

        Starting Deallocation                               

========================================
total 1408
-rw-r--r-- 1 ebreitmo e290 809859 Nov 26 14:35 saga-python-0.38.1.tar.gz
-rw-r--r-- 1 ebreitmo e290 104469 Nov 26 14:35 radical.utils-0.38.tar.gz
-rwxr-xr-x 1 ebreitmo e290  47400 Nov 26 14:35 bootstrap_1.sh
-rw-r--r-- 1 ebreitmo e290 230829 Nov 26 14:35 radical.pilot-0.37.10.tar.gz
-rw------- 1 ebreitmo e290   2439 Nov 26 14:35 agent_0.cfg
drwx--S--- 6 ebreitmo e290   4096 Nov 26 18:09 radical.utils-0.38
drwx--S--- 5 ebreitmo e290   4096 Nov 26 18:10 saga-python-0.38.1
drwx--S--- 5 ebreitmo e290   4096 Nov 26 18:10 radical.pilot-0.37.10
drwx--S--- 5 ebreitmo e290   4096 Nov 26 18:10 rp_install
-rw------- 1 ebreitmo e290    737 Nov 26 18:10 bootstrap_1.err
-rwxr-xr-x 1 ebreitmo e290   1692 Nov 26 18:11 bootstrap_2.sh
-rw------- 1 ebreitmo e290      5 Nov 26 18:11 agent_0.bootstrap_2.out
-rw------- 1 ebreitmo e290      0 Nov 26 18:11 agent_0.bootstrap_2.err
-rw------- 1 ebreitmo e290    608 Nov 26 18:11 agent_0.bootstrap_3.log
-rw------- 1 ebreitmo e290      0 Nov 26 18:11 agent_0.AgentWorker.0.log
-rw------- 1 ebreitmo e290      0 Nov 26 18:11 agent_0.SchedulerContinuous.0.log
-rw------- 1 ebreitmo e290      0 Nov 26 18:11 agent_0.SchedulerContinuous.0.child.log
-rw------- 1 ebreitmo e290      0 Nov 26 18:11 agent_0.AgentUpdateWorker.0.log
-rw------- 1 ebreitmo e290      0 Nov 26 18:11 agent_0.AgentUpdateWorker.0.child.log
-rw------- 1 ebreitmo e290      0 Nov 26 18:11 agent_0.AgentStagingOutputComponent.0.log
-rw------- 1 ebreitmo e290      0 Nov 26 18:11 agent_0.AgentStagingInputComponent.0.log
-rw------- 1 ebreitmo e290      0 Nov 26 18:11 agent_0.AgentStagingInputComponent.0.child.log
-rw------- 1 ebreitmo e290      0 Nov 26 18:11 agent_0.AgentHeartbeatWorker.0.log
-rw------- 1 ebreitmo e290      0 Nov 26 18:11 agent_0.AgentHeartbeatWorker.0.child.log
-rw------- 1 ebreitmo e290      0 Nov 26 18:11 agent_0.AgentExecutingComponent_POPEN.0.log
-rw------- 1 ebreitmo e290      0 Nov 26 18:11 agent_0.AgentExecutingComponent_POPEN.0.child.log
drwxr-sr-x 3 ebreitmo e290   4096 Nov 26 18:11 unit.000000
drwx--S--- 2 ebreitmo e290   4096 Nov 26 18:11 unit.000003
drwx--S--- 2 ebreitmo e290   4096 Nov 26 18:11 unit.000001
drwx--S--- 2 ebreitmo e290   4096 Nov 26 18:11 unit.000008
drwx--S--- 2 ebreitmo e290   4096 Nov 26 18:11 unit.000004
drwx--S--- 2 ebreitmo e290   4096 Nov 26 18:12 unit.000007
drwx--S--- 2 ebreitmo e290   4096 Nov 26 18:12 unit.000002
drwx--S--- 2 ebreitmo e290   4096 Nov 26 18:12 unit.000006
drwx--S--- 2 ebreitmo e290   4096 Nov 26 18:12 unit.000005
drwx--S--- 2 ebreitmo e290   4096 Nov 26 18:12 unit.000009
-rw------- 1 ebreitmo e290   5281 Nov 26 18:12 pilot.0000.log.tgz
-rw------- 1 ebreitmo e290  94075 Nov 26 18:12 bootstrap_1.out
-rw------- 1 ebreitmo e290  21987 Nov 26 18:12 agent_0.out
-rw------- 1 ebreitmo e290   1424 Nov 26 18:12 agent_0.err
-rw------- 1 ebreitmo e290   2863 Nov 26 18:12 agent_0.AgentWorker.0.child.log
-rw------- 1 ebreitmo e290   2603 Nov 26 18:12 agent_0.AgentStagingOutputComponent.0.child.log
-rw------- 1 ebreitmo e290  20703 Nov 26 18:12 agent_0.AgentExecutingWatcher_POPEN.0.log
================================================
more agent_0.out
...
File "/work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016765.0000-pilot
.0000/rp_install/bin/radical-pilot-agent-multicore.py", line 291, in pilot_FAILED
    print ru.get_trace()

cannot log error state in database!
sigterm
Caught SIGTERM. EXITING (<frame object at 0x1522360>)
...
ebreitmo commented 8 years ago

BTW, re: gromacs(3):ERROR:105: Unable to locate a modulefile for 'gromacs/5.0.0': there is no gromacs-5.0.0 on ARCHER any more. I thought this had already been taken care of?!

ebreitmo commented 8 years ago

Hi, any news on this one?

marksantcroos commented 8 years ago

As it's now down to a GROMACS issue, that's not really something I can comment on.

ibethune commented 8 years ago

Elena, I think that to get the fix for the gromacs module load you need to install ensemblemd from the master branch, i.e.:

(extasy-test)mbp-ib:extasy-test ibethune$ pip install --upgrade git+https://github.com/radical-cybertools/radical.ensemblemd.git@master#egg=radical.ensemblemd
(extasy-test)mbp-ib:extasy-test ibethune$ ensemblemd-version 
0.3.6-78-g621a019
ebreitmo commented 8 years ago

Hm, I did this, still the same:

python extasy_gromacs_lsdmap.py --RPconfig archer.rcfg --Kconfig gromacslsdmap.wcfg

================================================================================
 EnsembleMD (0.3.6-78-g621a019)                                                 
================================================================================

Starting Allocation                                                           ok
        Verifying pattern                                                     ok
        Starting pattern execution                                            ok
--------------------------------------------------------------------------------
Executing simulation-analysis loop with 1 iterations on 48 allocated core(s) on 'epsrc.archer'

Job waiting on queue...
Job is now running !
Waiting for pre_loop step to complete.                                      done
Iteration 1: Waiting for simulation tasks: md.gromacs to complete           done
Iteration 1: Waiting for analysis tasks: md.pre_lsdmap to complete2015-12-02 10:02:46,231: radical.enmd.simulation_analysis_loop.static.default: MainProcess                     : Thread-3       : ERROR   : ComputeUnit error: STDERR: gromacs(3):ERROR:105: Unable to locate a modulefile for 'gromacs/5.0.0'
sh: trjconv: command not found
, STDOUT: gromacs(3):ERROR:105: Unable to locate a modulefile for 'gromacs/5.0.0'
sh: trjconv: command not found

2015-12-02 10:02:46,231: radical.enmd.simulation_analysis_loop.static.default: MainProcess                     : Thread-3       : ERROR   : Pattern execution FAILED.
2015-12-02 10:02:46,231: radical.pilot       : MainProcess                     : Thread-3       : ERROR   : unit manager controller thread caught system exit -- forcing application shutdown
Traceback (most recent call last):
  File "/Users/elenabreitmoser/24Nov/lib/python2.7/site-packages/radical/pilot/controller/unit_manager_controller.py", line 261, in run
    self.call_unit_state_callbacks(unit_id, new_state)
  File "/Users/elenabreitmoser/24Nov/lib/python2.7/site-packages/radical/pilot/controller/unit_manager_controller.py", line 198, in call_unit_state_callbacks
    cb(self._shared_data[unit_id]['facade_object'], new_state)
  File "/Users/elenabreitmoser/24Nov/lib/python2.7/site-packages/radical/ensemblemd/exec_plugins/simulation_analysis_loop/static.py", line 141, in unit_state_cb
    sys.exit(1)
SystemExit: 1
Execution interuptedTraceback (most recent call last):
  File "/Users/elenabreitmoser/24Nov/lib/python2.7/site-packages/radical/ensemblemd/exec_plugins/simulation_analysis_loop/static.py", line 743, in execute_pattern
    resource._umgr.wait_units(uids)
  File "/Users/elenabreitmoser/24Nov/lib/python2.7/site-packages/radical/pilot/unit_manager.py", line 697, in wait_units
    time.sleep (0.5)
KeyboardInterrupt

        Starting DeallocationResource allocation cancelled. You probably ran out of walltime
        done 
vivek-bala commented 8 years ago

In master, it shouldn't load gromacs/5.0.0 (https://github.com/radical-cybertools/radical.ensemblemd/blob/master/src/radical/ensemblemd/kernel_plugins/md/gromacs.py#L52). Could you try it with a fresh installation (and not "pip install --upgrade ...")?

vivek-bala commented 8 years ago

Another thing to be fixed is the version number. Master should print 0.3.14 (and not 0.3.6-78-g621a019). I will check that.
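For context, the string 0.3.6-78-g621a019 has the shape of git describe output: the nearest reachable tag (0.3.6), the number of commits since that tag (78), and the abbreviated commit hash (621a019). Read that way, it would mean the install is 78 commits past a 0.3.6 tag, not that it is an old release. A minimal sketch of taking such a string apart follows; parse_version is a hypothetical helper, and that ensemblemd-version derives its output from git describe is an assumption.

```python
import re

# <tag>-<commits since tag>-g<abbreviated commit hash>, as produced by `git describe`.
DESCRIBE_RE = re.compile(r"^(?P<tag>.+)-(?P<commits>\d+)-g(?P<commit>[0-9a-f]+)$")


def parse_version(version):
    """Split a version string into tag, commits since tag, and commit hash.

    A plain release string such as '0.3.14' comes back with zero commits
    and no hash.
    """
    match = DESCRIBE_RE.match(version)
    if match is None:
        return {"tag": version, "commits": 0, "commit": None}
    return {
        "tag": match.group("tag"),
        "commits": int(match.group("commits")),
        "commit": match.group("commit"),
    }


if __name__ == "__main__":
    print(parse_version("0.3.6-78-g621a019"))
    print(parse_version("0.3.14"))
```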

vivek-bala commented 8 years ago

On ARCHER, after I load the default gromacs:

vb224@eslogin006:~> env | grep GMX
GMX_DIR=/work/y07/y07/gmx/5.1-phase2
GMXLIB=/work/y07/y07/gmx/5.1-phase2/share/gromacs/top
GMX_INCLUDE_OPTS=/work/y07/y07/gmx/5.1-phase2/include
vb224@eslogin006:~> cd /work/y07/y07/gmx/5.1-phase2
vb224@eslogin006:/work/y07/y07/gmx/5.1-phase2> cd bin/
vb224@eslogin006:/work/y07/y07/gmx/5.1-phase2/bin> ls
demux.pl  gmx-completion.bash      gmx-completion-gmx_d.bash      gmx-completion-mdrun_mpi_d.bash  GMXRC       GMXRC.csh  mdrun_mpi    xplor2gmx.pl
gmx       gmx-completion-gmx.bash  gmx-completion-mdrun_mpi.bash  gmx_d                            GMXRC.bash  GMXRC.zsh  mdrun_mpi_d

I am not sure what the analogous commands are for the "grompp" and "mdrun" in gromacs/5.1.1.

vivek-bala commented 8 years ago

Also, I would expect the simulations to have failed (no data produced). I'll see why no error was reported.
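One way a simulation step could surface this kind of silent failure is to check its expected output files before the analysis step starts. This is a minimal sketch, assuming the expected file names are known up front; empty_or_missing is a hypothetical helper and not part of EnsembleMD, and out.gro is used only as an example file name.

```python
import os


def empty_or_missing(paths):
    """Return the subset of paths that do not exist or have zero size."""
    bad = []
    for path in paths:
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            bad.append(path)
    return bad


if __name__ == "__main__":
    problems = empty_or_missing(["out.gro"])
    if problems:
        raise SystemExit("simulation produced no data: %s" % ", ".join(problems))
```

Exiting non-zero from the task when the list is not empty would let the failure show up at the simulation step instead of at the following analysis step.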

ebreitmo commented 8 years ago
pip install radical.ensemblemd
python extasy_gromacs_lsdmap.py --RPconfig archer.rcfg --Kconfig gromacslsdmap.wcfg

================================================================================
 EnsembleMD (0.3.14)                                                            
================================================================================

Starting Allocation                                                           ok
        Verifying pattern                                                     ok
        Starting pattern execution                                            ok
--------------------------------------------------------------------------------
Executing simulation-analysis loop with 1 iterations on 24 allocated core(s) on 'epsrc.archer'

Job waiting on queue...
Job is now running !
Waiting for pre_loop step to complete.                                      done
Iteration 1: Waiting for simulation tasks: md.gromacs to complete           done
Iteration 1: Waiting for analysis tasks: md.pre_lsdmap to complete2015-12-10 16:03:16,474: radical.enmd.simulation_analysis_loop.static.default: MainProcess                     : Thread-3       : ERROR   : ComputeUnit error: STDERR: gromacs(3):ERROR:105: Unable to locate a modulefile for 'gromacs/5.0.0'
sh: trjconv: command not found
, STDOUT: gromacs(3):ERROR:105: Unable to locate a modulefile for 'gromacs/5.0.0'
sh: trjconv: command not found

2015-12-10 16:03:16,474: radical.enmd.simulation_analysis_loop.static.default: MainProcess                     : Thread-3       : ERROR   : Pattern execution FAILED.
2015-12-10 16:03:16,474: radical.pilot       : MainProcess                     : Thread-3       : ERROR   : unit manager controller thread caught system exit -- forcing application shutdown
Traceback (most recent call last):
  File "/Users/elenabreitmoser/10Dec/lib/python2.7/site-packages/radical/pilot/controller/unit_manager_controller.py", line 261, in run
    self.call_unit_state_callbacks(unit_id, new_state)
  File "/Users/elenabreitmoser/10Dec/lib/python2.7/site-packages/radical/pilot/controller/unit_manager_controller.py", line 198, in call_unit_state_callbacks
    cb(self._shared_data[unit_id]['facade_object'], new_state)
  File "/Users/elenabreitmoser/10Dec/lib/python2.7/site-packages/radical/ensemblemd/exec_plugins/simulation_analysis_loop/static.py", line 141, in unit_state_cb
    sys.exit(1)
SystemExit: 1
Execution interuptedTraceback (most recent call last):
  File "/Users/elenabreitmoser/10Dec/lib/python2.7/site-packages/radical/ensemblemd/exec_plugins/simulation_analysis_loop/static.py", line 741, in execute_pattern
    resource._umgr.wait_units()
  File "/Users/elenabreitmoser/10Dec/lib/python2.7/site-packages/radical/pilot/unit_manager.py", line 697, in wait_units
    time.sleep (0.5)
KeyboardInterrupt

        Starting Deallocation                                               done 
ls -lrt  unit.000007
total 32
lrwxrwxrwx 1 ebreitmo e290  139 Dec 10 16:02 topol.top -> /work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016779.0000-pilot.0000/unit.000000/topol.top
lrwxrwxrwx 1 ebreitmo e290  145 Dec 10 16:02 start.gro -> /work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016779.0000-pilot.0000/unit.000000/temp/start6.gro
lrwxrwxrwx 1 ebreitmo e290  136 Dec 10 16:02 run.py -> /work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016779.0000-pilot.0000/unit.000000/run.py
-rwx------ 1 ebreitmo e290  780 Dec 10 16:02 radical_pilot_cu_launch_script.sh
lrwxrwxrwx 1 ebreitmo e290  140 Dec 10 16:02 grompp.mdp -> /work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016779.0000-pilot.0000/unit.000000/grompp.mdp
-rw------- 1 ebreitmo e290  788 Dec 10 16:02 run.sh
-rw------- 1 ebreitmo e290   95 Dec 10 16:02 STDOUT
-rw------- 1 ebreitmo e290 1620 Dec 10 16:02 STDERR
-rw------- 1 ebreitmo e290    0 Dec 10 16:02 out.gro
vivek-bala commented 8 years ago

Is the out.gro file empty or does it have contents?

ebreitmo commented 8 years ago

empty

vivek-bala commented 8 years ago

OK, the executables need to be changed to gmx grompp and gmx mdrun (gromacs 5.1.*).
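In GROMACS 5.x the individual tools are subcommands of the gmx wrapper, and as far as I know the 5.1 series no longer installs the old standalone names, which would also explain the "trjconv: command not found" message. A kernel could pick the command form at run time along these lines. This is a sketch only: gromacs_command is a hypothetical helper, not the EnsembleMD kernel code, and it does not handle the separate mdrun_mpi binary that appears in the ARCHER bin listing.

```python
import shutil


def gromacs_command(tool, which=shutil.which):
    """Return the argv prefix for a GROMACS tool such as 'grompp' or 'trjconv'.

    Prefers the 'gmx <tool>' wrapper form and falls back to a standalone
    binary named after the tool. `which` is injectable for testing.
    """
    if which("gmx") is not None:
        return ["gmx", tool]
    if which(tool) is not None:
        return [tool]
    raise RuntimeError("neither 'gmx' nor '%s' found on PATH" % tool)


if __name__ == "__main__":
    for tool in ("grompp", "trjconv"):
        try:
            print(" ".join(gromacs_command(tool)))
        except RuntimeError as error:
            print(error)
```

Failing early with a clear message when neither form is on PATH would be more helpful than the shell's "command not found" buried in a unit's STDERR.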

ebreitmo commented 8 years ago

Did you update it? Today I re-installed from scratch and get something new...

python extasy_gromacs_lsdmap.py --RPconfig archer.rcfg --Kconfig gromacslsdmap.wcfg

================================================================================
 EnsembleMD (0.3.14)                                                            
================================================================================

Starting Allocation2015-12-14 10:59:13,913: radical.pilot       : MainProcess                     : Thread-1       : ERROR   : Couldn't call manager callback (no pilot instance)
2015-12-14 10:59:14,019: radical.enmd.SingleClusterEnvironment: MainProcess                     : MainThread     : ERROR   : Fatal error during resource allocation: too many namespaces/collections.
Traceback (most recent call last):
  File "/Users/elenabreitmoser/14Dec/lib/python2.7/site-packages/radical/ensemblemd/single_cluster_environment.py", line 188, in allocate
    scheduler=radical.pilot.SCHED_DIRECT_SUBMISSION)
  File "/Users/elenabreitmoser/14Dec/lib/python2.7/site-packages/radical/pilot/unit_manager.py", line 121, in __init__
    session=self._session)
  File "/Users/elenabreitmoser/14Dec/lib/python2.7/site-packages/radical/pilot/controller/unit_manager_controller.py", line 80, in __init__
    output_transfer_workers=output_transfer_workers)
  File "/Users/elenabreitmoser/14Dec/lib/python2.7/site-packages/radical/pilot/db/database.py", line 541, in insert_unit_manager
    "output_transfer_workers": output_transfer_workers }
  File "/Users/elenabreitmoser/14Dec/lib/python2.7/site-packages/pymongo/collection.py", line 410, in insert
    _check_write_command_response(results)
  File "/Users/elenabreitmoser/14Dec/lib/python2.7/site-packages/pymongo/helpers.py", line 209, in _check_write_command_response
    raise OperationFailure(error.get("errmsg"), error.get("code"), error)
OperationFailure: too many namespaces/collections
Allocation failed: too many namespaces/collections
marksantcroos commented 8 years ago

Allocation failed: too many namespaces/collections

This means that the database you are using is full.

Is the database set by the user or is that an extasy wide configuration?

The solution is either to clean up, or use a different namespace.

ebreitmo commented 8 years ago

In archer.rcfg

DBURL = 'mongodb://extasy:extasyproject@extasy-db.epcc.ed.ac.uk/radicalpilot' #MongoDB link to be used for coordination purposes

marksantcroos commented 8 years ago

Thanks.

So this looks like a shared DB. Either somebody needs to clean up that DB, or individual accounts need to be created. In principle, if you had an account on that DB, you could use the following:

DBURL = 'mongodb://user:password@extasy-db.epcc.ed.ac.uk/elena'
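
The only part of the URL that changes between the two settings is the database name at the end. A minimal sketch of swapping it programmatically, assuming Python 3 (urllib.parse); the helper name with_database is hypothetical:

```python
from urllib.parse import urlsplit, urlunsplit


def with_database(dburl, database):
    """Return `dburl` with its database name (the URL path) replaced."""
    parts = urlsplit(dburl)
    # credentials and host live in netloc and are carried over unchanged
    return urlunsplit((parts.scheme, parts.netloc, "/" + database,
                       parts.query, parts.fragment))
```

This only rewrites the string; the account and database still have to exist on the server.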
ebreitmo commented 8 years ago

Thanks, it's cleared now and I am back to the original issue.

vivek-bala commented 8 years ago

I have made the change in the helper scripts. The executable is now gmx. Please note that if you get "Unable to locate a modulefile for 'gromacs/5.0.0'", you may not have installed enmd properly or may be on the wrong branch. Please use the master branch, which loads the default version on archer.

ebreitmo commented 8 years ago

Thanks Vivek, I will re-install. I am following the documentation and running pip install radical.ensemblemd

ebreitmo commented 8 years ago

more agent_0.bootstrap_3.log

...
Traceback (most recent call last):
  File "/work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016784.0000-pilot.0000/rp_install/bin/radical-pilot-agent-multicore.py", line 6125, in bootstrap_3
    bridges = start_bridges(cfg, log)
  File "/work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016784.0000-pilot.0000/rp_install/bin/radical-pilot-agent-multicore.py", line 5934, in start_bridges
    bridge     = _create_bridge(b)
  File "/work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016784.0000-pilot.0000/rp_install/bin/radical-pilot-agent-multicore.py", line 5921, in _create_bridge
    return rpu.Queue.create(rpu.QUEUE_ZMQ, name, rpu.QUEUE_BRIDGE)
  File "/work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016784.0000-pilot.0000/rp_install/lib/python2.7/site-packages/radical/pilot/utils/queue.py", line 201, in create
    return impl(flavor, name, role, address)
  File "/work/e290/e290/ebreitmo/radical.pilot.sandbox/rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016784.0000-pilot.0000/rp_install/lib/python2.7/site-packages/radical/pilot/utils/queue.py", line 434, in __init__
    raise RuntimeError ("bridge did not come up! (%s)" % e)
RuntimeError: bridge did not come up! ()
python extasy_gromacs_lsdmap.py --RPconfig archer.rcfg --Kconfig gromacslsdmap.wcfg

================================================================================
 EnsembleMD (0.3.14)                                                            
================================================================================

Starting Allocation                                                           ok
        Verifying pattern                                                     ok
        Starting pattern execution                                            ok
--------------------------------------------------------------------------------
Executing simulation-analysis loop with 1 iterations on 24 allocated core(s) on 'epsrc.archer'

Job waiting on queue...2015-12-15 10:38:19,218: radical.enmd.SingleClusterEnvironment: MainProcess                     : Thread-1       : ERROR   : Resource error: real 1450175889.884460 sec | user 0.196 sec | system 0.616 sec | mem 40044.00 kB
2015-12-15 10:38:19,218: radical.enmd.SingleClusterEnvironment: MainProcess                     : Thread-1       : ERROR   : Pattern execution FAILED.
2015-12-15 10:38:19,218: radical.pilot       : MainProcess                     : Thread-1       : ERROR   : pilot manager controller thread caught system exit -- forcing application shutdown
Traceback (most recent call last):
  File "/Users/elenabreitmoser/15Dec/lib/python2.7/site-packages/radical/pilot/controller/pilot_manager_controller.py", line 335, in run
    self.call_callbacks(pilot_id, new_state)
  File "/Users/elenabreitmoser/15Dec/lib/python2.7/site-packages/radical/pilot/controller/pilot_manager_controller.py", line 258, in call_callbacks
    cb(self._shared_data[pilot_id]['facade_object'](), new_state)
  File "/Users/elenabreitmoser/15Dec/lib/python2.7/site-packages/radical/ensemblemd/single_cluster_environment.py", line 127, in pilot_state_cb
    sys.exit(2)
SystemExit: 2
2015-12-15 10:38:19,527: radical.enmd.SingleClusterEnvironment: MainProcess                     : MainThread     : ERROR   : Fatal error during execution: .
Fatal error during execution: .Starting Deallocation
vivek-bala commented 8 years ago
raise RuntimeError ("bridge did not come up! (%s)" % e)
RuntimeError: bridge did not come up! ()

The pilot seems to have failed while the job was still pending.

andre-merzky commented 8 years ago

Elena,

could you please make the log files etc. readable in the sandbox, or copy them somewhere? I would like to have a look at those. Indeed, the pilot seems to be failing, as it could not bring up its communication infrastructure...

ebreitmo commented 8 years ago

I copied everything to /epsrc/e290/e290/ebreitmo on ARCHER.

vivek-bala commented 8 years ago

Could you please also set read permissions on all the files? I get "permission denied" when trying to read them.
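
Opening up a copied sandbox for other users can be done in one pass. A minimal sketch, roughly what chmod -R go+rX does on the command line; the helper name make_world_readable is hypothetical:

```python
import os
import stat


def make_world_readable(top):
    """Add group/other read permission to everything below `top`.

    Directories also get group/other execute permission so that they
    can be entered. Existing permission bits are kept.
    """
    read_bits = stat.S_IRGRP | stat.S_IROTH
    dir_bits = read_bits | stat.S_IXGRP | stat.S_IXOTH
    for root, _dirs, files in os.walk(top):
        targets = [(root, dir_bits)]
        targets += [(os.path.join(root, f), read_bits) for f in files]
        for path, bits in targets:
            if os.path.islink(path):
                continue  # a symlink's access is decided by its target
            mode = stat.S_IMODE(os.stat(path).st_mode)
            os.chmod(path, mode | bits)
```

The parent directories leading to the copy must be traversable as well, which this sketch does not touch.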

ebreitmo commented 8 years ago

Oops, sorry - hopefully ok now.

andre-merzky commented 8 years ago

Thanks Elena! Alas, I could not really find anything helpful. The RP tests seem to run on archer just fine (with RP devel branch). Vivek, would it be possible to isolate the issue, or to create a step-by-step guide to reproduce it?

vivek-bala commented 8 years ago

Yup, I'll try to run the same on archer and see if I can reproduce this.

vivek-bala commented 8 years ago

I have attempted this multiple times, but am not able to reproduce this specific error ("bridge did not come up"). Not sure how to debug this.

@ebreitmo are you able to run any simple RP scripts?

vivek-bala commented 8 years ago

Also, could you check if you get this on Stampede? It might help isolate the issue.

ebreitmo commented 8 years ago

I have no account on Stampede.

ebreitmo commented 8 years ago

Is there any script in particular I should run for testing?

vivek-bala commented 8 years ago

You can try this: https://github.com/radical-cybertools/radical.pilot/blob/devel/examples/docs/simple_bot.py

vivek-bala commented 8 years ago

Changes to the script:

c = radical.pilot.Context('ssh')
c.user_id = "<username>"
session.add_context(c)
pdesc.resource = "epsrc.archer"
pdesc.cores    = 24
pdesc.runtime  = 20    # minutes
ebreitmo commented 8 years ago
python simple_bot.py
  File "simple_bot.py", line 151
    finally:
          ^
SyntaxError: invalid syntax
andre-merzky commented 8 years ago

Hmm, I do not see any syntax error there. Can you run the following:

python examples/03_multiple_pilots.py epsrc.archer

No changes to the code are needed, assuming you have passwordless ssh set up on the command line.

Thanks!

ebreitmo commented 8 years ago
python examples/03_multiple_pilots.py 

================================================================================
 Getting Started (RP version 0.37.10)                                           
================================================================================

create session rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016787.0000    ok
read config                                                                   ok

--------------------------------------------------------------------------------
submit pilots                                                                   

create pilot manager                                                          ok
create pilot descriptions                                                      \
create pilot description [local.localhost:64]                                 ok
                                                                              ok
submit 1 pilot(s) .                                                           ok

--------------------------------------------------------------------------------
submit units                                                                    

create unit manager                                                           ok
add 1 pilot(s)                                                                ok
create 128 unit description(s)
        ........................................................................
        ........................................................              ok
submit 128 unit(s)
        ........................................................................
        ........................................................              ok

--------------------------------------------------------------------------------
gather results                                                                  

wait for 128 unit(s)
        +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++|
        +++++++++++++++++++++++++++++++++++++++++++++++++++++++++             ok

  * unit.000000: Done, exit:   0, out: pilot.0000
  * unit.000001: Done, exit:   0, out: pilot.0000
  * unit.000002: Done, exit:   0, out: pilot.0000
  * unit.000003: Done, exit:   0, out: pilot.0000
  * unit.000004: Done, exit:   0, out: pilot.0000
  * unit.000005: Done, exit:   0, out: pilot.0000
  * unit.000006: Done, exit:   0, out: pilot.0000
  * unit.000007: Done, exit:   0, out: pilot.0000
  * unit.000008: Done, exit:   0, out: pilot.0000
  * unit.000009: Done, exit:   0, out: pilot.0000
  * unit.000010: Done, exit:   0, out: pilot.0000
  * unit.000011: Done, exit:   0, out: pilot.0000
  * unit.000012: Done, exit:   0, out: pilot.0000
  * unit.000013: Done, exit:   0, out: pilot.0000
  * unit.000014: Done, exit:   0, out: pilot.0000
  * unit.000015: Done, exit:   0, out: pilot.0000
  * unit.000016: Done, exit:   0, out: pilot.0000
  * unit.000017: Done, exit:   0, out: pilot.0000
  * unit.000018: Done, exit:   0, out: pilot.0000
  * unit.000019: Done, exit:   0, out: pilot.0000
  * unit.000020: Done, exit:   0, out: pilot.0000
  * unit.000021: Done, exit:   0, out: pilot.0000
  * unit.000022: Done, exit:   0, out: pilot.0000
  * unit.000023: Done, exit:   0, out: pilot.0000
  * unit.000024: Done, exit:   0, out: pilot.0000
  * unit.000025: Done, exit:   0, out: pilot.0000
  * unit.000026: Done, exit:   0, out: pilot.0000
  * unit.000027: Done, exit:   0, out: pilot.0000
  * unit.000028: Done, exit:   0, out: pilot.0000
  * unit.000029: Done, exit:   0, out: pilot.0000
  * unit.000030: Done, exit:   0, out: pilot.0000
  * unit.000031: Done, exit:   0, out: pilot.0000
  * unit.000032: Done, exit:   0, out: pilot.0000
  * unit.000033: Done, exit:   0, out: pilot.0000
  * unit.000034: Done, exit:   0, out: pilot.0000
  * unit.000035: Done, exit:   0, out: pilot.0000
  * unit.000036: Done, exit:   0, out: pilot.0000
  * unit.000037: Done, exit:   0, out: pilot.0000
  * unit.000038: Done, exit:   0, out: pilot.0000
  * unit.000039: Done, exit:   0, out: pilot.0000
  * unit.000040: Done, exit:   0, out: pilot.0000
  * unit.000041: Done, exit:   0, out: pilot.0000
  * unit.000042: Done, exit:   0, out: pilot.0000
  * unit.000043: Done, exit:   0, out: pilot.0000
  * unit.000044: Done, exit:   0, out: pilot.0000
  * unit.000045: Done, exit:   0, out: pilot.0000
  * unit.000046: Done, exit:   0, out: pilot.0000
  * unit.000047: Done, exit:   0, out: pilot.0000
  * unit.000048: Done, exit:   0, out: pilot.0000
  * unit.000049: Done, exit:   0, out: pilot.0000
  * unit.000050: Done, exit:   0, out: pilot.0000
  * unit.000051: Done, exit:   0, out: pilot.0000
  * unit.000052: Done, exit:   0, out: pilot.0000
  * unit.000053: Done, exit:   0, out: pilot.0000
  * unit.000054: Done, exit:   0, out: pilot.0000
  * unit.000055: Done, exit:   0, out: pilot.0000
  * unit.000056: Done, exit:   0, out: pilot.0000
  * unit.000057: Done, exit:   0, out: pilot.0000
  * unit.000058: Done, exit:   0, out: pilot.0000
  * unit.000059: Done, exit:   0, out: pilot.0000
  * unit.000060: Done, exit:   0, out: pilot.0000
  * unit.000061: Done, exit:   0, out: pilot.0000
  * unit.000062: Done, exit:   0, out: pilot.0000
  * unit.000063: Done, exit:   0, out: pilot.0000
  * unit.000064: Done, exit:   0, out: pilot.0000
  * unit.000065: Done, exit:   0, out: pilot.0000
  * unit.000066: Done, exit:   0, out: pilot.0000
  * unit.000067: Done, exit:   0, out: pilot.0000
  * unit.000068: Done, exit:   0, out: pilot.0000
  * unit.000069: Done, exit:   0, out: pilot.0000
  * unit.000070: Done, exit:   0, out: pilot.0000
  * unit.000071: Done, exit:   0, out: pilot.0000
  * unit.000072: Done, exit:   0, out: pilot.0000
  * unit.000073: Done, exit:   0, out: pilot.0000
  * unit.000074: Done, exit:   0, out: pilot.0000
  * unit.000075: Done, exit:   0, out: pilot.0000
  * unit.000076: Done, exit:   0, out: pilot.0000
  * unit.000077: Done, exit:   0, out: pilot.0000
  * unit.000078: Done, exit:   0, out: pilot.0000
  * unit.000079: Done, exit:   0, out: pilot.0000
  * unit.000080: Done, exit:   0, out: pilot.0000
  * unit.000081: Done, exit:   0, out: pilot.0000
  * unit.000082: Done, exit:   0, out: pilot.0000
  * unit.000083: Done, exit:   0, out: pilot.0000
  * unit.000084: Done, exit:   0, out: pilot.0000
  * unit.000085: Done, exit:   0, out: pilot.0000
  * unit.000086: Done, exit:   0, out: pilot.0000
  * unit.000087: Done, exit:   0, out: pilot.0000
  * unit.000088: Done, exit:   0, out: pilot.0000
  * unit.000089: Done, exit:   0, out: pilot.0000
  * unit.000090: Done, exit:   0, out: pilot.0000
  * unit.000091: Done, exit:   0, out: pilot.0000
  * unit.000092: Done, exit:   0, out: pilot.0000
  * unit.000093: Done, exit:   0, out: pilot.0000
  * unit.000094: Done, exit:   0, out: pilot.0000
  * unit.000095: Done, exit:   0, out: pilot.0000
  * unit.000096: Done, exit:   0, out: pilot.0000
  * unit.000097: Done, exit:   0, out: pilot.0000
  * unit.000098: Done, exit:   0, out: pilot.0000
  * unit.000099: Done, exit:   0, out: pilot.0000
  * unit.000100: Done, exit:   0, out: pilot.0000
  * unit.000101: Done, exit:   0, out: pilot.0000
  * unit.000102: Done, exit:   0, out: pilot.0000
  * unit.000103: Done, exit:   0, out: pilot.0000
  * unit.000104: Done, exit:   0, out: pilot.0000
  * unit.000105: Done, exit:   0, out: pilot.0000
  * unit.000106: Done, exit:   0, out: pilot.0000
  * unit.000107: Done, exit:   0, out: pilot.0000
  * unit.000108: Done, exit:   0, out: pilot.0000
  * unit.000109: Done, exit:   0, out: pilot.0000
  * unit.000110: Done, exit:   0, out: pilot.0000
  * unit.000111: Done, exit:   0, out: pilot.0000
  * unit.000112: Done, exit:   0, out: pilot.0000
  * unit.000113: Done, exit:   0, out: pilot.0000
  * unit.000114: Done, exit:   0, out: pilot.0000
  * unit.000115: Done, exit:   0, out: pilot.0000
  * unit.000116: Done, exit:   0, out: pilot.0000
  * unit.000117: Done, exit:   0, out: pilot.0000
  * unit.000118: Done, exit:   0, out: pilot.0000
  * unit.000119: Done, exit:   0, out: pilot.0000
  * unit.000120: Done, exit:   0, out: pilot.0000
  * unit.000121: Done, exit:   0, out: pilot.0000
  * unit.000122: Done, exit:   0, out: pilot.0000
  * unit.000123: Done, exit:   0, out: pilot.0000
  * unit.000124: Done, exit:   0, out: pilot.0000
  * unit.000125: Done, exit:   0, out: pilot.0000
  * unit.000126: Done, exit:   0, out: pilot.0000
  * unit.000127: Done, exit:   0, out: pilot.0000

  * pilot.0000          : 128
  * total               : 128

--------------------------------------------------------------------------------
finalize                                                                        

closing session rp.session.mbp-eb.epcc.ed.ac.uk.elenabreitmoser.016787.0000    \
close pilot manager                                                            \
wait for 1 pilot(s) *                                                         ok
                                                                              ok
close unit manager                                                            ok
session lifetime: 84.6s                                                       ok

--------------------------------------------------------------------------------
ebreitmo commented 8 years ago

Hi there, any news?

andre-merzky commented 8 years ago

Alas, I am unable to reproduce the problem in any way. Vivek, anything new from your end?

vivek-bala commented 8 years ago

I tried again and can't reproduce this. Is there any chance of getting access to this machine, or to any machine where you can reproduce this?

ebreitmo commented 8 years ago

I also tried again and can't reproduce it. Both Iain and I now fail at a different stage. I will open a new ticket.
