ibethune closed this issue 8 years ago
P.S. I see the same error at the end of a Gromacs/LSDMap workflow too
This error appears because the pilot ends up in a Failed state while cancelling (it is supposed to end up in Cancelled or Done if the pilot runs out of walltime). Could you post the .err files from the agent side (they should be in the pilot folder)?
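A quick way to see which agent-side .err files actually contain anything is to list them by size. The snippet below is only a minimal sketch using the Python standard library; the function name `list_err_files` is hypothetical and not part of RADICAL-Pilot, and it assumes the .err files sit directly in the pilot folder.

```python
from pathlib import Path


def list_err_files(pilot_dir):
    """Return (file name, size in bytes) for each *.err file in pilot_dir.

    Hypothetical helper, not part of RADICAL-Pilot. Results are sorted
    with the largest file first; ties are broken by file name.
    """
    entries = [(p.name, p.stat().st_size) for p in Path(pilot_dir).glob("*.err")]
    return sorted(entries, key=lambda item: (-item[1], item[0]))
```

Calling `list_err_files(".")` from inside the pilot folder on the remote machine would return the .err files with the largest one first, which makes it easy to pick out the ones worth posting.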
The only one with anything much in it is the agent_0.err:
2016-02-04 12:31:59,598: radical.saga : MainProcess : MainThread : INFO : python.interpreter version: 2.7.6 (default, Mar 10 2014, 14:13:45) [GCC 4.8.1 20130531 (Cray Inc.)]
2016-02-04 12:31:59,599: radical.saga : MainProcess : MainThread : INFO : pid: 4470
2016-02-04 12:31:59,599: radical.saga : MainProcess : MainThread : INFO : tid: MainThread
2016-02-04 12:31:59,599: radical.saga : MainProcess : MainThread : INFO : radical.saga version: 0.40
2016-02-04 12:31:59,609: radical.pilot : MainProcess : MainThread : INFO : python.interpreter version: 2.7.6 (default, Mar 10 2014, 14:13:45) [GCC 4.8.1 20130531 (Cray Inc.)]
2016-02-04 12:31:59,609: radical.pilot : MainProcess : MainThread : INFO : pid: 4470
2016-02-04 12:31:59,609: radical.pilot : MainProcess : MainThread : INFO : tid: MainThread
2016-02-04 12:31:59,609: radical.pilot : MainProcess : MainThread : INFO : radical.pilot version: 0.40
Exception KeyError: KeyError(139661983102816,) in <module 'threading' from '/work/y07/y07/cse/python/2.7.6/lib/python2.7/threading.pyc'> ignored
Hmm, could you post the contents of agent_0.out as well?
e290ib@eslogin002:/work/e290/e290/e290ib/radical.pilot.sandbox/rp.session.Iains-MBP.home.ibethune.016835.0002-pilot.0000> cat agent_0.out
---------------------------------------------------------------------
PYTHONPATH: ['/fs4/e290/e290/e290ib/radical.pilot.sandbox/rp.session.Iains-MBP.home.ibethune.016835.0002-pilot.0000/rp_install/bin', '/work/y07/y07/cse/python/modules/cython/0.21.1/lib/python2.7/site-packages/Cython-0.21.1-py2.7-linux-x86_64.egg', '/work/e290/shared/shared_pilot_ve_20150924/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg', '/work/e290/shared/shared_pilot_ve_20150924/lib/python2.7/site-packages/pip-1.3-py2.7.egg', '/work/e290/e290/e290ib/radical.pilot.sandbox/rp.session.Iains-MBP.home.ibethune.016835.0002-pilot.0000/rp_install/lib/python2.7/site-packages', '/fs4/e290/e290/e290ib/radical.pilot.sandbox/rp.session.Iains-MBP.home.ibethune.016835.0002-pilot.0000', '/work/y07/y07/cse/pycairo/1.10.0/lib/python2.7/site-packages', '/work/y07/y07/cse/pygobject/2.21.3/lib/python2.7/site-packages', '/work/y07/y07/cse/pygtk/2.24.0/lib/python2.7/site-packages/gtk-2.0', '/work/y07/y07/cse/yaml/pyyaml/3.11/lib/python2.7/site-packages', '/work/y07/y07/cse/python/modules/cython/0.21.1/lib/python2.7/site-packages', '/work/y07/y07/cse/mpi4py/1.3.1/lib/python2.7/site-packages', '/opt/cray/sdb/1.0-1.0502.58450.3.27.ari/lib64/py', '/usr/local/packages/cse/bolt/0.6/modules', '/work/y07/y07/cse/pygobject/2.21.3/lib/python2.7/site-packages/gtk-2.0', '/work/e290/shared/shared_pilot_ve_20150924/lib/python27.zip', '/work/e290/shared/shared_pilot_ve_20150924/lib/python2.7', '/work/e290/shared/shared_pilot_ve_20150924/lib/python2.7/plat-linux2', '/work/e290/shared/shared_pilot_ve_20150924/lib/python2.7/lib-tk', '/work/e290/shared/shared_pilot_ve_20150924/lib/python2.7/lib-old', '/work/e290/shared/shared_pilot_ve_20150924/lib/python2.7/lib-dynload', '/work/y07/y07/cse/python/2.7.6/lib/python2.7', '/work/y07/y07/cse/python/2.7.6/lib/python2.7/plat-linux2', '/work/y07/y07/cse/python/2.7.6/lib/python2.7/lib-tk', '/work/e290/shared/shared_pilot_ve_20150924/lib/python2.7/site-packages', '/opt/cray/sdb/1.0-1.0502.58450.3.27.ari/lib64/py']
python: 2.7.6 (default, Mar 10 2014, 14:13:45)
[GCC 4.8.1 20130531 (Cray Inc.)]
utils : 0.40 : /work/e290/e290/e290ib/radical.pilot.sandbox/rp.session.Iains-MBP.home.ibethune.016835.0002-pilot.0000/rp_install/lib/python2.7/site-packages/radical/utils/__init__.pyc
saga : 0.40 : /work/e290/e290/e290ib/radical.pilot.sandbox/rp.session.Iains-MBP.home.ibethune.016835.0002-pilot.0000/rp_install/lib/python2.7/site-packages/saga/__init__.pyc
pilot : 0.40 : /work/e290/e290/e290ib/radical.pilot.sandbox/rp.session.Iains-MBP.home.ibethune.016835.0002-pilot.0000/rp_install/lib/python2.7/site-packages/radical/pilot/__init__.pyc
type : multicore
gitid : $Id$
---------------------------------------------------------------------
startup agent agent_0 : /fs4/e290/e290/e290ib/radical.pilot.sandbox/rp.session.Iains-MBP.home.ibethune.016835.0002-pilot.0000/agent_0.cfg
Agent config (/fs4/e290/e290/e290ib/radical.pilot.sandbox/rp.session.Iains-MBP.home.ibethune.016835.0002-pilot.0000/agent_0.cfg):
{'agent_launch_method': 'APRUN',
'agent_layout': {'agent_0': {'bridges': ['agent_staging_input_queue',
'agent_scheduling_queue',
'agent_executing_queue',
'agent_staging_output_queue',
'agent_unschedule_pubsub',
'agent_reschedule_pubsub',
'agent_command_pubsub',
'agent_state_pubsub'],
'components': {'AgentExecutingComponent': 1,
'AgentSchedulingComponent': 1,
'AgentStagingInputComponent': 1,
'AgentStagingOutputComponent': 1},
'pull_units': True,
'sub_agents': [],
'target': 'local'}},
'agent_name': 'agent_0',
'bulk_collection_time': 1.0,
'clone': {'AgentExecutingComponent': {'input': 1, 'output': 1},
'AgentSchedulingComponent': {'input': 1, 'output': 1},
'AgentStagingInputComponent': {'input': 1, 'output': 1},
'AgentStagingOutputComponent': {'input': 1, 'output': 1},
'AgentWorker': {'input': 1, 'output': 1}},
'cores': 24,
'db_poll_sleeptime': 0.1,
'debug': 40,
'drop': {'AgentExecutingComponent': {'input': 1, 'output': 1},
'AgentSchedulingComponent': {'input': 1, 'output': 1},
'AgentStagingInputComponent': {'input': 1, 'output': 1},
'AgentStagingOutputComponent': {'input': 1, 'output': 1},
'AgentWorker': {'input': 1, 'output': 1}},
'heartbeat_interval': 10,
'lrms': 'PBSPRO',
'max_io_loglength': 1024,
'mongodb_url': 'mongodb://extasy:extasyproject@extasy-db.epcc.ed.ac.uk/radicalpilot',
'mpi_launch_method': 'APRUN',
'pilot_id': 'pilot.0000',
'runtime': 20,
'scheduler': 'CONTINUOUS',
'session_id': 'rp.session.Iains-MBP.home.ibethune.016835.0002',
'spawner': 'POPEN',
'staging_area': 'staging_area',
'staging_scheme': 'staging',
'task_launch_method': 'APRUN'}
FAILED startup
File "/work/e290/e290/e290ib/radical.pilot.sandbox/rp.session.Iains-MBP.home.ibethune.016835.0002-pilot.0000/rp_install/bin/radical-pilot-agent-multicore.py", line 665, in <module>
bootstrap_3()
File "/work/e290/e290/e290ib/radical.pilot.sandbox/rp.session.Iains-MBP.home.ibethune.016835.0002-pilot.0000/rp_install/bin/radical-pilot-agent-multicore.py", line 641, in bootstrap_3
pilot_FAILED(mongo_p, pilot_id, log, "FAILED startup")
File "/work/e290/e290/e290ib/radical.pilot.sandbox/rp.session.Iains-MBP.home.ibethune.016835.0002-pilot.0000/rp_install/bin/radical-pilot-agent-multicore.py", line 194, in pilot_FAILED
print ru.get_trace()
bootstrap_3 done
atexit
This is fixed in the RP devel branch.
I am using the latest master version of enmd and running the standard COCO/AMBER workflow on ARCHER (the latest one from extasy-data).
Everything goes fine until the workflow starts to terminate. I then see the following output, which suggests this could be related to the use of shared data that Vivek just implemented?
If you need any further data to help debug, let me know, but I didn't see anything suspicious in the output on the ARCHER side.