radical-cybertools / ExTASY

MDEnsemble
Other
1 stars 1 forks source link

SingleClusterEnvironment error on shutdown #239

Closed ibethune closed 8 years ago

ibethune commented 8 years ago

Uusing the latest master version of enmd and running the standard COCO/AMBER workflow on ARCHER (the latest one from extasy-data).

(extasy-test)mbp-ib:coam-on-archer ibethune$ ensemblemd-version 
0.3.14-27-g65bc062

Everything goes fine up till the point when the workflow starts to terminate. I see the following output which is suggestive this could be related to the use of shared data that Vivek just implemented?

If you need any further data to help debug, let me know but I didn't see anything suspicious in the output on the ARCHER side.

Starting Deallocation..
2016-02-04 12:19:42,387: radical.enmd.SingleClusterEnvironment: MainProcess                     : MainThread     : INFO    : Deallocating Cluster
closing session rp.session.Iains-MBP.home.ibethune.016835.00012016-02-04 12:19:42,388: radical.pilot       : MainProcess                     : MainThread     : DEBUG   : session rp.session.Iains-MBP.home.ibethune.016835.0001 closing
2016-02-04 12:19:42,388: radical.pilot       : MainProcess                     : MainThread     : DEBUG   : session rp.session.Iains-MBP.home.ibethune.016835.0001 closes   pmgr   pmgr.0000
2016-02-04 12:19:42,388: radical.pilot       : MainProcess                     : MainThread     : DEBUG   : pmgr    pmgr.0000 closing
                 \
close pilot manager2016-02-04 12:19:42,388: radical.pilot       : MainProcess                     : MainThread     : DEBUG   : pmgr    pmgr.0000 cancel   launcher Thread-1
2016-02-04 12:19:42,389: radical.pilot       : MainProcess                     : MainThread     : DEBUG   : pworker Thread-1 disables launcher PilotLauncherWorker-1
2016-02-04 12:19:42,389: radical.pilot       : MainProcess                     : MainThread     : DEBUG   : launcher PilotLauncherWorker-1 disabling
2016-02-04 12:19:42,389: radical.pilot       : MainProcess                     : MainThread     : DEBUG   : launcher PilotLauncherWorker-1 disabled
2016-02-04 12:19:42,389: radical.pilot       : MainProcess                     : MainThread     : DEBUG   : pworker Thread-1 disabled launcher PilotLauncherWorker-1
2016-02-04 12:19:42,389: radical.pilot       : MainProcess                     : MainThread     : DEBUG   : pmgr    pmgr.0000 canceled launcher Thread-1
2016-02-04 12:19:42,389: radical.pilot       : MainProcess                     : MainThread     : DEBUG   : Reconnected to existing ComputePilot {'uid': 'pilot.0000', 'stdout': None, 'start_time': 1454588052.059059, 'resource_detail': {'cores_per_node': None, 'nodes': None}, 'submission_time': 1454587902.906514, 'logfile': None, 'resource': u'epsrc.archer', 'log': [<radical.pilot.logentry.Logentry object at 0x108010f10>, <radical.pilot.logentry.Logentry object at 0x10816bd10>, <radical.pilot.logentry.Logentry object at 0x10816bcd0>, <radical.pilot.logentry.Logentry object at 0x1081baf90>, <radical.pilot.logentry.Logentry object at 0x1081bafd0>, <radical.pilot.logentry.Logentry object at 0x1081ba210>, <radical.pilot.logentry.Logentry object at 0x1081ba250>, <radical.pilot.logentry.Logentry object at 0x1081ba6d0>, <radical.pilot.logentry.Logentry object at 0x1081ba810>], 'sandbox': u'sftp://login.archer.ac.uk/work/e290/e290/e290ib/radical.pilot.sandbox/rp.session.Iains-MBP.home.ibethune.016835.0001-pilot.0000/', 'state': u'Active', 'stop_time': None, 'stderr': None}
2016-02-04 12:19:42,389: radical.pilot       : MainProcess                     : MainThread     : DEBUG   : pmgr    pmgr.0000 cancels  pilot  pilot.0000
2016-02-04 12:19:42,467: radical.pilot       : MainProcess                     : MainThread     : INFO    : Sent 'COMMAND_CANCEL_PILOT' command to pilots ['pilot.0000'].
2016-02-04 12:19:42,468: radical.pilot       : MainProcess                     : MainThread     : DEBUG   : delay to actively cancel pilot pilot.0000: state Active
 -2016-02-04 12:19:43,703: radical.pilot       : MainProcess                     : PilotLauncherWorker-1: INFO    : Performing periodical health check for pilot.0000 (SAGA job id [pbspro+ssh://login.archer.ac.uk/]-[3488798])
2016-02-04 12:19:43,703: radical.pilot       : MainProcess                     : PilotLauncherWorker-1: INFO    : pilot pilot.0000 seems alive and well
\2016-02-04 12:19:45,278: radical.pilot       : MainProcess                     : Thread-3       : DEBUG   : uworker Thread-3 stops   itransfer InputFileTransferWorker-1
2016-02-04 12:19:45,278: radical.pilot       : MainProcess                     : Thread-3       : DEBUG   : itransfer InputFileTransferWorker-1 stopping
2016-02-04 12:19:45,278: radical.pilot       : MainProcess                     : Thread-3       : DEBUG   : itransfer InputFileTransferWorker-1 stopped
2016-02-04 12:19:45,278: radical.pilot       : MainProcess                     : Thread-3       : DEBUG   : uworker Thread-3 stopped itransfer InputFileTransferWorker-1
2016-02-04 12:19:45,278: radical.pilot       : MainProcess                     : Thread-3       : DEBUG   : uworker Thread-3 stops   itransfer InputFileTransferWorker-2
2016-02-04 12:19:45,278: radical.pilot       : MainProcess                     : Thread-3       : DEBUG   : itransfer InputFileTransferWorker-2 stopping
2016-02-04 12:19:45,278: radical.pilot       : MainProcess                     : Thread-3       : DEBUG   : itransfer InputFileTransferWorker-2 stopped
2016-02-04 12:19:45,278: radical.pilot       : MainProcess                     : Thread-3       : DEBUG   : uworker Thread-3 stopped itransfer InputFileTransferWorker-2
2016-02-04 12:19:45,278: radical.pilot       : MainProcess                     : Thread-3       : DEBUG   : uworker Thread-3 stops   otransfer OutputFileTransferWorker-1
2016-02-04 12:19:45,279: radical.pilot       : MainProcess                     : Thread-3       : DEBUG   : otransfer OutputFileTransferWorker-1 stopping
2016-02-04 12:19:45,279: radical.pilot       : MainProcess                     : Thread-3       : DEBUG   : otransfer OutputFileTransferWorker-1 stopped
2016-02-04 12:19:45,279: radical.pilot       : MainProcess                     : Thread-3       : DEBUG   : uworker Thread-3 stopped otransfer OutputFileTransferWorker-1
2016-02-04 12:19:45,279: radical.pilot       : MainProcess                     : Thread-3       : DEBUG   : uworker Thread-3 stops   otransfer OutputFileTransferWorker-2
2016-02-04 12:19:45,279: radical.pilot       : MainProcess                     : Thread-3       : DEBUG   : otransfer OutputFileTransferWorker-2 stopping
2016-02-04 12:19:45,279: radical.pilot       : MainProcess                     : Thread-3       : DEBUG   : otransfer OutputFileTransferWorker-2 stopped
2016-02-04 12:19:45,279: radical.pilot       : MainProcess                     : Thread-3       : DEBUG   : uworker Thread-3 stopped otransfer OutputFileTransferWorker-2
|2016-02-04 12:19:46,906: radical.pilot       : MainProcess                     : PilotLauncherWorker-1: INFO    : Performing periodical health check for pilot.0000 (SAGA job id [pbspro+ssh://login.archer.ac.uk/]-[3488798])
2016-02-04 12:19:46,906: radical.pilot       : MainProcess                     : PilotLauncherWorker-1: INFO    : pilot pilot.0000 seems alive and well
|2016-02-04 12:19:48,042: radical.pilot       : MainProcess                     : Thread-1       : INFO    : ComputePilot 'pilot.0000' state changed from 'Active' to 'Failed'.
2016-02-04 12:19:48,042: radical.enmd.SingleClusterEnvironment: MainProcess                     : Thread-1       : INFO    : Resource epsrc.archer state has changed to Failed
2016-02-04 12:19:48,042: radical.enmd.SingleClusterEnvironment: MainProcess                     : Thread-1       : ERROR   : Resource error: 
2016-02-04 12:19:48,042: radical.enmd.SingleClusterEnvironment: MainProcess                     : Thread-1       : ERROR   : Pattern execution FAILED.
2016-02-04 12:19:48,042: radical.pilot       : MainProcess                     : Thread-1       : ERROR   : sys.exit from callback
Traceback (most recent call last):
  File "/Users/ibethune/Desktop/extasy-test/lib/python2.7/site-packages/radical/pilot/controller/pilot_manager_controller.py", line 258, in call_callbacks
    cb(self._shared_data[pilot_id]['facade_object'](), new_state)
  File "/Users/ibethune/Desktop/extasy-test/lib/python2.7/site-packages/radical/ensemblemd/single_cluster_environment.py", line 168, in pilot_state_cb
    sys.exit(1)
SystemExit: 1
Traceback (most recent call last):
  File "extasy_amber_coco.py", line 209, in <module>
    cluster.deallocate()
  File "/Users/ibethune/Desktop/extasy-test/lib/python2.7/site-packages/radical/ensemblemd/single_cluster_environment.py", line 117, in deallocate
    self._session.close(cleanup=self._cleanup)
  File "/Users/ibethune/Desktop/extasy-test/lib/python2.7/site-packages/radical/pilot/session.py", line 304, in close
    pmgr.close (terminate=terminate)
  File "/Users/ibethune/Desktop/extasy-test/lib/python2.7/site-packages/radical/pilot/pilot_manager.py", line 175, in close
    self.cancel_pilots()
  File "/Users/ibethune/Desktop/extasy-test/lib/python2.7/site-packages/radical/pilot/pilot_manager.py", line 579, in cancel_pilots
    self._worker.register_cancel_pilots_request(pilot_ids=pilot_ids)
  File "/Users/ibethune/Desktop/extasy-test/lib/python2.7/site-packages/radical/pilot/controller/pilot_manager_controller.py", line 608, in register_cancel_pilots_request
    time.sleep(0.3)
ibethune commented 8 years ago

P.S. I see the same error at the end of a Gromacs/LSDMap workflow too

vivek-bala commented 8 years ago

This error pops up because the pilot is ending up in a Failed state while cancelling (it is supposed to end up in Cancelled or Done if the pilot runs out of walltime). Could you post the .err files from the agent side (they should be in the pilot folder) ?

ibethune commented 8 years ago

The only one with anything much in it is the agent_0.err:

2016-02-04 12:31:59,598: radical.saga        : MainProcess                     : MainThread     : INFO    : python.interpreter   version: 2.7.6 (default, Mar 10 2014, 14:13:45) [GCC 4.8.1 20130531 (Cray Inc.)]
2016-02-04 12:31:59,599: radical.saga        : MainProcess                     : MainThread     : INFO    :                      pid: 4470
2016-02-04 12:31:59,599: radical.saga        : MainProcess                     : MainThread     : INFO    :                      tid: MainThread
2016-02-04 12:31:59,599: radical.saga        : MainProcess                     : MainThread     : INFO    : radical.saga         version: 0.40
2016-02-04 12:31:59,609: radical.pilot       : MainProcess                     : MainThread     : INFO    : python.interpreter   version: 2.7.6 (default, Mar 10 2014, 14:13:45) [GCC 4.8.1 20130531 (Cray Inc.)]
2016-02-04 12:31:59,609: radical.pilot       : MainProcess                     : MainThread     : INFO    :                      pid: 4470
2016-02-04 12:31:59,609: radical.pilot       : MainProcess                     : MainThread     : INFO    :                      tid: MainThread
2016-02-04 12:31:59,609: radical.pilot       : MainProcess                     : MainThread     : INFO    : radical.pilot        version: 0.40
Exception KeyError: KeyError(139661983102816,) in <module 'threading' from '/work/y07/y07/cse/python/2.7.6/lib/python2.7/threading.pyc'> ignored
vivek-bala commented 8 years ago

hmm.. could you post the contents of agent_0.out as well ?

ibethune commented 8 years ago
e290ib@eslogin002:/work/e290/e290/e290ib/radical.pilot.sandbox/rp.session.Iains-MBP.home.ibethune.016835.0002-pilot.0000> cat agent_0.out 
---------------------------------------------------------------------

PYTHONPATH: ['/fs4/e290/e290/e290ib/radical.pilot.sandbox/rp.session.Iains-MBP.home.ibethune.016835.0002-pilot.0000/rp_install/bin', '/work/y07/y07/cse/python/modules/cython/0.21.1/lib/python2.7/site-packages/Cython-0.21.1-py2.7-linux-x86_64.egg', '/work/e290/shared/shared_pilot_ve_20150924/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg', '/work/e290/shared/shared_pilot_ve_20150924/lib/python2.7/site-packages/pip-1.3-py2.7.egg', '/work/e290/e290/e290ib/radical.pilot.sandbox/rp.session.Iains-MBP.home.ibethune.016835.0002-pilot.0000/rp_install/lib/python2.7/site-packages', '/fs4/e290/e290/e290ib/radical.pilot.sandbox/rp.session.Iains-MBP.home.ibethune.016835.0002-pilot.0000', '/work/y07/y07/cse/pycairo/1.10.0/lib/python2.7/site-packages', '/work/y07/y07/cse/pygobject/2.21.3/lib/python2.7/site-packages', '/work/y07/y07/cse/pygtk/2.24.0/lib/python2.7/site-packages/gtk-2.0', '/work/y07/y07/cse/yaml/pyyaml/3.11/lib/python2.7/site-packages', '/work/y07/y07/cse/python/modules/cython/0.21.1/lib/python2.7/site-packages', '/work/y07/y07/cse/mpi4py/1.3.1/lib/python2.7/site-packages', '/opt/cray/sdb/1.0-1.0502.58450.3.27.ari/lib64/py', '/usr/local/packages/cse/bolt/0.6/modules', '/work/y07/y07/cse/pygobject/2.21.3/lib/python2.7/site-packages/gtk-2.0', '/work/e290/shared/shared_pilot_ve_20150924/lib/python27.zip', '/work/e290/shared/shared_pilot_ve_20150924/lib/python2.7', '/work/e290/shared/shared_pilot_ve_20150924/lib/python2.7/plat-linux2', '/work/e290/shared/shared_pilot_ve_20150924/lib/python2.7/lib-tk', '/work/e290/shared/shared_pilot_ve_20150924/lib/python2.7/lib-old', '/work/e290/shared/shared_pilot_ve_20150924/lib/python2.7/lib-dynload', '/work/y07/y07/cse/python/2.7.6/lib/python2.7', '/work/y07/y07/cse/python/2.7.6/lib/python2.7/plat-linux2', '/work/y07/y07/cse/python/2.7.6/lib/python2.7/lib-tk', '/work/e290/shared/shared_pilot_ve_20150924/lib/python2.7/site-packages', '/opt/cray/sdb/1.0-1.0502.58450.3.27.ari/lib64/py']
python: 2.7.6 (default, Mar 10 2014, 14:13:45) 
[GCC 4.8.1 20130531 (Cray Inc.)]
utils : 0.40  : /work/e290/e290/e290ib/radical.pilot.sandbox/rp.session.Iains-MBP.home.ibethune.016835.0002-pilot.0000/rp_install/lib/python2.7/site-packages/radical/utils/__init__.pyc
saga  : 0.40  : /work/e290/e290/e290ib/radical.pilot.sandbox/rp.session.Iains-MBP.home.ibethune.016835.0002-pilot.0000/rp_install/lib/python2.7/site-packages/saga/__init__.pyc
pilot : 0.40  : /work/e290/e290/e290ib/radical.pilot.sandbox/rp.session.Iains-MBP.home.ibethune.016835.0002-pilot.0000/rp_install/lib/python2.7/site-packages/radical/pilot/__init__.pyc
        type  : multicore
        gitid : $Id$

---------------------------------------------------------------------

startup agent agent_0 : /fs4/e290/e290/e290ib/radical.pilot.sandbox/rp.session.Iains-MBP.home.ibethune.016835.0002-pilot.0000/agent_0.cfg
Agent config (/fs4/e290/e290/e290ib/radical.pilot.sandbox/rp.session.Iains-MBP.home.ibethune.016835.0002-pilot.0000/agent_0.cfg):
{'agent_launch_method': 'APRUN',
 'agent_layout': {'agent_0': {'bridges': ['agent_staging_input_queue',
                                          'agent_scheduling_queue',
                                          'agent_executing_queue',
                                          'agent_staging_output_queue',
                                          'agent_unschedule_pubsub',
                                          'agent_reschedule_pubsub',
                                          'agent_command_pubsub',
                                          'agent_state_pubsub'],
                              'components': {'AgentExecutingComponent': 1,
                                             'AgentSchedulingComponent': 1,
                                             'AgentStagingInputComponent': 1,
                                             'AgentStagingOutputComponent': 1},
                              'pull_units': True,
                              'sub_agents': [],
                              'target': 'local'}},
 'agent_name': 'agent_0',
 'bulk_collection_time': 1.0,
 'clone': {'AgentExecutingComponent': {'input': 1, 'output': 1},
           'AgentSchedulingComponent': {'input': 1, 'output': 1},
           'AgentStagingInputComponent': {'input': 1, 'output': 1},
           'AgentStagingOutputComponent': {'input': 1, 'output': 1},
           'AgentWorker': {'input': 1, 'output': 1}},
 'cores': 24,
 'db_poll_sleeptime': 0.1,
 'debug': 40,
 'drop': {'AgentExecutingComponent': {'input': 1, 'output': 1},
          'AgentSchedulingComponent': {'input': 1, 'output': 1},
          'AgentStagingInputComponent': {'input': 1, 'output': 1},
          'AgentStagingOutputComponent': {'input': 1, 'output': 1},
          'AgentWorker': {'input': 1, 'output': 1}},
 'heartbeat_interval': 10,
 'lrms': 'PBSPRO',
 'max_io_loglength': 1024,
 'mongodb_url': 'mongodb://extasy:extasyproject@extasy-db.epcc.ed.ac.uk/radicalpilot',
 'mpi_launch_method': 'APRUN',
 'pilot_id': 'pilot.0000',
 'runtime': 20,
 'scheduler': 'CONTINUOUS',
 'session_id': 'rp.session.Iains-MBP.home.ibethune.016835.0002',
 'spawner': 'POPEN',
 'staging_area': 'staging_area',
 'staging_scheme': 'staging',
 'task_launch_method': 'APRUN'}

FAILED startup
  File "/work/e290/e290/e290ib/radical.pilot.sandbox/rp.session.Iains-MBP.home.ibethune.016835.0002-pilot.0000/rp_install/bin/radical-pilot-agent-multicore.py", line 665, in <module>
    bootstrap_3()
  File "/work/e290/e290/e290ib/radical.pilot.sandbox/rp.session.Iains-MBP.home.ibethune.016835.0002-pilot.0000/rp_install/bin/radical-pilot-agent-multicore.py", line 641, in bootstrap_3
    pilot_FAILED(mongo_p, pilot_id, log, "FAILED startup")
  File "/work/e290/e290/e290ib/radical.pilot.sandbox/rp.session.Iains-MBP.home.ibethune.016835.0002-pilot.0000/rp_install/bin/radical-pilot-agent-multicore.py", line 194, in pilot_FAILED
    print ru.get_trace()

bootstrap_3 done
atexit
vivek-bala commented 8 years ago

https://github.com/radical-cybertools/radical.pilot/issues/956

ibethune commented 8 years ago

This is fixed in RP devel branch.