radical-collaboration / hpc-workflows

NSF16514 EarthCube Project - Award Number:1639694

EnTK Pilot has failed #49

Closed Weiming-Hu closed 6 years ago

Weiming-Hu commented 6 years ago

When I ran `script_master.py` again, the pilot failed and the script exited with the errors below.

(virtual-python)[js-156-59] weiming ~/github/hpc-workflows/scripts/application_AnEn/anen_base-->python script_master.py 
2018-01-10 15:20:12,964: radical.pilot       : MainProcess                     : MainThread     : INFO    : python.interpreter   version: 2.7.5 (default, Aug 4 2017, 00:39:18) [GCC 4.8.5 20150623 (Red Hat 4.8.5-16)]
2018-01-10 15:20:12,965: radical.pilot       : MainProcess                     : MainThread     : INFO    :                      pid: 21287
2018-01-10 15:20:12,965: radical.pilot       : MainProcess                     : MainThread     : INFO    :                      tid: MainThread
2018-01-10 15:20:13,068: radical.entk.task_processor: MainProcess                     : MainThread     : INFO    : python.interpreter   version: 2.7.5 (default, Aug 4 2017, 00:39:18) [GCC 4.8.5 20150623 (Red Hat 4.8.5-16)]
2018-01-10 15:20:13,068: radical.entk.task_processor: MainProcess                     : MainThread     : INFO    :                      pid: 21287
2018-01-10 15:20:13,068: radical.entk.task_processor: MainProcess                     : MainThread     : INFO    :                      tid: MainThread
2018-01-10 15:20:13,153: radical.entk.resource_manager: MainProcess                     : MainThread     : INFO    : python.interpreter   version: 2.7.5 (default, Aug 4 2017, 00:39:18) [GCC 4.8.5 20150623 (Red Hat 4.8.5-16)]
2018-01-10 15:20:13,153: radical.entk.resource_manager: MainProcess                     : MainThread     : INFO    :                      pid: 21287
2018-01-10 15:20:13,153: radical.entk.resource_manager: MainProcess                     : MainThread     : INFO    :                      tid: MainThread
2018-01-10 15:20:13,263: radical.entk.resource_manager: MainProcess                     : MainThread     : INFO    : Resource Manager initialized
2018-01-10 15:20:13,263: radical.entk.resource_manager: MainProcess                     : MainThread     : INFO    : Resource description validated
2018-01-10 15:20:13,264: radical.entk.appmanager: MainProcess                     : MainThread     : INFO    : python.interpreter   version: 2.7.5 (default, Aug 4 2017, 00:39:18) [GCC 4.8.5 20150623 (Red Hat 4.8.5-16)]
2018-01-10 15:20:13,264: radical.entk.appmanager: MainProcess                     : MainThread     : INFO    :                      pid: 21287
2018-01-10 15:20:13,264: radical.entk.appmanager: MainProcess                     : MainThread     : INFO    :                      tid: MainThread
2018-01-10 15:20:13,389: radical.entk.appmanager: MainProcess                     : MainThread     : INFO    : Application Manager initialized
Create a task for generating observation raster at time 1 flt 1
Create a task for generating observation raster at time 1 flt 2
Create a task for generating observation raster at time 1 flt 3
Create a task for generating observation raster at time 1 flt 4
Create a task for generating observation raster at time 2 flt 1
Create a task for generating observation raster at time 2 flt 2
Create a task for generating observation raster at time 2 flt 3
Create a task for generating observation raster at time 2 flt 4
2018-01-10 15:20:13,390: radical.entk.appmanager: MainProcess                     : MainThread     : INFO    : Workflow assigned to Application Manager
2018-01-10 15:20:13,391: radical.entk.appmanager: MainProcess                     : MainThread     : INFO    : Setting up RabbitMQ system
2018-01-10 15:20:13,560: radical.entk.appmanager: MainProcess                     : MainThread     : INFO    : Starting resource request submission
new session: [rp.session.js-156-59.jetstream-cloud.org.weiming.017541.0000]    \
database   : [mongodb://138.201.86.166:27017/ee_exp_4c]                       ok
create pilot manager                                                          ok
create pilot description [xsede.supermic:40]                                  ok
submit 1 pilot(s)
        .2018-01-10 15:20:22,112: radical.entk.resource_manager: MainProcess                     : MainThread     : INFO    : Pilot pilot.0000 state: PMGR_LAUNCHING_PENDING
                                                                     ok
2018-01-10 15:20:22,114: radical.pilot.utils : MainProcess                     : MainThread     : INFO    : python.interpreter   version: 2.7.5 (default, Aug 4 2017, 00:39:18) [GCC 4.8.5 20150623 (Red Hat 4.8.5-16)]
2018-01-10 15:20:22,115: radical.pilot.utils : MainProcess                     : MainThread     : INFO    : radical.pilot.utils  version: 0.46.2
2018-01-10 15:20:22,115: radical.pilot.utils : MainProcess                     : MainThread     : INFO    :                      pid: 21287
2018-01-10 15:20:22,115: radical.pilot.utils : MainProcess                     : MainThread     : INFO    :                      tid: MainThread
2018-01-10 15:20:22,115: radical.entk.resource_manager: MainProcess                     : MainThread     : INFO    : Resource request submission successful.. waiting for pilot to go Active
2018-01-10 15:20:22,118: radical.entk.resource_manager: MainProcess                     : pmgr.0000.subscriber._state_sub_cb: INFO    : Pilot pilot.0000 state: PMGR_LAUNCHING
2018-01-10 15:20:41,872: radical.entk.resource_manager: MainProcess                     : pmgr.0000.subscriber._state_sub_cb: INFO    : Pilot pilot.0000 state: FAILED
2018-01-10 15:20:41,872: radical.entk.resource_manager: MainProcess                     : pmgr.0000.subscriber._state_sub_cb: ERROR   : Pilot has failed
2018-01-10 15:20:41,962: radical.entk.resource_manager: MainProcess                     : MainThread     : ERROR   : Resource request submission failed
2018-01-10 15:20:41,962: radical.entk.appmanager: MainProcess                     : MainThread     : ERROR   : Error in AppManager
Traceback (most recent call last):
  File "/home/weiming/virtual-python/lib/python2.7/site-packages/radical/entk/appman/appmanager.py", line 241, in run
    self._resource_manager._submit_resource_request()
  File "/home/weiming/virtual-python/lib/python2.7/site-packages/radical/entk/execman/resource_manager.py", line 344, in _submit_resource_request
    raise Exception
Exception

wait for 1 pilot(s)
                                                                              ok
closing session rp.session.js-156-59.jetstream-cloud.org.weiming.017541.0000   \
close pilot manager                                                            \
wait for 1 pilot(s)
                                                                         timeout
                                                                              ok
session lifetime: 29.2s                                                       ok
Execution failed, error: Error: 
Traceback (most recent call last):
  File "script_master.py", line 84, in <module>
    appman.run()
  File "/home/weiming/virtual-python/lib/python2.7/site-packages/radical/entk/appman/appmanager.py", line 469, in run
    raise Error(text=ex)
Error: Error: 

2018-01-10 15:20:43,154: radical.entk.resource_manager: MainProcess                     : MainThread     : ERROR   : Could not cancel resource request, error: can't route 'control_pubsub' notification: {'cmd': 'cancel_pilots', 'arg': {'uids': ['pilot.0000'], 'pmgr': 'pmgr.0000'}}
Traceback (most recent call last):
  File "script_master.py", line 89, in <module>
    appman.resource_terminate()
  File "/home/weiming/virtual-python/lib/python2.7/site-packages/radical/entk/appman/appmanager.py", line 480, in resource_terminate
    self._resource_manager._cancel_resource_request()
  File "/home/weiming/virtual-python/lib/python2.7/site-packages/radical/entk/execman/resource_manager.py", line 372, in _cancel_resource_request
    self._pilot.cancel()
  File "/home/weiming/virtual-python/lib/python2.7/site-packages/radical/pilot/compute_pilot.py", line 526, in cancel
    self._pmgr.cancel_pilots(self.uid)
  File "/home/weiming/virtual-python/lib/python2.7/site-packages/radical/pilot/pilot_manager.py", line 644, in cancel_pilots
    'uids' : uids}})
  File "/home/weiming/virtual-python/lib/python2.7/site-packages/radical/pilot/utils/component.py", line 1411, in publish
    raise RuntimeError("can't route '%s' notification: %s" % (pubsub, msg))
RuntimeError: can't route 'control_pubsub' notification: {'cmd': 'cancel_pilots', 'arg': {'uids': ['pilot.0000'], 'pmgr': 'pmgr.0000'}}
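
The log above contains two tracebacks. The first (`appman.run()`, line 84) reports the actual problem, the pilot going to `FAILED`. The second (`appman.resource_terminate()`, line 89) appears to be a follow-on error raised while cleaning up a pilot that was already gone. A minimal sketch of guarding the cleanup step so that the secondary error is logged rather than raised is shown here. `StubAppManager` and `run_with_guarded_cleanup` are hypothetical names used only for illustration; only the method names `run()` and `resource_terminate()` are taken from the traceback, and the stub does not reproduce EnTK's real behavior.

```python
import logging

logger = logging.getLogger("script_master_sketch")


class StubAppManager(object):
    """Hypothetical stand-in for the EnTK AppManager, used only in this sketch."""

    def run(self):
        # Mimic the first failure in the log: the pilot never becomes active.
        raise RuntimeError("Pilot has failed")

    def resource_terminate(self):
        # Mimic the second failure: the cancel request cannot be routed.
        raise RuntimeError("can't route 'control_pubsub' notification")


def run_with_guarded_cleanup(appman):
    """Run the workflow, then always attempt cleanup.

    Returns a tuple (run_error, cleanup_error); either entry may be None.
    """
    run_error = None
    cleanup_error = None
    try:
        appman.run()
    except Exception as ex:
        # Primary failure: this is the one worth reporting in the issue.
        run_error = ex
        logger.error("Execution failed: %s", ex)
    finally:
        try:
            appman.resource_terminate()
        except Exception as ex:
            # Secondary failure: record it, but do not let it hide the first.
            cleanup_error = ex
            logger.warning("Cleanup failed (secondary error): %s", ex)
    return run_error, cleanup_error


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    primary, secondary = run_with_guarded_cleanup(StubAppManager())
    print("primary error  :", primary)
    print("secondary error:", secondary)
```

With this pattern the output ends with the primary error instead of the `control_pubsub` traceback, which may make it easier to see that the pilot failure is the thing to debug.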

vivek-bala commented 6 years ago

I'm not sure whether we discussed this in the last call, but could you try again with a new virtual environment?
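
When retrying in a new virtual environment, it can help to record which versions ended up installed so they can be compared with the failing setup (the log above reports `radical.pilot.utils version: 0.46.2`). The sketch below is one possible way to do that. It assumes Python 3.8 or newer for `importlib.metadata`, whereas the log shows Python 2.7.5, and the distribution names in `PACKAGES` are assumptions that may need adjusting.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed distribution names; adjust to match the environment being checked.
PACKAGES = ("radical.entk", "radical.pilot", "radical.utils")


def installed_versions(names=PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print("%-16s %s" % (name, ver or "not installed"))
```

Running it inside both the old and the new environment gives two short listings that can be pasted into the issue side by side.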

Weiming-Hu commented 6 years ago

OK. It has been a while for me too. Let me follow up on this. Thank you...

Weiming-Hu commented 6 years ago

I'm going to close this for now. Once the modified algorithm is ready on my side, I'll check whether the error still occurs.