radical-collaboration / hpc-workflows

NSF16514 EarthCube Project - Award Number:1639694

EnTK doesn't proceed for pipeline #44

Closed Weiming-Hu closed 6 years ago

Weiming-Hu commented 7 years ago

Hi,

This is where EnTK has gotten stuck on each of my last three attempts:

[js-156-59] weiming ~/github/hpc-workflows/scripts/application_AnEn/anen_base-->source ~/virtual-python/bin/activate
(virtual-python)[js-156-59] weiming ~/github/hpc-workflows/scripts/application_AnEn/anen_base-->python script_master.py 
Create a task for generating observation raster at time 1 flt 1
Create a task for generating observation raster at time 1 flt 2
Create a task for generating observation raster at time 1 flt 3
Create a task for generating observation raster at time 1 flt 4
Create a task for generating observation raster at time 2 flt 1
Create a task for generating observation raster at time 2 flt 2
Create a task for generating observation raster at time 2 flt 3
Create a task for generating observation raster at time 2 flt 4
new session: [rp.session.js-156-59.jetstream-cloud.org.weiming.017455.0000]    \
database   : [mongodb://138.201.86.166:27017/ee_exp_4c]                       ok
create pilot manager                                                          ok
submit 1 pilot(s)
        .                                                                     ok
True
Active pipes:  1
WFP incomplete:  True
create unit manager                                                           ok
add 1 pilot(s)                                                                ok

After one hour, it still sits at the same place.

My allocation setting is up to date with the scripts in the repo if you'd like to check it.

Thank you.

vivek-bala commented 7 years ago

Hey Weiming.. Can you rerun it with the verbosity level on please? Also set export RADICAL_PILOT_VERBOSE=INFO. Let's pick this up when we talk today.
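One way to make sure the variable is really set for the run is to launch the script from a small wrapper. This is only a sketch: the helper names `verbose_env` and `run_with_verbosity` are made up for illustration, and only `RADICAL_PILOT_VERBOSE=INFO` comes from the comment above.

```python
import os
import subprocess
import sys


def verbose_env(level="INFO", base=None):
    """Return a copy of the environment with RADICAL_PILOT_VERBOSE set.

    `base` defaults to the current process environment; it is a parameter
    only so the helper can be exercised without touching os.environ.
    """
    env = dict(os.environ if base is None else base)
    env["RADICAL_PILOT_VERBOSE"] = level
    return env


def run_with_verbosity(script, level="INFO"):
    """Run `script` with the current interpreter and verbose RP logging.

    Returns the exit code of the child process.
    """
    return subprocess.call([sys.executable, script], env=verbose_env(level))
```

Calling `run_with_verbosity("script_master.py")` from the activated virtualenv should have the same effect as exporting the variable in the shell and then running the script.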

Weiming-Hu commented 7 years ago

OK. So I'll set the verbose info and try again. Thank you.

Weiming-Hu commented 7 years ago

This has been resolved with the help from Vivek.

Notes:

To restart RabbitMQ, run the following commands:

rabbitmqctl stop_app
rabbitmqctl reset
rabbitmqctl start_app

I had a quick chat with Manuel here, who is more familiar with Docker. His advice was to create a new Docker instance every time, simply because stopping RabbitMQ (when it runs inside a Docker process) might kill the Docker process itself. So I recommend creating a new Docker instance in case you see that issue again. The command you currently use picks the port number at random.

You can specify it to be 32773 as follows (be sure to kill the older docker instance):

docker run -d --name rabbit-1 -p 32773:5672 rabbitmq:3
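After recreating the container, a quick check that something is listening on the mapped port can save a long wait. The sketch below only tests TCP reachability; it does not confirm that RabbitMQ itself is healthy. The function name `port_is_open` and the host `localhost` are assumptions for illustration; port 32773 is the one from the command above.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        # create_connection raises OSError (e.g. connection refused, timeout)
        # when nothing accepts the connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `port_is_open("localhost", 32773)` should return True once the new container is up and the port mapping is in place.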

Weiming-Hu commented 7 years ago

Hi Vivek, it looks like the same problem has occurred again. I stopped and removed the RabbitMQ instance and created a new one, but that didn't help.

(virtual-python)[js-156-59] weiming ~/github/hpc-workflows/scripts/application_AnEn/anen_base-->python script_master.py 
2017-10-26 15:13:33,531: radical.pilot       : MainProcess                     : MainThread     : INFO    : python.interpreter   version: 2.7.5 (default, Aug 4 2017, 00:39:18) [GCC 4.8.5 20150623 (Red Hat 4.8.5-16)]
2017-10-26 15:13:33,531: radical.pilot       : MainProcess                     : MainThread     : INFO    :                      pid: 8843
2017-10-26 15:13:33,531: radical.pilot       : MainProcess                     : MainThread     : INFO    :                      tid: MainThread
2017-10-26 15:13:33,677: radical.entk.task_processor: MainProcess                     : MainThread     : INFO    : python.interpreter   version: 2.7.5 (default, Aug 4 2017, 00:39:18) [GCC 4.8.5 20150623 (Red Hat 4.8.5-16)]
2017-10-26 15:13:33,677: radical.entk.task_processor: MainProcess                     : MainThread     : INFO    :                      pid: 8843
2017-10-26 15:13:33,677: radical.entk.task_processor: MainProcess                     : MainThread     : INFO    :                      tid: MainThread
2017-10-26 15:13:34,051: radical.entk.resource_manager: MainProcess                     : MainThread     : INFO    : python.interpreter   version: 2.7.5 (default, Aug 4 2017, 00:39:18) [GCC 4.8.5 20150623 (Red Hat 4.8.5-16)]
2017-10-26 15:13:34,051: radical.entk.resource_manager: MainProcess                     : MainThread     : INFO    :                      pid: 8843
2017-10-26 15:13:34,051: radical.entk.resource_manager: MainProcess                     : MainThread     : INFO    :                      tid: MainThread
2017-10-26 15:13:34,157: radical.entk.resource_manager: MainProcess                     : MainThread     : INFO    : Resource Manager initialized
2017-10-26 15:13:34,157: radical.entk.resource_manager: MainProcess                     : MainThread     : INFO    : Resource description validated
2017-10-26 15:13:34,158: radical.entk.appmanager: MainProcess                     : MainThread     : INFO    : python.interpreter   version: 2.7.5 (default, Aug 4 2017, 00:39:18) [GCC 4.8.5 20150623 (Red Hat 4.8.5-16)]
2017-10-26 15:13:34,158: radical.entk.appmanager: MainProcess                     : MainThread     : INFO    :                      pid: 8843
2017-10-26 15:13:34,158: radical.entk.appmanager: MainProcess                     : MainThread     : INFO    :                      tid: MainThread
2017-10-26 15:13:34,264: radical.entk.appmanager: MainProcess                     : MainThread     : INFO    : Application Manager initialized
Create a task for generating observation raster at time 1 flt 1
Create a task for generating observation raster at time 1 flt 2
Create a task for generating observation raster at time 1 flt 3
Create a task for generating observation raster at time 1 flt 4
Create a task for generating observation raster at time 2 flt 1
Create a task for generating observation raster at time 2 flt 2
Create a task for generating observation raster at time 2 flt 3
Create a task for generating observation raster at time 2 flt 4
2017-10-26 15:13:34,267: radical.entk.appmanager: MainProcess                     : MainThread     : INFO    : Workflow assigned to Application Manager
2017-10-26 15:13:34,267: radical.entk.appmanager: MainProcess                     : MainThread     : INFO    : Setting up RabbitMQ system
2017-10-26 15:13:34,495: radical.entk.appmanager: MainProcess                     : MainThread     : INFO    : Starting resource request submission
new session: [rp.session.js-156-59.jetstream-cloud.org.weiming.017465.0005]    \
database   : [mongodb://138.201.86.166:27017/ee_exp_4c]                       ok
create pilot manager                                                          ok
create pilot description [xsede.supermic:40]                                  ok
submit 1 pilot(s)
        .2017-10-26 15:13:42,053: radical.entk.resource_manager: MainProcess                     : MainThread     : INFO    : Pilot pilot.0000 state: PMGR_LAUNCHING_PENDING
                                                                     ok
2017-10-26 15:13:42,055: radical.pilot.utils : MainProcess                     : MainThread     : INFO    : python.interpreter   version: 2.7.5 (default, Aug 4 2017, 00:39:18) [GCC 4.8.5 20150623 (Red Hat 4.8.5-16)]
2017-10-26 15:13:42,055: radical.pilot.utils : MainProcess                     : MainThread     : INFO    : radical.pilot.utils  version: 0.47-v0.46.2-16-g37aa40b@devel
2017-10-26 15:13:42,055: radical.pilot.utils : MainProcess                     : MainThread     : INFO    :                      pid: 8843
2017-10-26 15:13:42,056: radical.pilot.utils : MainProcess                     : MainThread     : INFO    :                      tid: MainThread
2017-10-26 15:13:42,056: radical.entk.resource_manager: MainProcess                     : MainThread     : INFO    : Resource request submission successful.. waiting for pilot to go Active
2017-10-26 15:13:42,059: radical.entk.resource_manager: MainProcess                     : pmgr.0000.subscriber._state_sub_cb: INFO    : Pilot pilot.0000 state: PMGR_LAUNCHING
2017-10-26 15:14:02,959: radical.entk.resource_manager: MainProcess                     : pmgr.0000.subscriber._state_sub_cb: INFO    : Pilot pilot.0000 state: PMGR_ACTIVE_PENDING
2017-10-26 15:15:03,147: radical.entk.resource_manager: MainProcess                     : pmgr.0000.subscriber._state_sub_cb: INFO    : Pilot pilot.0000 state: PMGR_ACTIVE
2017-10-26 15:15:03,148: radical.entk.resource_manager: MainProcess                     : pmgr.0000.subscriber._state_sub_cb: INFO    : Pilot pilot.0000 state: DONE
^Cclosing session rp.session.js-156-59.jetstream-cloud.org.weiming.017465.0005   \
close pilot manager                                                            \
wait for 1 pilot(s)
                                                                         timeout
                                                                              ok
session lifetime: 3341.9s                                                     ok
2017-10-26 16:09:17,037: radical.entk.resource_manager: MainProcess                     : MainThread     : ERROR   : Execution interrupted by user (you probably hit Ctrl+C), trying to exit callback thread gracefully...
2017-10-26 16:09:17,037: radical.entk.appmanager: MainProcess                     : MainThread     : ERROR   : Execution interrupted by user (you probably hit Ctrl+C), trying to cancel enqueuer thread gracefully...
2017-10-26 16:09:17,037: radical.entk.resource_manager: MainProcess                     : MainThread     : ERROR   : Could not cancel resource request, error: can't route 'control_pubsub' notification: {'cmd': 'cancel_pilots', 'arg': {'uids': ['pilot.0000'], 'pmgr': 'pmgr.0000'}}
Execution failed, error: can't route 'control_pubsub' notification: {'cmd': 'cancel_pilots', 'arg': {'uids': ['pilot.0000'], 'pmgr': 'pmgr.0000'}}
Traceback (most recent call last):
  File "script_master.py", line 84, in <module>
    appman.run()
  File "/home/weiming/virtual-python/lib/python2.7/site-packages/radical/entk/appman/appmanager.py", line 431, in run
    self._resource_manager._cancel_resource_request()
  File "/home/weiming/virtual-python/lib/python2.7/site-packages/radical/entk/execman/resource_manager.py", line 374, in _cancel_resource_request
    self._pilot.cancel()
  File "/home/weiming/virtual-python/lib/python2.7/site-packages/radical/pilot/compute_pilot.py", line 526, in cancel
    self._pmgr.cancel_pilots(self.uid)
  File "/home/weiming/virtual-python/lib/python2.7/site-packages/radical/pilot/pilot_manager.py", line 644, in cancel_pilots
    'uids' : uids}})
  File "/home/weiming/virtual-python/lib/python2.7/site-packages/radical/pilot/utils/component.py", line 1411, in publish
    raise RuntimeError("can't route '%s' notification: %s" % (pubsub, msg))
RuntimeError: can't route 'control_pubsub' notification: {'cmd': 'cancel_pilots', 'arg': {'uids': ['pilot.0000'], 'pmgr': 'pmgr.0000'}}

2017-10-26 16:09:17,038: radical.entk.resource_manager: MainProcess                     : MainThread     : ERROR   : Could not cancel resource request, error: can't route 'control_pubsub' notification: {'cmd': 'cancel_pilots', 'arg': {'uids': ['pilot.0000'], 'pmgr': 'pmgr.0000'}}
Traceback (most recent call last):
  File "script_master.py", line 89, in <module>
    appman.resource_terminate()
  File "/home/weiming/virtual-python/lib/python2.7/site-packages/radical/entk/appman/appmanager.py", line 477, in resource_terminate
    self._resource_manager._cancel_resource_request()
  File "/home/weiming/virtual-python/lib/python2.7/site-packages/radical/entk/execman/resource_manager.py", line 374, in _cancel_resource_request
    self._pilot.cancel()
  File "/home/weiming/virtual-python/lib/python2.7/site-packages/radical/pilot/compute_pilot.py", line 526, in cancel
    self._pmgr.cancel_pilots(self.uid)
  File "/home/weiming/virtual-python/lib/python2.7/site-packages/radical/pilot/pilot_manager.py", line 644, in cancel_pilots
    'uids' : uids}})
  File "/home/weiming/virtual-python/lib/python2.7/site-packages/radical/pilot/utils/component.py", line 1411, in publish
    raise RuntimeError("can't route '%s' notification: %s" % (pubsub, msg))
RuntimeError: can't route 'control_pubsub' notification: {'cmd': 'cancel_pilots', 'arg': {'uids': ['pilot.0000'], 'pmgr': 'pmgr.0000'}}

vivek-bala commented 7 years ago

It seems like your walltime is too short, or 0? The pilot goes from PMGR_ACTIVE to DONE within a millisecond:

2017-10-26 15:15:03,147: radical.entk.resource_manager: MainProcess                     : pmgr.0000.subscriber._state_sub_cb: INFO    : Pilot pilot.0000 state: PMGR_ACTIVE
2017-10-26 15:15:03,148: radical.entk.resource_manager: MainProcess                     : pmgr.0000.subscriber._state_sub_cb: INFO    : Pilot pilot.0000 state: DONE

Can you share the script that you are executing?
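The gap between the two state lines can also be measured directly from a saved log. The following is a rough sketch that assumes the log format shown in this thread (a `YYYY-MM-DD HH:MM:SS,mmm` timestamp and a trailing `Pilot <uid> state: <STATE>`); the function names are made up for illustration.

```python
import re
from datetime import datetime

# Timestamp, pilot uid and state name, as they appear in the EnTK log lines above.
STATE_LINE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}).*Pilot (\S+) state: (\S+)\s*$"
)


def pilot_transitions(lines):
    """Return (timestamp, pilot_uid, state) tuples for every state line."""
    found = []
    for line in lines:
        match = STATE_LINE.search(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
            found.append((stamp, match.group(2), match.group(3)))
    return found


def time_in_state(transitions, state):
    """Seconds between entering `state` and the next transition, or None."""
    for (t0, _, s0), (t1, _, _) in zip(transitions, transitions[1:]):
        if s0 == state:
            return (t1 - t0).total_seconds()
    return None
```

Fed the two lines quoted above, `time_in_state(..., "PMGR_ACTIVE")` gives about 0.001 seconds, which is why the pilot looks like it never really ran.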

Weiming-Hu commented 7 years ago

https://github.com/radical-collaboration/hpc-workflows/blob/master/scripts/application_AnEn/anen_base/script_master.py

vivek-bala commented 7 years ago

The script looks OK. Can you reproduce this multiple times? Can you send me the pilot folder on SuperMIC with all its files and folders (radical.pilot.sandbox/rp.session.js-156-59.jetstream-cloud.org.weiming.017465.0005/)?

Weiming-Hu commented 7 years ago

Yes. I tried this multiple times and it hangs every time.

Weiming-Hu commented 7 years ago

rp.session.js-156-59.jetstream-cloud.org.weiming.017465.0005.zip

vivek-bala commented 7 years ago

Of course. I think SuperMIC changed its default Python distribution. Let me try a couple of fixes and get back to you on this.

Weiming-Hu commented 7 years ago

Sounds good. Thank you.

vivek-bala commented 7 years ago

Resolved by https://github.com/radical-cybertools/radical.pilot/pull/1482

Weiming-Hu commented 7 years ago

Do I need to update anything? Or just rerun the script?

vivek-bala commented 7 years ago

I have created a pull request for now. You can probably check out the fix/python-on-supermic branch in RP and reinstall it (pip install . --upgrade). Then you should be able to run your scripts.

Weiming-Hu commented 7 years ago

Nice! It's proceeding. Thank you.

vivek-bala commented 7 years ago

Great! Feel free to use more than 40 cores as well!

Weiming-Hu commented 7 years ago

Unfortunately, it hangs after the preprocessing. The output files are generated correctly, but EnTK doesn't go on to the iterative computation. I'm going to attach the rp files.

Weiming-Hu commented 7 years ago

rp.session.js-156-59.jetstream-cloud.org.weiming.017465.0008.zip

Weiming-Hu commented 7 years ago

...
submit 1 unit(s)
        .Syncing task radical.entk.task.0002 with state SCHEDULED
Synced task radical.entk.task.0002 with state SCHEDULED
Syncing task radical.entk.task.0005 with state SCHEDULING
Synced task radical.entk.task.0005 with state SCHEDULING
                                                                     ok
Syncing task radical.entk.task.0005 with state SCHEDULED
Synced task radical.entk.task.0005 with state SCHEDULED
Syncing task radical.entk.task.0006 with state SUBMITTED
Synced task radical.entk.task.0006 with state SUBMITTED
Syncing task radical.entk.task.0007 with state SUBMITTING
Synced task radical.entk.task.0007 with state SUBMITTING
submit 1 unit(s)
        .State transition done
                                                                     ok
Syncing task radical.entk.task.0007 with state SUBMITTED
Synced task radical.entk.task.0007 with state SUBMITTED
Syncing task radical.entk.task.0001 with state SUBMITTING
Synced task radical.entk.task.0001 with state SUBMITTING
submit 1 unit(s)
        .                                                                     ok
Syncing task radical.entk.task.0001 with state SUBMITTED
Synced task radical.entk.task.0001 with state SUBMITTED
Syncing task radical.entk.task.0004 with state SUBMITTING
Synced task radical.entk.task.0004 with state SUBMITTING
submit 1 unit(s)
        .                                                                     ok
Syncing task radical.entk.task.0004 with state SUBMITTED
Synced task radical.entk.task.0004 with state SUBMITTED
Syncing task radical.entk.task.0003 with state SUBMITTING
Synced task radical.entk.task.0003 with state SUBMITTING
submit 1 unit(s)
        .                                                                     ok
Syncing task radical.entk.task.0003 with state SUBMITTED
Synced task radical.entk.task.0003 with state SUBMITTED
Syncing task radical.entk.task.0002 with state SUBMITTING
Synced task radical.entk.task.0002 with state SUBMITTING
submit 1 unit(s)
        .                                                                     ok
Syncing task radical.entk.task.0002 with state SUBMITTED
Synced task radical.entk.task.0002 with state SUBMITTED
Syncing task radical.entk.task.0005 with state SUBMITTING
Synced task radical.entk.task.0005 with state SUBMITTING
submit 1 unit(s)
        .                                                                     ok
Syncing task radical.entk.task.0005 with state SUBMITTED
Synced task radical.entk.task.0005 with state SUBMITTED

Weiming-Hu commented 7 years ago

This is the local rp session folder. rp.session.zip

vivek-bala commented 7 years ago

@andre-merzky I might need your help here. The tasks reach the AGENT_STAGING_OUTPUT state but go no further. Weiming uploaded both the client and the agent logs. There is staging only in the CUs. Do you see anything going wrong here?

Thanks

Weiming-Hu commented 7 years ago

Hey Vivek. I just wanted to check whether you have any updates for me, in case I missed them. Thank you.

andre-merzky commented 6 years ago

I am sorry, I missed the ping :(

This looks like a version problem. The umgr log shows this exception:

2017-10-30 16:46:01,111: umgr.0000           : task-manager                    : umgr.0000.subscriber._state_sub_cb: ERROR   : abort: TypeError("advance() got an unexpected keyword argument 'prof'",)
Traceback (most recent call last):
  File "/home/weiming/virtual-python/lib/python2.7/site-packages/radical/utils/threads.py", line 375, in _run
    if not self.work_cb():
  File "/home/weiming/virtual-python/lib/python2.7/site-packages/radical/pilot/utils/component.py", line 1097, in work_cb
    ret = self._cb(topic=topic, msg=m)
  File "/home/weiming/virtual-python/lib/python2.7/site-packages/radical/pilot/unit_manager.py", line 455, in _state_sub_cb
    if not self._update_unit(thing, publish=False):
  File "/home/weiming/virtual-python/lib/python2.7/site-packages/radical/pilot/unit_manager.py", line 491, in _update_unit
    prof=False)
TypeError: advance() got an unexpected keyword argument 'prof'

From this point on, the unit manager will not get any state updates for completed units anymore. Please make sure that RP and RU are in sync, but please let me know if the problem persists. In that case, please include the output of the radical-stack command. Thanks!
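As a rough cross-check alongside `radical-stack`, the installed versions can be listed from Python. This is a sketch, not a replacement for that command: it assumes a Python 3.8+ interpreter (for `importlib.metadata`), whereas the environment in this thread is Python 2.7, and it assumes the distributions are installed under the names `radical.utils`, `radical.pilot` and `radical.entk`. The function name `installed_versions` is made up for illustration.

```python
from importlib import metadata

# Assumed distribution names for RU, RP and EnTK.
RADICAL_PACKAGES = ["radical.utils", "radical.pilot", "radical.entk"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def report(names=RADICAL_PACKAGES):
    """Print one line per package so mismatched releases are easy to spot."""
    for name, version in installed_versions(names).items():
        print("%-15s %s" % (name, version or "not installed"))
```

Running `report()` inside the virtualenv shows at a glance whether RP and RU come from matching releases; an error such as the `advance()` TypeError above is a typical symptom when they do not.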

Weiming-Hu commented 6 years ago

Thank you for your input. Since this issue has been open for a while, I'm a bit lost. The problem is still there. What should I do? Should I reinstall EnTK entirely?

andre-merzky commented 6 years ago

Yes, you probably should recreate the whole virtualenv.

@vivek-bala, can you please advise which stack would be usable for this workload?

Weiming-Hu commented 6 years ago

Per your instructions, I recreated the virtualenv and reinstalled the RADICAL toolset from scratch. I also renewed my certificate for SuperMIC.

But I got an error.

(virtual-python)[js-156-59] weiming ~/github/hpc-workflows/scripts/application_AnEn/anen_base-->python script_master.py                                                            
Create a task for generating observation raster at time 1 flt 1
Create a task for generating observation raster at time 1 flt 2
Create a task for generating observation raster at time 1 flt 3
Create a task for generating observation raster at time 1 flt 4
Create a task for generating observation raster at time 2 flt 1
Create a task for generating observation raster at time 2 flt 2
Create a task for generating observation raster at time 2 flt 3
Create a task for generating observation raster at time 2 flt 4
new session: [rp.session.js-156-59.jetstream-cloud.org.weiming.017497.0000]    \
database   : [mongodb://138.201.86.166:27017/ee_exp_4c]                       ok
create pilot manager                                                          ok
submit 1 pilot(s)
        .                                                                     ok
2017-11-27 16:55:41,695: radical.entk.resource_manager: MainProcess                     : pmgr.0000.subscriber._state_sub_cb: ERROR   : Pilot has failed
2017-11-27 16:55:41,754: radical.entk.resource_manager: MainProcess                     : MainThread     : ERROR   : Resource request submission failed
2017-11-27 16:55:41,755: radical.entk.appmanager: MainProcess                     : MainThread     : ERROR   : Error in AppManager
Traceback (most recent call last):
  File "/home/weiming/virtual-python/lib/python2.7/site-packages/radical/entk/appman/appmanager.py", line 241, in run
    self._resource_manager._submit_resource_request()
  File "/home/weiming/virtual-python/lib/python2.7/site-packages/radical/entk/execman/resource_manager.py", line 344, in _submit_resource_request
    raise Exception
Exception

wait for 1 pilot(s)
                                                                              ok
close pilot manager                                                            \
wait for 1 pilot(s)
                                                                         timeout
                                                                              ok
session lifetime: 36.1s                                                       ok
Execution failed, error: Error: 
Traceback (most recent call last):
  File "script_master.py", line 84, in <module>
    appman.run()
  File "/home/weiming/virtual-python/lib/python2.7/site-packages/radical/entk/appman/appmanager.py", line 469, in run
    raise Error(text=ex)
Error: Error: 

2017-11-27 16:55:46,009: radical.entk.resource_manager: MainProcess                     : MainThread     : ERROR   : Could not cancel resource request, error: can't route 'control_pubsub' notification: []
Traceback (most recent call last):
  File "script_master.py", line 89, in <module>
    appman.resource_terminate()
  File "/home/weiming/virtual-python/lib/python2.7/site-packages/radical/entk/appman/appmanager.py", line 480, in resource_terminate
    self._resource_manager._cancel_resource_request()
  File "/home/weiming/virtual-python/lib/python2.7/site-packages/radical/entk/execman/resource_manager.py", line 372, in _cancel_resource_request
    self._pilot.cancel()
  File "/home/weiming/virtual-python/lib/python2.7/site-packages/radical/pilot/compute_pilot.py", line 544, in cancel
    self._pmgr.cancel_pilots(self.uid)
  File "/home/weiming/virtual-python/lib/python2.7/site-packages/radical/pilot/pilot_manager.py", line 668, in cancel_pilots
    'uids' : uids}})
  File "/home/weiming/virtual-python/lib/python2.7/site-packages/radical/pilot/utils/component.py", line 1431, in publish
    self._publishers.keys()))
RuntimeError: can't route 'control_pubsub' notification: []

Weiming-Hu commented 6 years ago

This has been resolved thanks to Vivek. Two issues were resolved: