Closed Weiming-Hu closed 6 years ago
Hey Weiming, can you rerun it with verbose logging turned on, please? Set export RADICAL_PILOT_VERBOSE=INFO before you run the script. Let's pick this up when we talk today.
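If it is easier than exporting the variable by hand, here is a minimal Python sketch that launches a script with RADICAL_PILOT_VERBOSE set. The helper names verbose_env and run_verbose are made up for illustration, and the sketch assumes the variable is read from the environment of the process that runs the workflow script.

```python
import os
import subprocess
import sys


def verbose_env(level="INFO", base=None):
    """Return a copy of the environment with RADICAL_PILOT_VERBOSE set."""
    env = dict(os.environ if base is None else base)
    env["RADICAL_PILOT_VERBOSE"] = level
    return env


def run_verbose(script, level="INFO"):
    """Run `script` with the current interpreter and the verbose setting.

    Hypothetical helper: returns the exit code of the child process.
    """
    return subprocess.call([sys.executable, script], env=verbose_env(level))
```

Calling run_verbose("script_master.py") from the script's directory should then be equivalent to exporting the variable in the shell and running the script directly.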
OK. So I'll set the verbose info and try again. Thank you.
This has been resolved with the help from Vivek.
Notes: to restart RabbitMQ, the commands are as follows:
rabbitmqctl stop_app
rabbitmqctl reset
rabbitmqctl start_app
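The same three commands can be wrapped in a small Python sketch so that a failure in one step stops the sequence. The function name reset_rabbitmq and the injectable run parameter are illustrative assumptions, not part of RabbitMQ or EnTK. Keep in mind that rabbitmqctl reset returns the node to a blank state, so queues and users configured on it are lost.

```python
import subprocess

# The three commands from the notes above, in the order they must run.
RESET_COMMANDS = (
    ["rabbitmqctl", "stop_app"],
    ["rabbitmqctl", "reset"],
    ["rabbitmqctl", "start_app"],
)


def reset_rabbitmq(run=subprocess.check_call):
    """Run the reset sequence in order.

    `run` defaults to subprocess.check_call, which raises
    CalledProcessError on a non-zero exit code, so a failed step
    prevents the later steps from running.
    """
    for cmd in RESET_COMMANDS:
        run(list(cmd))
```

Passing a different run callable (for example one that only prints the command) gives a dry run without touching the broker.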
I had a quick chat with Manuel here, who is more familiar with Docker. His advice was to create a new Docker container every time, simply because stopping RabbitMQ (when it runs inside a Docker container) might kill the container itself. So I recommend creating a new container in case you see that issue again. The command you currently use picks the host port at random.
You can pin it to 32773 as follows (be sure to kill the older container first):
docker run -d --name rabbit-1 -p 32773:5672 rabbitmq:3
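Before rerunning the workflow, it can help to confirm that something is actually listening on the mapped port. Below is a minimal Python sketch using only the standard library; port_is_open is a hypothetical helper name, and a successful connection only shows that the port accepts TCP connections, not that RabbitMQ is healthy or that EnTK is configured to use this port.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

With the container started as above, port_is_open("localhost", 32773) should return True on the machine that runs Docker.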
Hi Vivek, it looks like the same problem has occurred again. I stopped and removed the RabbitMQ container and created a new one, but it didn't work.
(virtual-python)[js-156-59] weiming ~/github/hpc-workflows/scripts/application_AnEn/anen_base-->python script_master.py
2017-10-26 15:13:33,531: radical.pilot : MainProcess : MainThread : INFO : python.interpreter version: 2.7.5 (default, Aug 4 2017, 00:39:18) [GCC 4.8.5 20150623 (Red Hat 4.8.5-16)]
2017-10-26 15:13:33,531: radical.pilot : MainProcess : MainThread : INFO : pid: 8843
2017-10-26 15:13:33,531: radical.pilot : MainProcess : MainThread : INFO : tid: MainThread
2017-10-26 15:13:33,677: radical.entk.task_processor: MainProcess : MainThread : INFO : python.interpreter version: 2.7.5 (default, Aug 4 2017, 00:39:18) [GCC 4.8.5 20150623 (Red Hat 4.8.5-16)]
2017-10-26 15:13:33,677: radical.entk.task_processor: MainProcess : MainThread : INFO : pid: 8843
2017-10-26 15:13:33,677: radical.entk.task_processor: MainProcess : MainThread : INFO : tid: MainThread
2017-10-26 15:13:34,051: radical.entk.resource_manager: MainProcess : MainThread : INFO : python.interpreter version: 2.7.5 (default, Aug 4 2017, 00:39:18) [GCC 4.8.5 20150623 (Red Hat 4.8.5-16)]
2017-10-26 15:13:34,051: radical.entk.resource_manager: MainProcess : MainThread : INFO : pid: 8843
2017-10-26 15:13:34,051: radical.entk.resource_manager: MainProcess : MainThread : INFO : tid: MainThread
2017-10-26 15:13:34,157: radical.entk.resource_manager: MainProcess : MainThread : INFO : Resource Manager initialized
2017-10-26 15:13:34,157: radical.entk.resource_manager: MainProcess : MainThread : INFO : Resource description validated
2017-10-26 15:13:34,158: radical.entk.appmanager: MainProcess : MainThread : INFO : python.interpreter version: 2.7.5 (default, Aug 4 2017, 00:39:18) [GCC 4.8.5 20150623 (Red Hat 4.8.5-16)]
2017-10-26 15:13:34,158: radical.entk.appmanager: MainProcess : MainThread : INFO : pid: 8843
2017-10-26 15:13:34,158: radical.entk.appmanager: MainProcess : MainThread : INFO : tid: MainThread
2017-10-26 15:13:34,264: radical.entk.appmanager: MainProcess : MainThread : INFO : Application Manager initialized
Create a task for generating observation raster at time 1 flt 1
Create a task for generating observation raster at time 1 flt 2
Create a task for generating observation raster at time 1 flt 3
Create a task for generating observation raster at time 1 flt 4
Create a task for generating observation raster at time 2 flt 1
Create a task for generating observation raster at time 2 flt 2
Create a task for generating observation raster at time 2 flt 3
Create a task for generating observation raster at time 2 flt 4
2017-10-26 15:13:34,267: radical.entk.appmanager: MainProcess : MainThread : INFO : Workflow assigned to Application Manager
2017-10-26 15:13:34,267: radical.entk.appmanager: MainProcess : MainThread : INFO : Setting up RabbitMQ system
2017-10-26 15:13:34,495: radical.entk.appmanager: MainProcess : MainThread : INFO : Starting resource request submission
new session: [rp.session.js-156-59.jetstream-cloud.org.weiming.017465.0005] \
database : [mongodb://138.201.86.166:27017/ee_exp_4c] ok
create pilot manager ok
create pilot description [xsede.supermic:40] ok
submit 1 pilot(s)
.2017-10-26 15:13:42,053: radical.entk.resource_manager: MainProcess : MainThread : INFO : Pilot pilot.0000 state: PMGR_LAUNCHING_PENDING
ok
2017-10-26 15:13:42,055: radical.pilot.utils : MainProcess : MainThread : INFO : python.interpreter version: 2.7.5 (default, Aug 4 2017, 00:39:18) [GCC 4.8.5 20150623 (Red Hat 4.8.5-16)]
2017-10-26 15:13:42,055: radical.pilot.utils : MainProcess : MainThread : INFO : radical.pilot.utils version: 0.47-v0.46.2-16-g37aa40b@devel
2017-10-26 15:13:42,055: radical.pilot.utils : MainProcess : MainThread : INFO : pid: 8843
2017-10-26 15:13:42,056: radical.pilot.utils : MainProcess : MainThread : INFO : tid: MainThread
2017-10-26 15:13:42,056: radical.entk.resource_manager: MainProcess : MainThread : INFO : Resource request submission successful.. waiting for pilot to go Active
2017-10-26 15:13:42,059: radical.entk.resource_manager: MainProcess : pmgr.0000.subscriber._state_sub_cb: INFO : Pilot pilot.0000 state: PMGR_LAUNCHING
2017-10-26 15:14:02,959: radical.entk.resource_manager: MainProcess : pmgr.0000.subscriber._state_sub_cb: INFO : Pilot pilot.0000 state: PMGR_ACTIVE_PENDING
2017-10-26 15:15:03,147: radical.entk.resource_manager: MainProcess : pmgr.0000.subscriber._state_sub_cb: INFO : Pilot pilot.0000 state: PMGR_ACTIVE
2017-10-26 15:15:03,148: radical.entk.resource_manager: MainProcess : pmgr.0000.subscriber._state_sub_cb: INFO : Pilot pilot.0000 state: DONE
^Cclosing session rp.session.js-156-59.jetstream-cloud.org.weiming.017465.0005 \
close pilot manager \
wait for 1 pilot(s)
timeout
ok
session lifetime: 3341.9s ok
2017-10-26 16:09:17,037: radical.entk.resource_manager: MainProcess : MainThread : ERROR : Execution interrupted by user (you probably hit Ctrl+C), trying to exit callback thread gracefully...
2017-10-26 16:09:17,037: radical.entk.appmanager: MainProcess : MainThread : ERROR : Execution interrupted by user (you probably hit Ctrl+C), trying to cancel enqueuer thread gracefully...
2017-10-26 16:09:17,037: radical.entk.resource_manager: MainProcess : MainThread : ERROR : Could not cancel resource request, error: can't route 'control_pubsub' notification: {'cmd': 'cancel_pilots', 'arg': {'uids': ['pilot.0000'], 'pmgr': 'pmgr.0000'}}
Execution failed, error: can't route 'control_pubsub' notification: {'cmd': 'cancel_pilots', 'arg': {'uids': ['pilot.0000'], 'pmgr': 'pmgr.0000'}}
Traceback (most recent call last):
File "script_master.py", line 84, in <module>
appman.run()
File "/home/weiming/virtual-python/lib/python2.7/site-packages/radical/entk/appman/appmanager.py", line 431, in run
self._resource_manager._cancel_resource_request()
File "/home/weiming/virtual-python/lib/python2.7/site-packages/radical/entk/execman/resource_manager.py", line 374, in _cancel_resource_request
self._pilot.cancel()
File "/home/weiming/virtual-python/lib/python2.7/site-packages/radical/pilot/compute_pilot.py", line 526, in cancel
self._pmgr.cancel_pilots(self.uid)
File "/home/weiming/virtual-python/lib/python2.7/site-packages/radical/pilot/pilot_manager.py", line 644, in cancel_pilots
'uids' : uids}})
File "/home/weiming/virtual-python/lib/python2.7/site-packages/radical/pilot/utils/component.py", line 1411, in publish
raise RuntimeError("can't route '%s' notification: %s" % (pubsub, msg))
RuntimeError: can't route 'control_pubsub' notification: {'cmd': 'cancel_pilots', 'arg': {'uids': ['pilot.0000'], 'pmgr': 'pmgr.0000'}}
2017-10-26 16:09:17,038: radical.entk.resource_manager: MainProcess : MainThread : ERROR : Could not cancel resource request, error: can't route 'control_pubsub' notification: {'cmd': 'cancel_pilots', 'arg': {'uids': ['pilot.0000'], 'pmgr': 'pmgr.0000'}}
Traceback (most recent call last):
File "script_master.py", line 89, in <module>
appman.resource_terminate()
File "/home/weiming/virtual-python/lib/python2.7/site-packages/radical/entk/appman/appmanager.py", line 477, in resource_terminate
self._resource_manager._cancel_resource_request()
File "/home/weiming/virtual-python/lib/python2.7/site-packages/radical/entk/execman/resource_manager.py", line 374, in _cancel_resource_request
self._pilot.cancel()
File "/home/weiming/virtual-python/lib/python2.7/site-packages/radical/pilot/compute_pilot.py", line 526, in cancel
self._pmgr.cancel_pilots(self.uid)
File "/home/weiming/virtual-python/lib/python2.7/site-packages/radical/pilot/pilot_manager.py", line 644, in cancel_pilots
'uids' : uids}})
File "/home/weiming/virtual-python/lib/python2.7/site-packages/radical/pilot/utils/component.py", line 1411, in publish
raise RuntimeError("can't route '%s' notification: %s" % (pubsub, msg))
RuntimeError: can't route 'control_pubsub' notification: {'cmd': 'cancel_pilots', 'arg': {'uids': ['pilot.0000'], 'pmgr': 'pmgr.0000'}}
It seems like your walltime is short or 0?
2017-10-26 15:15:03,147: radical.entk.resource_manager: MainProcess : pmgr.0000.subscriber._state_sub_cb: INFO : Pilot pilot.0000 state: PMGR_ACTIVE
2017-10-26 15:15:03,148: radical.entk.resource_manager: MainProcess : pmgr.0000.subscriber._state_sub_cb: INFO : Pilot pilot.0000 state: DONE
Can you share the script that you are executing?
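To check how long a pilot stayed active without reading the log by eye, the state lines can be parsed as in the sketch below. It assumes the log format shown in this thread (timestamp, then "Pilot <uid> state: <STATE>"); the function names are made up for illustration.

```python
import re
from datetime import datetime

# Matches lines such as:
# 2017-10-26 15:15:03,147: ... INFO : Pilot pilot.0000 state: PMGR_ACTIVE
STATE_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}):.*Pilot (\S+) state: (\S+)"
)


def pilot_transitions(lines):
    """Return (timestamp, pilot uid, state) for every pilot state line."""
    out = []
    for line in lines:
        m = STATE_RE.search(line)
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")
            out.append((ts, m.group(2), m.group(3)))
    return out


def seconds_active(transitions, pilot):
    """Seconds between PMGR_ACTIVE and DONE, or None if either is missing."""
    times = {state: ts for ts, uid, state in transitions if uid == pilot}
    if "PMGR_ACTIVE" not in times or "DONE" not in times:
        return None
    return (times["DONE"] - times["PMGR_ACTIVE"]).total_seconds()
```

For the two lines quoted above, the pilot goes from PMGR_ACTIVE to DONE within about a millisecond, which is why the walltime looked suspicious.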
The script looks OK. Can you reproduce this multiple times? Can you send me the pilot folder on SuperMIC with all its files and folders (radical.pilot.sandbox/rp.session.js-156-59.jetstream-cloud.org.weiming.017465.0005/)?
Yes. I tried this multiple times and it all hangs...
Of course. I think superMIC changed their default python distribution. Let me try a couple of fixes and get back to you on this.
Sounds good. Thank you.
Do I need to update anything? Or just rerun the script?
I have created a pull request for now. You can check out the fix/python-on-supermic branch in RP and reinstall it (pip install . --upgrade). Then you should be able to run your scripts.
Nice! It's proceeding. Thank you.
Great! Feel free to use more than 40 cores as well!
Unfortunately, it hangs after the preprocessing. The output files are generated correctly, but EnTK doesn't go on to the iterative computation. I'm going to attach the RP files.
...
submit 1 unit(s)
.Syncing task radical.entk.task.0002 with state SCHEDULED
Synced task radical.entk.task.0002 with state SCHEDULED
Syncing task radical.entk.task.0005 with state SCHEDULING
Synced task radical.entk.task.0005 with state SCHEDULING
ok
Syncing task radical.entk.task.0005 with state SCHEDULED
Synced task radical.entk.task.0005 with state SCHEDULED
Syncing task radical.entk.task.0006 with state SUBMITTED
Synced task radical.entk.task.0006 with state SUBMITTED
Syncing task radical.entk.task.0007 with state SUBMITTING
Synced task radical.entk.task.0007 with state SUBMITTING
submit 1 unit(s)
.State transition done
ok
Syncing task radical.entk.task.0007 with state SUBMITTED
Synced task radical.entk.task.0007 with state SUBMITTED
Syncing task radical.entk.task.0001 with state SUBMITTING
Synced task radical.entk.task.0001 with state SUBMITTING
submit 1 unit(s)
. ok
Syncing task radical.entk.task.0001 with state SUBMITTED
Synced task radical.entk.task.0001 with state SUBMITTED
Syncing task radical.entk.task.0004 with state SUBMITTING
Synced task radical.entk.task.0004 with state SUBMITTING
submit 1 unit(s)
. ok
Syncing task radical.entk.task.0004 with state SUBMITTED
Synced task radical.entk.task.0004 with state SUBMITTED
Syncing task radical.entk.task.0003 with state SUBMITTING
Synced task radical.entk.task.0003 with state SUBMITTING
submit 1 unit(s)
. ok
Syncing task radical.entk.task.0003 with state SUBMITTED
Synced task radical.entk.task.0003 with state SUBMITTED
Syncing task radical.entk.task.0002 with state SUBMITTING
Synced task radical.entk.task.0002 with state SUBMITTING
submit 1 unit(s)
. ok
Syncing task radical.entk.task.0002 with state SUBMITTED
Synced task radical.entk.task.0002 with state SUBMITTED
Syncing task radical.entk.task.0005 with state SUBMITTING
Synced task radical.entk.task.0005 with state SUBMITTING
submit 1 unit(s)
. ok
Syncing task radical.entk.task.0005 with state SUBMITTED
Synced task radical.entk.task.0005 with state SUBMITTED
This is the local rp session folder. rp.session.zip
@andre-merzky I might need your help here. The tasks reach the AGENT_STAGING_OUTPUT state but do not get any further. Weiming uploaded both the client and the agent logs. There is staging only in the CUs. Do you see anything going wrong here?
Thanks
Hey Vivek. Just would like to check if you have any updates for me in case I missed them. Thank you.
I am sorry, I missed the ping :(
This looks like a version problem. The umgr log shows this exception:
2017-10-30 16:46:01,111: umgr.0000 : task-manager : umgr.0000.subscriber._state_sub_cb: ERROR : abort: TypeError("advance() got an unexpected keyword argument 'prof'",)
Traceback (most recent call last):
File "/home/weiming/virtual-python/lib/python2.7/site-packages/radical/utils/threads.py", line 375, in _run
if not self.work_cb():
File "/home/weiming/virtual-python/lib/python2.7/site-packages/radical/pilot/utils/component.py", line 1097, in work_cb
ret = self._cb(topic=topic, msg=m)
File "/home/weiming/virtual-python/lib/python2.7/site-packages/radical/pilot/unit_manager.py", line 455, in _state_sub_cb
if not self._update_unit(thing, publish=False):
File "/home/weiming/virtual-python/lib/python2.7/site-packages/radical/pilot/unit_manager.py", line 491, in _update_unit
prof=False)
TypeError: advance() got an unexpected keyword argument 'prof'
From this point on, the unit manager will not get any state updates for completed units anymore. Please make sure that RP and RU are in sync, but please let me know if the problem persists. In that case, please include the output of the radical-stack command. Thanks!
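As a rough stand-in for comparing the installed versions (radical-stack remains the tool to use for a report), the sketch below lists the versions of the RADICAL packages found in the current environment. It assumes Python 3.8 or newer for importlib.metadata, whereas the environment in this thread runs Python 2.7, so treat it as an illustration only; installed_versions is a hypothetical helper name.

```python
from importlib import metadata

# Distribution names as published on PyPI.
RADICAL_PACKAGES = ("radical.utils", "radical.pilot", "radical.entk")


def installed_versions(names=RADICAL_PACKAGES):
    """Map each distribution name to its installed version, or None."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions
```

A mismatch such as a devel build of RP next to an older released RU is the kind of combination that can produce an unexpected-keyword error like the one in the traceback.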
Thank you for your input. Since this issue has been here for a while, I'm kind of lost here. The issue is still there. So what should I do? Should I entirely reinstall ENTK?
Yes, you probably should recreate the whole virtualenv.
@vivek-bala, can you please advise which stack would be usable for this workload?
Per your instructions, I recreated the virtualenv, and reinstalled the RADICAL toolset from scratch. I also renewed my certificate to SuperMIC.
But I got an error.
(virtual-python)[js-156-59] weiming ~/github/hpc-workflows/scripts/application_AnEn/anen_base-->python script_master.py
Create a task for generating observation raster at time 1 flt 1
Create a task for generating observation raster at time 1 flt 2
Create a task for generating observation raster at time 1 flt 3
Create a task for generating observation raster at time 1 flt 4
Create a task for generating observation raster at time 2 flt 1
Create a task for generating observation raster at time 2 flt 2
Create a task for generating observation raster at time 2 flt 3
Create a task for generating observation raster at time 2 flt 4
new session: [rp.session.js-156-59.jetstream-cloud.org.weiming.017497.0000] \
database : [mongodb://138.201.86.166:27017/ee_exp_4c] ok
create pilot manager ok
submit 1 pilot(s)
. ok
2017-11-27 16:55:41,695: radical.entk.resource_manager: MainProcess : pmgr.0000.subscriber._state_sub_cb: ERROR : Pilot has failed
2017-11-27 16:55:41,754: radical.entk.resource_manager: MainProcess : MainThread : ERROR : Resource request submission failed
2017-11-27 16:55:41,755: radical.entk.appmanager: MainProcess : MainThread : ERROR : Error in AppManager
Traceback (most recent call last):
File "/home/weiming/virtual-python/lib/python2.7/site-packages/radical/entk/appman/appmanager.py", line 241, in run
self._resource_manager._submit_resource_request()
File "/home/weiming/virtual-python/lib/python2.7/site-packages/radical/entk/execman/resource_manager.py", line 344, in _submit_resource_request
raise Exception
Exception
wait for 1 pilot(s)
ok
close pilot manager \
wait for 1 pilot(s)
timeout
ok
session lifetime: 36.1s ok
Execution failed, error: Error:
Traceback (most recent call last):
File "script_master.py", line 84, in <module>
appman.run()
File "/home/weiming/virtual-python/lib/python2.7/site-packages/radical/entk/appman/appmanager.py", line 469, in run
raise Error(text=ex)
Error: Error:
2017-11-27 16:55:46,009: radical.entk.resource_manager: MainProcess : MainThread : ERROR : Could not cancel resource request, error: can't route 'control_pubsub' notification: []
Traceback (most recent call last):
File "script_master.py", line 89, in <module>
appman.resource_terminate()
File "/home/weiming/virtual-python/lib/python2.7/site-packages/radical/entk/appman/appmanager.py", line 480, in resource_terminate
self._resource_manager._cancel_resource_request()
File "/home/weiming/virtual-python/lib/python2.7/site-packages/radical/entk/execman/resource_manager.py", line 372, in _cancel_resource_request
self._pilot.cancel()
File "/home/weiming/virtual-python/lib/python2.7/site-packages/radical/pilot/compute_pilot.py", line 544, in cancel
self._pmgr.cancel_pilots(self.uid)
File "/home/weiming/virtual-python/lib/python2.7/site-packages/radical/pilot/pilot_manager.py", line 668, in cancel_pilots
'uids' : uids}})
File "/home/weiming/virtual-python/lib/python2.7/site-packages/radical/pilot/utils/component.py", line 1431, in publish
self._publishers.keys()))
RuntimeError: can't route 'control_pubsub' notification: []
This has been resolved thanks to Vivek. Two issues were resolved:
module load python needs to be edited in the virtual environment
Hi,
This is the place where EnTK has got stuck in my last 3 attempts...
After one hour, it still sits at the same place.
My allocation setting is up to date with the scripts in the repo if you'd like to check it.
Thank you.