vivek-bala / radical.entk


EnTK breaks on Radical Two #22

Closed SrinivasMushnoori closed 6 years ago

SrinivasMushnoori commented 6 years ago

This script is the RepEx3.0 implementation in the EnTK API. It worked on my previous laptop as well as my office desktop, but breaks on Radical Two.

Radical Stack:

(EnTK_0.6_env) scm177@two:~/RepEx3.0$ radical-stack

python : 2.7.12
pythonpath :
virtualenv : /home/scm177/VirtualEnvs/EnTK_0.6_env

radical.pilot : 0.47-v0.46.2-15-g62a193b5@devel
radical.utils : 0.47-v0.46-10-gc515db1@devel
saga : 0.47-v0.46-5-g74fc3811@devel

EnTK error:

2017-09-26 22:29:15,360: radical.entk.resource_manager: MainProcess : pmgr.0000.subscriber._state_sub_cb: INFO : Pilot pilot.0000 state: FAILED
2017-09-26 22:29:15,360: radical.entk.resource_manager: MainProcess : pmgr.0000.subscriber._state_sub_cb: ERROR : Pilot has failed
2017-09-26 22:29:15,363: radical.entk.resource_manager: MainProcess : MainThread : ERROR : Resource request submission failed
2017-09-26 22:29:15,364: radical.entk.appmanager: MainProcess : MainThread : ERROR : Error in AppManager
Traceback (most recent call last):
  File "/home/scm177/VirtualEnvs/EnTK_0.6_env/local/lib/python2.7/site-packages/radical/entk/appman/appmanager.py", line 237, in run
    self._resource_manager._submit_resource_request()
  File "/home/scm177/VirtualEnvs/EnTK_0.6_env/local/lib/python2.7/site-packages/radical/entk/execman/resource_manager.py", line 344, in _submit_resource_request
    raise Exception
Exception

wait for 1 pilot(s) ok
closing session rp.session.two.scm177.017435.0010 \
close pilot manager \
wait for 1 pilot(s) timeout ok
session lifetime: 12.0s ok
Traceback (most recent call last):
  File "driver_EnTK06.py", line 115, in <module>
    appman.run()
  File "/home/scm177/VirtualEnvs/EnTK_0.6_env/local/lib/python2.7/site-packages/radical/entk/appman/appmanager.py", line 465, in run
    raise Error(text=ex)
radical.entk.exceptions.Error: Error:

vivek-bala commented 6 years ago

Hey Srinivas,

I will need a few things to debug this. Can you provide me the following:

Also, are you able to run any RP-only scripts? I would suggest running one of the RP examples just to be sure, especially since I see that the pilot has failed.

SrinivasMushnoori commented 6 years ago

All the files are attached as a single compressed archive.

RP-only scripts are not running either; the jobs are failing. The pilot itself is submitted, but the individual CUs fail. BugReportEnTKVivek.zip

vivek-bala commented 6 years ago

bootstrap_1.out:

Successfully installed radical.pilot
Cleaning up...
#
# SUCCESS
#
# -------------------------------------------------------------------
purge install source at radical.pilot-0.47-v0.46.2-15-g62a193b5-devel/
/home/scm177/radical.pilot.sandbox/rp.session.two.scm177.017436.0007/bootstrap_1.sh: line 176: bc: command not found
0.0000,bootstrap_1,pilot.0000,PMGR_ACTIVE_PENDING,rp_install done,
/home/scm177/radical.pilot.sandbox/rp.session.two.scm177.017436.0007/bootstrap_1.sh: line 176: bc: command not found
0.0000,bootstrap_1,pilot.0000,PMGR_ACTIVE_PENDING,virtenv_setup end,
verify python viability: /home/scm177/radical.pilot.sandbox/ve.local.localhost.0.47/bin/python ... ok
verify module viability: saga            ...Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/scm177/radical.pilot.sandbox/rp.session.two.scm177.017436.0007/pilot.0000/rp_install/lib/python2.7/site-packages/saga/__init__.py", line 8, in <module>
    import radical.utils        as ru
  File "/home/scm177/radical.pilot.sandbox/rp.session.two.scm177.017436.0007/pilot.0000/rp_install/lib/python2.7/site-packages/radical/utils/__init__.py", line 11, in <module>
    from .plugin_manager import PluginManager
  File "/home/scm177/radical.pilot.sandbox/rp.session.two.scm177.017436.0007/pilot.0000/rp_install/lib/python2.7/site-packages/radical/utils/plugin_manager.py", line 14, in <module>
    from .logger import get_logger
  File "/home/scm177/radical.pilot.sandbox/rp.session.two.scm177.017436.0007/pilot.0000/rp_install/lib/python2.7/site-packages/radical/utils/logger.py", line 118, in <module>
    import colorama
ImportError: No module named colorama
 failed
python installation cannot load module saga - abort

I think the bootstrapper is failing on the agent side. I would recommend opening a ticket in the RP repo with the logs from a failing RP example; that will bring this issue to others' attention.
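The two missing pieces named in bootstrap_1.out (the `bc` command and the `colorama` module) can be checked for directly. The following is only a minimal sketch, assuming a Python 3 interpreter; `find_missing` is a hypothetical helper, not part of RP or EnTK. Note that it reports on the interpreter it is run with, so it only says something about the agent-side environment if it is run from that environment.

```python
import importlib.util
import shutil


def find_missing(commands, modules):
    """Return the commands not found on PATH and the top-level modules
    that the current interpreter cannot locate."""
    missing_commands = [c for c in commands if shutil.which(c) is None]
    # find_spec returns None for a top-level module that is not installed.
    missing_modules = [m for m in modules
                       if importlib.util.find_spec(m) is None]
    return {"commands": missing_commands, "modules": missing_modules}


if __name__ == "__main__":
    # `bc` and `colorama` are the names reported missing in bootstrap_1.out.
    print(find_missing(["bc"], ["colorama"]))
```

An empty list under both keys would mean neither item is missing for that interpreter and PATH.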

I am not yet sure why the bootstrapper fails, though. As a first step, I suggest deleting /home/scm177/radical.pilot.sandbox/ve.local.localhost.0.47 and trying again.
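For repeated attempts, the deletion can be scripted. This is just a sketch, assuming Python 3; `remove_stale_virtualenv` is a hypothetical helper name, and the expectation that the directory gets recreated on the next run is an assumption, not something confirmed here.

```python
import shutil
from pathlib import Path


def remove_stale_virtualenv(ve_path):
    """Delete the directory at ve_path if it exists.

    Returns True if a directory was removed, False if there was nothing
    to remove.
    """
    ve = Path(ve_path).expanduser()
    if not ve.is_dir():
        return False
    shutil.rmtree(ve)
    return True
```

For the case above, the call would be `remove_stale_virtualenv("/home/scm177/radical.pilot.sandbox/ve.local.localhost.0.47")`.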

SrinivasMushnoori commented 6 years ago

After deleting the ve.local.localhost.0.47 directory, I find that the pilot is stuck at:

2017-09-28 01:24:16,636: radical.entk.resource_manager: MainProcess : pmgr.0000.subscriber._state_sub_cb: INFO : Pilot pilot.0000 state: PMGR_ACTIVE_PENDING

for the past hour or so.

The bothersome thing is that it runs just fine on my older laptop. This seems to be limited to Radical Two?
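To see at a glance which states a pilot has passed through, the state lines can be pulled out of the EnTK log. This is a rough sketch, assuming Python 3 and assuming that state changes always appear in the `Pilot pilot.NNNN state: STATE` form quoted in this thread; `pilot_states` is a hypothetical helper.

```python
import re

# Pattern assumed from the log lines quoted in this thread, e.g.
# "... INFO : Pilot pilot.0000 state: PMGR_ACTIVE_PENDING"
PILOT_STATE = re.compile(r"Pilot (pilot\.\d+) state: (\w+)")


def pilot_states(log_lines):
    """Return (pilot id, state) pairs in the order they appear."""
    states = []
    for line in log_lines:
        for match in PILOT_STATE.finditer(line):
            states.append(match.groups())
    return states
```

Passing an open log file (which iterates line by line) works as well as a list of strings. A pilot whose last reported state stays at PMGR_ACTIVE_PENDING has not yet been reported as active.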

vivek-bala commented 6 years ago

I would run just an RP example on Radical Two and try to debug that first; it'll be simpler. I'm not sure why it's happening right now, though. Your previous session dump suggests that the installation on the agent side failed, but this one sounds different...

Can you share the logs for this run?

SrinivasMushnoori commented 6 years ago

Hi, this did finally run; it slipped my mind to update this thread. I had to wait a long while, but it eventually ran. This can be closed now.