radical-collaboration / hpc-workflows

NSF16514 EarthCube Project - Award Number:1639694

Issues (?) when submitting jobs #147

Closed wjlei1990 closed 3 years ago

wjlei1990 commented 3 years ago

Hi, I keep running into this issue when using EnTK on Summit.

It does not happen every time, but it pops up now and then, and I am not sure what causes it. Could you help us figure it out?

EnTK session: re.session.login3.lei.018844.0009
Creating AppManagerSetting up RabbitMQ system                                 ok
                                                                              ok
Validating and assigning resource manager                                     ok
Setting up RabbitMQ system                                                   n/a
new session: [re.session.login3.lei.018844.0009]                               \
database   : [mongodb://hpcw-pr:****@129.114.17.185:27017/hpcw-pr]            ok
create pilot manager                                                          ok
submit 1 pilot(s)
Execution failed, error: 'NoneType' object has no attribute '_uid'
Traceback (most recent call last):
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/execman/rp/resource_manager.py", line 179, in submit_resource_request
    self._pilot = self._pmgr.submit_pilots(pdesc)
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/pilot/pilot_manager.py", line 603, in submit_pilots
    pilot = Pilot(pmgr=self, descr=pd)
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/pilot/pilot.py", line 109, in __init__
    self._resource_sandbox = self._session._get_resource_sandbox(pilot)
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/pilot/session.py", line 739, in _get_resource_sandbox
    shell = self.get_js_shell(resource, schema)
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/pilot/session.py", line 785, in get_js_shell
    shell = rsup.PTYShell(js_url, self)
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/saga/utils/pty_shell.py", line 247, in __init__
    interactive=self.interactive)
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/saga/utils/pty_shell_factory.py", line 206, in initialize
    self._initialize_pty(info['pty'], info)
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/saga/utils/pty_shell_factory.py", line 427, in _initialize_pty
    raise ptye.translate_exception (e) from e
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/saga/utils/pty_shell_factory.py", line 300, in _initialize_pty
    raise rse.NoSuccess("Could not detect shell prompt (timeout)")
radical.saga.exceptions.NoSuccess: Could not detect shell prompt (timeout) (/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/saga/utils/pty_shell_factory.py +300 (_initialize_pty)  :  raise rse.NoSuccess("Could not detect shell prompt (timeout)"))

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 438, in run
    self._rmgr.submit_resource_request()
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/execman/rp/resource_manager.py", line 216, in submit_resource_request
    raise EnTKError(ex) from ex
radical.entk.exceptions.EnTKError: Could not detect shell prompt (timeout) (/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/saga/utils/pty_shell_factory.py +300 (_initialize_pty)  :  raise rse.NoSuccess("Could not detect shell prompt (timeout)"))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run_entk.py", line 102, in main
    appman.run()
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 468, in run
    self.terminate()
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 507, in terminate
    write_session_description(self)
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/utils/prof_utils.py", line 148, in write_session_description
    tree[amgr._uid]['children'].append(wfp._uid)
AttributeError: 'NoneType' object has no attribute '_uid'
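
A note on reading the chained traceback above: the message printed last (`'NoneType' object has no attribute '_uid'`) comes from `terminate()`, while the first exception in the chain is the `NoSuccess` prompt-detection timeout. The sketch below shows one way to walk a Python exception chain back to its first exception. `root_cause` and `demo` are hypothetical helpers written for illustration, and `TimeoutError` and `RuntimeError` are stand-ins for the RADICAL exception types; none of this is part of EnTK.

```python
def root_cause(exc: BaseException) -> BaseException:
    """Follow __cause__ / __context__ links back to the first exception raised."""
    seen = set()
    while id(exc) not in seen:
        seen.add(id(exc))
        # Prefer the explicit cause ("raise ... from e"); otherwise fall back
        # to the implicit context ("During handling of the above exception").
        nxt = exc.__cause__ if exc.__cause__ is not None else exc.__context__
        if nxt is None:
            break
        exc = nxt
    return exc


def demo() -> BaseException:
    """Rebuild a chain shaped like the traceback, using stand-in exception types."""
    try:
        try:
            try:
                raise TimeoutError("Could not detect shell prompt (timeout)")
            except TimeoutError as e:
                # Wrapped and re-raised, like "raise EnTKError(ex) from ex".
                raise RuntimeError(str(e)) from e
        except RuntimeError:
            # A failure during cleanup hides the earlier errors.
            None._uid
    except AttributeError as last:
        return last
```

Passing the exception caught around `appman.run()` to a helper like this would surface the timeout rather than the cleanup error.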
wjlei1990 commented 3 years ago

Another error run:

EnTK session: re.session.login5.lei.018844.0010
Creating AppManagerSetting up RabbitMQ system                                 ok
                                                                              ok
Validating and assigning resource manager                                     ok
Setting up RabbitMQ system                                                   n/a
new session: [re.session.login5.lei.018844.0010]                               \
database   : [mongodb://hpcw-pr:****@129.114.17.185:27017/hpcw-pr]            ok
create pilot manager                                                          ok
submit 1 pilot(s)
        pilot.0000   ornl.summit          215040 cores    7680 gpus           ok
closing session re.session.login5.lei.018844.0010                              \
close pilot manager                                                            \
wait for 1 pilot(s)
              0                                                               ok
                                                                              ok
session lifetime: 59.4s                                                       ok
wait for 1 pilot(s)
              0                                                          timeout
Execution failed, error: 'NoneType' object has no attribute '_uid'
Traceback (most recent call last):
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 438, in run
    self._rmgr.submit_resource_request()
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/execman/rp/resource_manager.py", line 199, in submit_resource_request
    self._pilot.wait([rp.PMGR_ACTIVE, rp.DONE, rp.FAILED, rp.CANCELED])
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/pilot/pilot.py", line 558, in wait
    time.sleep(0.1)
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run_entk.py", line 103, in main
    appman.run()
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 462, in run
    self.terminate()
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 507, in terminate
    write_session_description(self)
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/utils/prof_utils.py", line 148, in write_session_description
    tree[amgr._uid]['children'].append(wfp._uid)
AttributeError: 'NoneType' object has no attribute '_uid'
wjlei1990 commented 3 years ago
radical-stack

  python               : /ccs/home/lei/.conda/envs/summit-entk/bin/python3
  pythonpath           : /sw/summit/xalt/1.2.1/site:/sw/summit/xalt/1.2.1/libexec
  version              : 3.7.6
  virtualenv           : summit-entk

  radical.analytics    : 1.6.7
  radical.entk         : 1.6.7
  radical.gtod         : 1.6.7
  radical.pilot        : 1.6.7
  radical.saga         : 1.6.10
  radical.utils        : 1.6.7
wjlei1990 commented 3 years ago

This issue seems to be related to my .bashrc file.

I had a very lengthy .bashrc that takes a while to load on Summit. I think Summit is somewhat slow today, so the .bashrc took longer than usual to load.

After cleaning up the .bashrc a bit, I can get most of the jobs submitted.

Thanks for the help.

Let me know if you have updates in the future.
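
For anyone who wants to check whether shell startup is the bottleneck, here is a minimal sketch that times how long a command takes to run. `time_command` is a hypothetical helper, not part of any RADICAL package. It only measures the shell on the machine where it runs, so the delay seen by the prompt detection in the traceback above may differ.

```python
import subprocess
import time


def time_command(cmd, timeout=120.0):
    """Return the wall-clock seconds taken to run `cmd` to completion."""
    start = time.perf_counter()
    subprocess.run(
        cmd,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        timeout=timeout,   # raises subprocess.TimeoutExpired if exceeded
        check=False,       # only the timing matters here, not the exit code
    )
    return time.perf_counter() - start
```

For example, `time_command(["bash", "-i", "-c", "exit"])` starts an interactive bash, which reads `~/.bashrc`, and exits right away. Running it a few times before and after trimming the file gives a rough idea of how much startup time was saved.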

andre-merzky commented 3 years ago

@wjlei1990 - you could try setting this environment variable on the client side before running your script:

export RADICAL_SAGA_PTY_SSH_TIMEOUT=60

The default timeout is 10 seconds, which may indeed be too short if your bash startup takes a long time.
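
If exporting the variable in the shell is inconvenient, the same setting could be applied from inside the driver script. This is a minimal sketch, and `set_pty_timeout` is a hypothetical helper. It assumes the variable is read from the process environment when the connection is created, so the call is best placed at the very top of the script, before any `radical` import.

```python
import os


def set_pty_timeout(seconds=60, env=None):
    """Set RADICAL_SAGA_PTY_SSH_TIMEOUT in `env` (default: os.environ)."""
    env = os.environ if env is None else env
    # Environment values must be strings.
    env["RADICAL_SAGA_PTY_SSH_TIMEOUT"] = str(seconds)
    return env["RADICAL_SAGA_PTY_SSH_TIMEOUT"]
```

In a script such as `run_entk.py`, calling `set_pty_timeout(60)` before importing `radical.entk` would be the conservative placement, since it does not depend on when the library reads the value.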