radical-collaboration / extasy-grlsd

Repository to hold the input data and scripts for the ExTASY gromacs-lsdmap work

Pilot has failed #111

Closed euhruska closed 5 years ago

euhruska commented 5 years ago

I ran into "Could not detect shell prompt (timeout)". Is there any indication why the timeout happened? Local session directory: re.session.leonardo.rice.edu.eh22.017910.0004.zip

2019-01-18 08:44:30,568: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : INFO    : Received heartbeat response
2019-01-18 08:44:30,569: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : INFO    : Sent heartbeat request
2019-01-18 08:44:31,057: radical.entk.task_manager.0000: task-manager                    : MainThread     : INFO    : Received heartbeat request
2019-01-18 08:44:31,057: radical.entk.task_manager.0000: task-manager                    : MainThread     : INFO    : Sent heartbeat response
2019-01-18 08:46:30,707: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : INFO    : Received heartbeat response
2019-01-18 08:46:30,708: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : INFO    : Sent heartbeat request
2019-01-18 08:46:31,111: radical.entk.task_manager.0000: task-manager                    : MainThread     : INFO    : Received heartbeat request
2019-01-18 08:46:31,112: radical.entk.task_manager.0000: task-manager                    : MainThread     : INFO    : Sent heartbeat response
2019-01-18 08:48:13,705: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : Terminating WFprocessor
2019-01-18 08:48:13,705: radical.entk.resource_manager.0000: MainProcess                     : pmgr.0000.subscriber._state_sub_cb: INFO    : Pilot pilot.0000 state: FAILED
2019-01-18 08:48:13,705: radical.entk.wfprocessor.0000: MainProcess                     : MainThread     : DEBUG   : Attempting to end WFprocessor... event: False
2019-01-18 08:48:13,706: radical.entk.resource_manager.0000: MainProcess                     : pmgr.0000.subscriber._state_sub_cb: ERROR   : Pilot has failed
2019-01-18 08:48:13,706: radical.entk.wfprocessor.0000: wfprocessor                     : MainThread     : INFO    : Terminating enqueue-thread
2019-01-18 08:48:13,706: radical.entk.wfprocessor.0000: wfprocessor                     : enqueue-thread : INFO    : Enqueue thread terminated
2019-01-18 08:48:13,992: radical.entk.wfprocessor.0000: wfprocessor                     : MainThread     : INFO    : Terminating dequeue-thread
2019-01-18 08:48:14,120: radical.entk.wfprocessor.0000: wfprocessor                     : dequeue-thread : INFO    : Terminated dequeue thread
2019-01-18 08:48:14,416: radical.entk.wfprocessor.0000: MainProcess                     : MainThread     : DEBUG   : WFprocessor process terminated
2019-01-18 08:48:14,418: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : Terminating synchronizer thread
2019-01-18 08:48:14,795: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : Synchronizer thread terminated
2019-01-18 08:48:14,795: radical.entk.appmanager.0000: MainProcess                     : MainThread     : INFO    : Terminating task manager process
2019-01-18 08:48:30,847: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : INFO    : Received heartbeat response
2019-01-18 08:48:30,847: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : INFO    : Sent heartbeat request
2019-01-18 08:48:31,141: radical.entk.task_manager.0000: MainProcess                     : MainThread     : INFO    : Task manager process closed
2019-01-18 08:50:31,260: radical.entk.task_manager.0000: MainProcess                     : MainThread     : INFO    : Hearbeat thread terminated
2019-01-18 08:50:39,744: radical.utils       : MainProcess                     : MainThread     : DEBUG   : lm create  pool   for gsisftp://bw.ncsa.illinois.edu/shell_file_adaptor_command_shell/ (<type 'str'>) (<radical.utils.lease_manager.LeaseManager object at 0x7f0953443190>)
2019-01-18 08:50:39,744: radical.utils       : MainProcess                     : MainThread     : DEBUG   : lm create  object for gsisftp://bw.ncsa.illinois.edu/shell_file_adaptor_command_shell/
2019-01-18 08:55:41,565: radical.utils       : MainProcess                     : MainThread     : ERROR   : Could not create lease object
Traceback (most recent call last):
  File "/scratch1/eh22/conda/envs/extasy16/lib/python2.7/site-packages/radical/utils/lease_manager.py", line 175, in _create_object
    obj = _LeaseObject (self, self._log, creator, args)
  File "/scratch1/eh22/conda/envs/extasy16/lib/python2.7/site-packages/radical/utils/lease_manager.py", line 33, in __init__
    self.obj        = creator (*args)
  File "/scratch1/eh22/conda/envs/extasy16/lib/python2.7/site-packages/saga/adaptors/shell/shell_file.py", line 284, in _shell_creator
    return sups.PTYShell(url, self.get_session(), self._logger)
  File "/scratch1/eh22/conda/envs/extasy16/lib/python2.7/site-packages/saga/utils/pty_shell.py", line 248, in __init__
    self.pty_shell  = self.factory.run_shell  (self.pty_info)
  File "/scratch1/eh22/conda/envs/extasy16/lib/python2.7/site-packages/saga/utils/pty_shell_factory.py", line 450, in run_shell
    self._initialize_pty (sh_slave, info)
  File "/scratch1/eh22/conda/envs/extasy16/lib/python2.7/site-packages/saga/utils/pty_shell_factory.py", line 413, in _initialize_pty
    raise ptye.translate_exception (e)
NoSuccess: Could not detect shell prompt (timeout) (/scratch1/eh22/conda/envs/extasy16/lib/python2.7/site-packages/saga/utils/pty_shell_factory.py +290 (_initialize_pty)  :  raise se.NoSuccess ("Could not detect shell prompt (timeout)"))
2019-01-18 08:55:42,009: radical.utils       : MainProcess                     : MainThread     : DEBUG   : lm create  object for gsisftp://bw.ncsa.illinois.edu/shell_file_adaptor_command_shell/

Possibly a separate issue: when I fetched the log files, I got the warning "logiles tarball doesnt exists":


radical-pilot-fetch-logfiles re.session.leonardo.rice.edu.eh22.017910.0004
2019-01-18 14:40:37,032: radical.pilot.utils : MainProcess                     : MainThread     : INFO    : python.interpreter   version: 2.7.14 | packaged by conda-forge | (default, Mar 30 2018, 18:16:04) [GCC 4.8.2 20140120 (Red Hat 4.8.2-15)]
2019-01-18 14:40:37,032: radical.pilot.utils : MainProcess                     : MainThread     : INFO    : radical.pilot.utils  version: 0.50.12-v0.50.12@HEAD-detached-at-v0.50.12
2019-01-18 14:40:37,033: radical.pilot.utils : MainProcess                     : MainThread     : INFO    :                      pid/tid: 6016/MainThread
2019-01-18 14:40:37,442: radical.pilot.utils : MainProcess                     : MainThread     : INFO    : Session: re.session.leonardo.rice.edu.eh22.017910.0004
2019-01-18 14:40:37,443: radical.pilot.utils : MainProcess                     : MainThread     : INFO    : Number of pilots in session: 1
2019-01-18 14:40:37,486: radical.utils       : MainProcess                     : MainThread     : INFO    : python.interpreter   version: 2.7.14 | packaged by conda-forge | (default, Mar 30 2018, 18:16:04) [GCC 4.8.2 20140120 (Red Hat 4.8.2-15)]
2019-01-18 14:40:37,487: radical.utils       : MainProcess                     : MainThread     : INFO    : radical.utils        version: 0.50.1-v0.50.1-3-g2b7f6c6@devel
2019-01-18 14:40:37,487: radical.utils       : MainProcess                     : MainThread     : INFO    :                      pid/tid: 6016/MainThread
2019-01-18 14:40:37,487: radical.utils       : MainProcess                     : MainThread     : DEBUG   : lm new manager
2019-01-18 14:40:37,533: radical.utils       : MainProcess                     : MainThread     : DEBUG   : lm create  pool   for gsisftp://bw.ncsa.illinois.edu/shell_file_adaptor_command_shell/ (<type 'str'>) (<radical.utils.lease_manager.LeaseManager object at 0x7f6225d95550>)
2019-01-18 14:40:37,533: radical.utils       : MainProcess                     : MainThread     : DEBUG   : lm create  object for gsisftp://bw.ncsa.illinois.edu/shell_file_adaptor_command_shell/
2019-01-18 14:40:42,687: radical.pilot.utils : MainProcess                     : MainThread     : WARNING : logiles tarball doesnt exists

andre-merzky commented 5 years ago

Hi Eugen, the two issues (pilot fails, fetch fails) might be related. The message "Could not detect shell prompt (timeout)" means exactly that: some layer attempted to connect to the target host but could not find a prompt after opening a shell. So either the connection got stuck completely, or the target system was so slow (memory swapping, file system issues, etc.) that the shell prompt detection timed out.
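To illustrate what a prompt-detection timeout looks like in general, here is a minimal sketch. It is not the SAGA `pty_shell_factory` code: the function name `wait_for_prompt`, the prompt regex, and the chunk size are all assumptions chosen for the example, and it relies on POSIX `select` working on pipe descriptors.

```python
import os
import re
import select
import time

# Assumed prompt pattern for illustration only (ends in $, # or >).
PROMPT_RE = re.compile(rb"[\$#>]\s*$")


def wait_for_prompt(fd, timeout, prompt_re=PROMPT_RE):
    """Read from fd until the output ends in a prompt, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    buf = b""
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            # Nothing that looks like a prompt arrived in time.
            raise TimeoutError("Could not detect shell prompt (timeout)")
        ready, _, _ = select.select([fd], [], [], remaining)
        if not ready:
            continue  # the next iteration re-checks the deadline
        chunk = os.read(fd, 4096)
        if not chunk:
            raise EOFError("connection closed before a prompt appeared")
        buf += chunk
        if prompt_re.search(buf):
            return buf
```

A stuck connection and a very slow remote host look the same from this side: in both cases no prompt-like output arrives before the deadline, which is why the error message alone cannot tell the two apart.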

Could you please try once more? If the same problem persists, can you confirm that you can log in to the machine interactively? Which machine is the target?

Thanks, Andre.
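The login check suggested above can also be scripted with a hard timeout, so a hanging connection shows up as a clear failure instead of a stalled terminal. This is only a sketch: `responds_in_time` is a hypothetical helper, and the `gsissh` command in the comment is an assumption based on the `gsisftp://` URL in the log.

```python
import subprocess
import sys


def responds_in_time(cmd, timeout):
    """Return True if cmd exits with status 0 within timeout seconds."""
    try:
        result = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False  # the command hung, e.g. no prompt or no response
    except OSError:
        return False  # the command could not be started at all
    return result.returncode == 0


if __name__ == "__main__":
    # Local smoke test that needs no remote host.
    print(responds_in_time([sys.executable, "-c", "pass"], timeout=10))
    # Hypothetical remote check (command and host are assumptions):
    # responds_in_time(["gsissh", "bw.ncsa.illinois.edu", "true"], timeout=30)
```

Running a trivial remote command such as `true` keeps the check independent of the workload, so a failure points at the connection or login step.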

euhruska commented 5 years ago

I can log in to Blue Waters without problems. I restarted the client host and it appears to work now. Maybe something was not cleaned up properly on the client host side. I will reopen if this happens again.