radical-cybertools / radical.pilot

RADICAL-Pilot
http://radical-cybertools.github.io/radical-pilot/index.html

Execution hangs after "Found SSH context" #164

Closed: andre-merzky closed this issue 10 years ago

andre-merzky commented 10 years ago

In the mpi branch, I see frequent hangs when running simple examples:

2014:06:06 00:27:56 27758  MainThread   saga.saga.Session     : [DEBUG   ] adding  context : {'Type' : 'SSH'}
2014:06:06 00:27:56 radical.pilot.PilotLauncherWorker-1: [DEBUG   ] Connected to MongoDB. Serving requests for PilotManager 5390eee974df926c6e4a04e3.
2014:06:06 00:27:56 27763  MainThread   saga.saga.Session     : [DEBUG   ] adding  context : {'Type' : 'SSH', 'UserID' : 'None', 'UserKey' : 'None', 'UserPass' : 'None'}
2014:06:06 00:27:56 radical.pilot.PilotLauncherWorker-1: [DEBUG   ] Found SSH context info: {'Type' : 'SSH', 'UserID' : 'None', 'UserKey' : 'None', 'UserPass' : 'None'}.
^CTraceback (most recent call last):
  File "mpi_india_test.py", line 80, in <module>
    pilot = pmgr.submit_pilots(pdesc)
  File "/home/merzky/ve/lib/python2.7/site-packages/radical.pilot-0.14-py2.7.egg/radical/pilot/pilot_manager.py", line 313, in submit_pilots
    session=self._session)
  File "/home/merzky/ve/lib/python2.7/site-packages/radical.pilot-0.14-py2.7.egg/radical/pilot/controller/pilot_manager_controller.py", line 327, in register_start_pilot_request
    out, err = p.communicate()
  File "/usr/lib/python2.7/subprocess.py", line 799, in communicate
    return self._communicate(input)
  File "/usr/lib/python2.7/subprocess.py", line 1401, in _communicate
    stdout, stderr = self._communicate_with_poll(input)
  File "/usr/lib/python2.7/subprocess.py", line 1455, in _communicate_with_poll
    ready = poller.poll()
KeyboardInterrupt

Process PilotLauncherWorker-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/merzky/ve/lib/python2.7/site-packages/radical.pilot-0.14-py2.7.egg/radical/pilot/controller/pilot_launcher_worker.py", line 160, in run
    time.sleep(1)
KeyboardInterrupt

Subsequent runs often work ok.

marksantcroos commented 10 years ago

Never saw this in devel or elsewhere?

andre-merzky commented 10 years ago

Not that I can think of -- but then, I have not tested devel for a while. I would not mind seeing whether this simply disappears after the merge...

marksantcroos commented 10 years ago

It's hard to imagine how this could be specific to the mpi branch, but one never knows ...

andre-merzky commented 10 years ago

agree...

oleweidner commented 10 years ago

I have seen this every now and then as well, especially in the context of a bad network connection. It seems to be this line hanging:

    p = subprocess.Popen(
        ["ssh", url, "echo -n PWD: && pwd"],
        stdout=subprocess.PIPE, stderr=subprocess.PIPE
    )
    out, err = p.communicate()

I agree that this is UGLY to begin with. If someone feels eager to replace it with something more reasonable, please go ahead ;-) Something like the sketch below might help!
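For illustration only -- a minimal sketch of a more defensive variant, assuming we would rather fail fast than hang. The BatchMode/ConnectTimeout flags and the manual polling loop are suggestions here, not the fix that was eventually applied; note that BatchMode makes interactive token/password logins fail outright instead of prompting, which may not be acceptable everywhere:

    import subprocess
    import time

    def remote_pwd(url, timeout=30):
        # BatchMode=yes makes ssh fail instead of prompting for input;
        # ConnectTimeout bounds the connection setup phase.
        p = subprocess.Popen(
            ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=10",
             url, "echo -n PWD: && pwd"],
            stdout=subprocess.PIPE, stderr=subprocess.PIPE
        )
        # Python 2.7's communicate() has no timeout parameter, so poll manually.
        deadline = time.time() + timeout
        while p.poll() is None:
            if time.time() > deadline:
                p.kill()
                raise RuntimeError("ssh to %s timed out" % url)
            time.sleep(0.1)
        return p.communicate()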

andre-merzky commented 10 years ago

Then the hang may be at the ssh level -- possibly ssh is asking for user input or something. p.communicate() will not work if ssh goes into any interactive mode...

marksantcroos commented 10 years ago

Shall we fix this once we merge the OSG branch back in? Because this is obviously not a future-proof construct.

PS: Although "merging" might not be the right word, given the many changes since then.

andre-merzky commented 10 years ago

Sure...

andre-merzky commented 10 years ago

Related: I can't run remote tests from Jenkins, as I see:

2014:06:08 22:16:25 MainThread   saga.saga.Session     : [DEBUG   ] adding  context : {'Type' : 'SSH'}
2014:06:08 22:16:25 radical.pilot.MainProcess: [WARNING ] Couldn't determine remote working directory for india.futuregrid.org: 
2014:06:08 22:16:26 radical.pilot.MainProcess: [WARNING ] Couldn't determine remote working directory for india.futuregrid.org: 
2014:06:08 22:16:26 radical.pilot.MainProcess: [ERROR   ] Couldn't determine remote working directory.
Traceback (most recent call last):
  File "tests/mpi_india_test.py", line 80, in <module>
    pilot = pmgr.submit_pilots(pdesc)
  File "/var/lib/jenkins/workspace/radical.pilot/.pyenv/local/lib/python2.7/site-packages/radical/pilot/pilot_manager.py", line 313, in submit_pilots
    session=self._session)
  File "/var/lib/jenkins/workspace/radical.pilot/.pyenv/local/lib/python2.7/site-packages/radical/pilot/controller/pilot_manager_controller.py", line 341, in register_start_pilot_request
    raise Exception(error_msg)
Exception: Couldn't determine remote working directory.

which is, I think, caused by the above snippet not supporting context information on the session...
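Purely as a sketch of what "supporting context information" could mean here -- hypothetical, with the dict keys mirroring the SSH context visible in the debug logs above; this is not the code that was eventually committed:

    # Hypothetical sketch: thread the session's SSH context into the ssh
    # command line instead of relying on ambient ssh defaults.
    def build_ssh_cmd(url, context):
        cmd = ["ssh"]
        if context.get("UserKey") not in (None, "None"):
            cmd += ["-o", "IdentityFile=%s" % context["UserKey"]]
        if context.get("UserID") not in (None, "None"):
            cmd += ["-l", context["UserID"]]
        return cmd + [url, "echo -n PWD: && pwd"]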

andre-merzky commented 10 years ago

Moved to AM, fix via PTY call
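Roughly, the PTY-based probe looks like this -- a sketch inferred from the PTYShell init and run_sync lines in the debug log further down; the exact constructor arguments are an assumption:

    import saga
    import saga.utils.pty_shell as sups

    # Route the PWD probe through SAGA's PTY layer: the PTY machinery answers
    # password/token prompts from the session's contexts, so an interactive
    # ssh no longer deadlocks a raw subprocess pipe.
    session = saga.Session()  # carries the SSH contexts
    shell   = sups.PTYShell("ssh://india.futuregrid.org/", session)
    ret, out, err = shell.run_sync('echo "PWD: $PWD"')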

andre-merzky commented 10 years ago

Fixed in 06ce509 on devel

marksantcroos commented 10 years ago

I was assuming that this would save me one use of my yubikey, but that is not true. So turning this into a SAGA PTY call doesn't mean that there is connection sharing?

andre-merzky commented 10 years ago

Hmm, you should see connection sharing, i.e. only the first hop should check the yubikey -- this would be the PWD check though. What is it you see?

marksantcroos commented 10 years ago

$ python mpi_yellowstone_test.py
2014:06:11 23:21:19 radical.pilot.MainProcess: [INFO    ] radical.pilot version: 0.14
2014:06:11 23:21:23 radical.pilot.MainProcess: [INFO    ] New Session created{'database_url': 'mongodb://ec2-184-72-89-141.compute-1.amazonaws.com:27017/', 'database_name': 'radicalpilot', 'last_reconnect': None, 'uid': '5398c84fa85378300d14bf32', 'created': datetime.datetime(2014, 6, 11, 21, 21, 19, 931323)}.
2014:06:11 23:21:24 radical.pilot.MainProcess: [INFO    ] Added credential {'user_id': None, 'type': 'SSH', 'user_pass': None, 'user_key': None} to session 5398c84fa85378300d14bf32.
2014:06:11 23:21:24 radical.pilot.MainProcess: [INFO    ] Loaded resource configurations from file://localhost//Users/mark/.virtualenv/rp/lib/python2.7/site-packages/radical.pilot-0.14-py2.7.egg/radical/pilot/configs/archer.json
2014:06:11 23:21:24 radical.pilot.MainProcess: [INFO    ] Loaded resource configurations from file://localhost//Users/mark/.virtualenv/rp/lib/python2.7/site-packages/radical.pilot-0.14-py2.7.egg/radical/pilot/configs/das4.json
2014:06:11 23:21:24 radical.pilot.MainProcess: [INFO    ] Loaded resource configurations from file://localhost//Users/mark/.virtualenv/rp/lib/python2.7/site-packages/radical.pilot-0.14-py2.7.egg/radical/pilot/configs/futuregrid.json
2014:06:11 23:21:24 radical.pilot.MainProcess: [INFO    ] Loaded resource configurations from file://localhost//Users/mark/.virtualenv/rp/lib/python2.7/site-packages/radical.pilot-0.14-py2.7.egg/radical/pilot/configs/localhost.json
2014:06:11 23:21:24 radical.pilot.MainProcess: [INFO    ] Loaded resource configurations from file://localhost//Users/mark/.virtualenv/rp/lib/python2.7/site-packages/radical.pilot-0.14-py2.7.egg/radical/pilot/configs/ncar.json
2014:06:11 23:21:24 radical.pilot.MainProcess: [INFO    ] Loaded resource configurations from file://localhost//Users/mark/.virtualenv/rp/lib/python2.7/site-packages/radical.pilot-0.14-py2.7.egg/radical/pilot/configs/xsede.json
2014:06:11 23:21:25 radical.pilot.MainProcess: [DEBUG   ] Worker thread (ID: Thread-1[4380954624]) for PilotManager 5398c854a85378300d14bf33 started.
2014:06:11 23:21:25 radical.pilot.MainProcess: [DEBUG   ] PTYShell init <saga.utils.pty_shell.PTYShell object at 0x1038a8210>
2014:06:11 23:21:25 radical.pilot.MainProcess: [DEBUG   ] open master pty for [ssh] [yellowstone.ucar.edu] mark: '/usr/bin/env TERM=vt100 /usr/bin/ssh -t -o IdentityFile=/Users/mark/.ssh/id_rsa -o ControlMaster=yes -o ControlPath=/tmp/saga_ssh_mark_%h_%p.12301.ctrl yellowstone.ucar.edu'
2014:06:11 23:21:25 radical.pilot.MainProcess: [DEBUG   ] PTYProcess init <saga.utils.pty_process.PTYProcess object at 0x1038a8b90>
2014:06:11 23:21:25 radical.pilot.MainProcess: [INFO    ] running: /usr/bin/env TERM=vt100 /usr/bin/ssh -t -o IdentityFile=/Users/mark/.ssh/id_rsa -o ControlMaster=yes -o ControlPath=/tmp/saga_ssh_mark_%h_%p.12301.ctrl yellowstone.ucar.edu
2014:06:11 23:21:25 radical.pilot.PilotLauncherWorker-1: [DEBUG   ] Connected to MongoDB. Serving requests for PilotManager 5398c854a85378300d14bf33.
2014:06:11 23:21:25 radical.pilot.PilotLauncherWorker-1: [DEBUG   ] Found SSH context info: {'Type' : 'SSH', 'UserID' : 'None', 'UserKey' : 'None', 'UserPass' : 'None'}.
2014:06:11 23:21:26 radical.pilot.MainProcess: [DEBUG   ] write: [ 8] [ 51](export PS1='$' > /dev/null 2>&1 || set prompt='$'\n)
2014:06:11 23:21:26 radical.pilot.MainProcess: [DEBUG   ] write: [ 8] [ 28](printf 'HELLO_%d_SAGA\n' 1\n)
2014:06:11 23:21:26 radical.pilot.MainProcess: [DEBUG   ] read : [ 8] 15
2014:06:11 23:21:26 radical.pilot.MainProcess: [INFO    ] got token prompt
enter token:
2014:06:11 23:21:31 radical.pilot.MainProcess: [DEBUG   ] write: [ 8] 45
2014:06:11 23:21:31 radical.pilot.MainProcess: [DEBUG   ] read : [ 8] 2
2014:06:11 23:21:31 radical.pilot.MainProcess: [DEBUG   ] read : [ 8] [ 1023](Last login: Wed Jun 11 01:00:3 ... 00\n--------------------------)
2014:06:11 23:21:31 radical.pilot.MainProcess: [DEBUG   ] read : [ 8] 54
2014:06:11 23:21:32 radical.pilot.MainProcess: [DEBUG   ] write: [ 8] [ 51](export PS1='$' > /dev/null 2>&1 || set prompt='$'\n)
2014:06:11 23:21:32 radical.pilot.MainProcess: [DEBUG   ] write: [ 8] [ 28](printf 'HELLO_%d_SAGA\n' 2\n)
2014:06:11 23:21:32 radical.pilot.MainProcess: [DEBUG   ] read : [ 8] [ 37](Restoring modules to system default\n)
2014:06:11 23:21:32 radical.pilot.MainProcess: [DEBUG   ] read : [ 8] 11
2014:06:11 23:21:32 radical.pilot.MainProcess: [DEBUG   ] got initial shell prompt (5) ( Last login: Wed Jun 11 01:00:39 2014 from 24.132.227.242

Yellowstone Quick Start: www2.cisl.ucar.edu/resources/yellowstone/quick_start
Software modules: https://www2.cisl.ucar.edu/resources/yellowstone/access_userenvironment/modules
Key module commands: module list, module avail, module spider

CISL Help: cislhelp@ucar.edu -- 303-497-2400

Restoring modules to system default
-bash-4.1$ )
2014:06:11 23:21:32 radical.pilot.MainProcess: [DEBUG   ] waiting for prompt trigger HELLO_2_SAGA: (5) ( Last login: Wed Jun 11 01:00:39 2014 from 24.132.227.242

Yellowstone Quick Start: www2.cisl.ucar.edu/resources/yellowstone/quick_start
Software modules: https://www2.cisl.ucar.edu/resources/yellowstone/access_userenvironment/modules
Key module commands: module list, module avail, module spider

CISL Help: cislhelp@ucar.edu -- 303-497-2400

Restoring modules to system default
-bash-4.1$ )
2014:06:11 23:21:32 radical.pilot.MainProcess: [DEBUG   ] read : [ 8] 1
2014:06:11 23:21:32 radical.pilot.MainProcess: [DEBUG   ] got initial shell prompt (5) ($)
2014:06:11 23:21:32 radical.pilot.MainProcess: [DEBUG   ] waiting for prompt trigger HELLO_2_SAGA: (5) ($)
2014:06:11 23:21:32 radical.pilot.MainProcess: [DEBUG   ] read : [ 8] 15
2014:06:11 23:21:32 radical.pilot.MainProcess: [DEBUG   ] got shell prompt trigger (4) (HELLO_2_SAGA)
2014:06:11 23:21:32 radical.pilot.MainProcess: [DEBUG   ] got initial shell prompt (5) ( $)
2014:06:11 23:21:32 radical.pilot.MainProcess: [INFO    ] Got initial shell prompt (5) ( $)
2014:06:11 23:21:32 radical.pilot.MainProcess: [DEBUG   ] PTYProcess init <saga.utils.pty_process.PTYProcess object at 0x1038a8e10>
2014:06:11 23:21:32 radical.pilot.MainProcess: [INFO    ] running: /usr/bin/env TERM=vt100 /usr/bin/ssh -t -o IdentityFile=/Users/mark/.ssh/id_rsa -o ControlMaster=no -o ControlPath=/tmp/saga_ssh_mark_%h_%p.12301.ctrl yellowstone.ucar.edu
2014:06:11 23:21:32 radical.pilot.MainProcess: [DEBUG   ] read : [ 9] [ 1023](Last login: Wed Jun 11 15:21:3 ... 00\n--------------------------)
2014:06:11 23:21:32 radical.pilot.MainProcess: [DEBUG   ] read : [ 9] 54
2014:06:11 23:21:32 radical.pilot.MainProcess: [DEBUG   ] read : [ 9] [ 37](Restoring modules to system default\n)
2014:06:11 23:21:32 radical.pilot.MainProcess: [DEBUG   ] read : [ 9] 11
2014:06:11 23:21:32 radical.pilot.MainProcess: [DEBUG   ] got initial shell prompt (5) (Last login: Wed Jun 11 15:21:30 2014 from 24.132.227.242

Yellowstone Quick Start: www2.cisl.ucar.edu/resources/yellowstone/quick_start
Software modules: https://www2.cisl.ucar.edu/resources/yellowstone/access_userenvironment/modules
Key module commands: module list, module avail, module spider

CISL Help: cislhelp@ucar.edu -- 303-497-2400

Restoring modules to system default
-bash-4.1$ )
2014:06:11 23:21:32 radical.pilot.MainProcess: [INFO    ] Got initial shell prompt (5) (Last login: Wed Jun 11 15:21:30 2014 from 24.132.227.242

Yellowstone Quick Start: www2.cisl.ucar.edu/resources/yellowstone/quick_start
Software modules: https://www2.cisl.ucar.edu/resources/yellowstone/access_userenvironment/modules
Key module commands: module list, module avail, module spider

CISL Help: cislhelp@ucar.edu -- 303-497-2400

Restoring modules to system default
-bash-4.1$ )
2014:06:11 23:21:32 radical.pilot.MainProcess: [DEBUG   ] running command shell: exec /bin/sh -i
2014:06:11 23:21:32 radical.pilot.MainProcess: [DEBUG   ] write: [ 9] [ 30](stty -echo ; exec /bin/sh -i\n)
2014:06:11 23:21:33 radical.pilot.MainProcess: [DEBUG   ] read : [ 9] 8
2014:06:11 23:21:33 radical.pilot.MainProcess: [DEBUG   ] write: [ 9] [ 81](unset PROMPT_COMMAND ; PS1='PROMPT-$?->'; PS2=''; export PS1 PS2 2>&1 >/dev/null\n)
2014:06:11 23:21:33 radical.pilot.MainProcess: [DEBUG   ] read : [ 9] 10
2014:06:11 23:21:33 radical.pilot.MainProcess: [DEBUG   ] got new shell prompt
2014:06:11 23:21:33 radical.pilot.MainProcess: [DEBUG   ] run_sync: echo "PWD: $PWD"
2014:06:11 23:21:33 radical.pilot.MainProcess: [DEBUG   ] write: [ 9] [ 17](echo "PWD: $PWD"\n)
2014:06:11 23:21:33 radical.pilot.MainProcess: [DEBUG   ] read : [ 9] [ 39](PWD: /glade/u/home/marksant\nPROMPT-0->)
2014:06:11 23:21:33 radical.pilot.MainProcess: [DEBUG   ] Determined remote working directory for ssh://yellowstone.ucar.edu/: '/glade/u/home/marksant'
2014:06:11 23:21:33 radical.pilot.MainProcess: [DEBUG   ] PTYShell del <saga.utils.pty_shell.PTYShell object at 0x1038a8210>
2014:06:11 23:21:33 radical.pilot.MainProcess: [DEBUG   ] PTYProcess del <saga.utils.pty_process.PTYProcess object at 0x1038a8e10>
2014:06:11 23:21:33 radical.pilot.MainProcess: [DEBUG   ] Worker thread (ID: Thread-2[4385161216]) for UnitManager 5398c85da85378300d14bf35 started.
2014:06:11 23:21:33 radical.pilot.MainProcess: [INFO    ] Loaded scheduler: DirectSubmissionScheduler.
2014:06:11 23:21:34 radical.pilot.PilotLauncherWorker-1: [INFO    ] Launching ComputePilot {u'state': u'PendingLaunch', u'description': {u'project': u'URTG0003', u'resource': u'yellowstone.ucar.edu', u'queue': None, u'sandbox': None, u'cleanup': False, u'pilot_agent_priv': None, u'cores': 32, u'runtime': 15}, u'sagajobid': None, u'started': None, u'cores_per_node': None, u'output_transfer_started': None, u'sandbox': u'sftp://yellowstone.ucar.edu/glade/u/home/marksant/radical.pilot.sandbox/pilot-5398c855a85378300d14bf34/', u'submitted': datetime.datetime(2014, 6, 11, 21, 21, 33, 485000), u'unitmanager': u'5398c85da85378300d14bf35', u'output_transfer_finished': None, u'finished': None, u'pilotmanager': u'5398c854a85378300d14bf33', u'command': None, u'wu_queue': [], u'statehistory': [{u'timestamp': datetime.datetime(2014, 6, 11, 21, 21, 33, 484000), u'state': u'PendingLaunch'}], u'input_transfer_started': None, u'_id': ObjectId('5398c855a85378300d14bf34'), u'input_transfer_finished': None, u'nodes': None, u'log': []}
2014:06:11 23:21:34 radical.pilot.PilotLauncherWorker-1: [INFO    ] Using pilot agent /Users/mark/.virtualenv/rp/lib/python2.7/site-packages/radical.pilot-0.14-py2.7.egg/radical/pilot/agent/radical-pilot-agent-multicore.py
2014:06:11 23:21:34 radical.pilot.PilotLauncherWorker-1: [INFO    ] Using bootstrapper /Users/mark/.virtualenv/rp/lib/python2.7/site-packages/radical.pilot-0.14-py2.7.egg/radical/pilot/bootstrapper/default_bootstrapper.sh
2014:06:11 23:21:34 radical.pilot.PilotLauncherWorker-1: [DEBUG   ] Creating agent sandbox 'sftp://yellowstone.ucar.edu/glade/u/home/marksant/radical.pilot.sandbox/pilot-5398c855a85378300d14bf34/'.
2014:06:11 23:21:34 radical.pilot.InputFileTransferWorker-2: [DEBUG   ] Connected to MongoDB. Serving requests for UnitManager 5398c85da85378300d14bf35.
2014:06:11 23:21:34 radical.pilot.OutputFileTransferWorker-1: [DEBUG   ] Connected to MongoDB. Serving requests for UnitManager 5398c85da85378300d14bf35.
2014:06:11 23:21:34 radical.pilot.OutputFileTransferWorker-2: [DEBUG   ] Connected to MongoDB. Serving requests for UnitManager 5398c85da85378300d14bf35.
2014:06:11 23:21:34 radical.pilot.InputFileTransferWorker-1: [DEBUG   ] Connected to MongoDB. Serving requests for UnitManager 5398c85da85378300d14bf35.
2014:06:11 23:21:34 radical.pilot.MainProcess: [INFO    ] ComputePilot '5398c855a85378300d14bf34' state changed from 'PendingLaunch' to 'Launching'.
[Callback]: ComputePilot '5398c855a85378300d14bf34' state changed to Launching.
2014:06:11 23:21:34 radical.pilot.InputFileTransferWorker-2: [DEBUG   ] Found SSH context info: {'Type' : 'SSH', 'UserID' : 'None', 'UserKey' : 'None', 'UserPass' : 'None'}.
2014:06:11 23:21:34 radical.pilot.OutputFileTransferWorker-1: [DEBUG   ] Found SSH context info: {'Type' : 'SSH', 'UserID' : 'None', 'UserKey' : 'None', 'UserPass' : 'None'}.
2014:06:11 23:21:34 radical.pilot.InputFileTransferWorker-1: [DEBUG   ] Found SSH context info: {'Type' : 'SSH', 'UserID' : 'None', 'UserKey' : 'None', 'UserPass' : 'None'}.
2014:06:11 23:21:34 radical.pilot.OutputFileTransferWorker-2: [DEBUG   ] Found SSH context info: {'Type' : 'SSH', 'UserID' : 'None', 'UserKey' : 'None', 'UserPass' : 'None'}.
2014:06:11 23:21:35 radical.pilot.MainProcess: [INFO    ] Scheduled ComputeUnits ['5398c85ea85378300d14bf36'] for execution on ComputePilot '5398c855a85378300d14bf34'.
2014:06:11 23:21:35 radical.pilot.MainProcess: [INFO    ] RUN ComputeUnit '5398c85ea85378300d14bf36' state changed from 'New' to 'PendingExecution'.
[Callback]: ComputeUnit '5398c85ea85378300d14bf36' state changed to PendingExecution.
enter token:

marksantcroos commented 10 years ago

With SAGA_VERBOSE=DEBUG turned on:

$ python mpi_yellowstone_test.py 2&> double_token_prompt.txt
enter token:
enter token:
$

https://www.dropbox.com/s/3j3dm9u0gnlqxpk/double_token_prompt.txt

andre-merzky commented 10 years ago

Thanks Mark -- I think I see what's going on: the ssh master connection is shared by all connections coming from the same process (the PID of the process is part of the shared connection ID). That works nicely for multithreaded applications, as the threads share the PID -- but RP uses the multiprocessing module... I have to read up a little on whether connection sharing across processes can cause us any trouble -- not on the ssh level really, but on the saga-python level, where we use locking between threads in order to keep traffic on the ssh channel serialized. The master channel should not be affected, but as said, I need to check this.
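To make the PID dependency concrete -- a simplified sketch of how the control path is built (the real flag strings live in saga-python's pty_shell_factory.py and are quoted later in this thread; the helper name is invented):

    import os

    # ssh expands %h/%p (host/port) itself; the PID is baked in on the Python
    # side. Two multiprocessing workers therefore compute two different
    # control paths and cannot share one ssh master connection.
    def ssh_control_path(user):
        return "/tmp/saga_ssh_%s_%%h_%%p.%d.ctrl" % (user, os.getpid())

    master_flags = "-o ControlMaster=yes -o ControlPath=%s" % ssh_control_path("mark")
    slave_flags  = "-o ControlMaster=no  -o ControlPath=%s" % ssh_control_path("mark")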

marksantcroos commented 10 years ago

Ok, no rush -- I'd rather wear out the key a bit faster than introduce regressions.

andre-merzky commented 10 years ago

Mark, this should be addressed with 09cb2f2 in saga-python's devel branch. Would you mind giving it a try? I don't have a yubikey... You may need to manually close running ssh master sessions, and/or remove any cached pipes in /tmp/saga_ssh*.ctrl. Thanks!

marksantcroos commented 10 years ago

Still get two prompts. But I didn't do the thorough cleanup yet. Will try that.

andre-merzky commented 10 years ago

Hmm -- if that still persists, can you deposit a full log somewhere, please? Thanks! :)

marksantcroos commented 10 years ago

Removed all ssh files from /tmp and killed all ssh processes, still no cookie.

Full log at:

https://www.dropbox.com/s/nqu3gi4i3d0ttod/double_token_prompt2.txt

andre-merzky commented 10 years ago

Yes, it's indeed still opening two master channels, within the same thread and with the same control pipe. So this is a real bug now, and fortunately an easier one to fix :) Thanks for the log!

andre-merzky commented 10 years ago

This should be addressed in c77dd73 now -- could you please give it another try? Please clean out the connection caches beforehand again, and also, please run the test twice with a (relatively) short delay (<1 hour) in between.

Thanks! :)

marksantcroos commented 10 years ago

Great, it works!

You've just single-handedly doubled the lifetime of my yubikey (not to mention the wear on my fingerprints) ;-)

andre-merzky commented 10 years ago

:)

The cache should now actually also survive individual sessions -- how well that works depends on several system settings though...

andre-merzky commented 10 years ago

Alas, this change indeed causes some issues -- I'll need to better understand how automatic ssh channel caching works... Thus I'm reverting this commit for the time being.

andre-merzky commented 10 years ago

Mark, would you mind trying the following patch in pty_shell_factory.py on yellowstone:

_SSH_FLAGS_MASTER   = "-o ControlMaster=yes -o ControlPath=%(ctrl)s"
_SSH_FLAGS_SLAVE    = "-o ControlMaster=no   -o ControlPath=%(ctrl)s"

to

_SSH_FLAGS_MASTER   = "-o ControlMaster=auto -o ControlPath=%(ctrl)s"
_SSH_FLAGS_SLAVE    = "-o ControlMaster=auto  -o ControlPath=%(ctrl)s"

Thanks!

marksantcroos commented 10 years ago

With "auto, auto", the connection sharing works (only asks for the token once).

But MTMS file transfers block. So far the only combination that works for MTMS is "yes, no" (but then need to use token twice).

andre-merzky commented 10 years ago

Thanks for checking... Well, the clean way then is to key the channel cache on the SAGA session (instead of on the process or user ID). Let's see how ssh likes that... :P Well, I have wanted to introduce proper session IDs for a while now, for stack consistency. That's a good motivator... :)
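As a sketch of that session-keyed cache idea -- hypothetical, since such session IDs did not exist yet at this point; the example ID is the session uid from the log above:

    # Key the control socket on a SAGA session ID rather than on the PID or
    # user name, so that all threads and worker processes belonging to one
    # session reuse the same ssh master connection.
    def ssh_control_path(session_id):
        return "/tmp/saga_ssh_%s_%%h_%%p.ctrl" % session_id

    flags = "-o ControlMaster=auto -o ControlPath=%s" % ssh_control_path("5398c84fa85378300d14bf32")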

andre-merzky commented 10 years ago

I'll transfer this issue to saga-python, as this is not fixable at the RP level.

https://github.com/radical-cybertools/saga-python/issues/337