radical-cybertools / radical.pilot

RADICAL-Pilot
http://radical-cybertools.github.io/radical-pilot/index.html

SSH error on workflow machine, tutorial branch #1074

Closed solejar closed 8 years ago

solejar commented 8 years ago

I got most of the example scripts working on Stampede yesterday without error. That was in the office on a wired connection. I am now trying to run them from home on a wireless connection, and I am getting an SSH "Permission denied" error. I do not know where this error comes from, as I thought the workflow machine automatically allowed passwordless SSH into XSEDE machines. I tried SSHing into Stampede manually and got the same permission denial. The error occurs at the pilot submission stage.

2016-07-12 13:38:25,343: radical.pilot       : MainProcess                     : MainThread     : INFO    : python.interpreter   version: 2.7.5 (default, Nov 20 2015, 02:00:19) [GCC 4.8.5 20150623 (Red Hat 4.8.5-4)]
2016-07-12 13:38:25,343: radical.pilot       : MainProcess                     : MainThread     : INFO    :                      pid: 19101
2016-07-12 13:38:25,343: radical.pilot       : MainProcess                     : MainThread     : INFO    :                      tid: MainThread
2016-07-12 13:38:25,344: radical.pilot       : MainProcess                     : MainThread     : INFO    : radical.pilot        version: v0.40.3-14-g79d3d61@tutorial-xsede16
2016-07-12 13:38:25,636: radical.pilot       : MainProcess                     : MainThread     : WARNING : using default dburl mongodb://rp:rp@ds015335.mlab.com:15335/rp
2016-07-12 13:38:25,636: radical.pilot       : MainProcess                     : MainThread     : INFO    : using database mongodb://rp:rp@ds015335.mlab.com:15335/rp
2016-07-12 13:38:25,862: radical.pilot       : MainProcess                     : MainThread     : INFO    : New Session created: {'database_url': 'mongodb://rp:rp@ds015335.mlab.com:15335/rp', 'connected': 1468345105.810274, 'uid': 'rp.session.workflow.iu.xsede.org.solejar.016994.0003', 'closed': None, 'created': 1468345105.810274}.
2016-07-12 13:38:25,863: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations from /home/solejar/ve/lib/python2.7/site-packages/radical/pilot/configs/resource_aliases.json
2016-07-12 13:38:25,865: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations from /home/solejar/ve/lib/python2.7/site-packages/radical/pilot/configs/resource_das4.json
2016-07-12 13:38:25,874: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations for das4.fs2
2016-07-12 13:38:25,876: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations from /home/solejar/ve/lib/python2.7/site-packages/radical/pilot/configs/resource_epsrc.json
2016-07-12 13:38:25,887: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations for epsrc.archer
2016-07-12 13:38:25,889: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations for epsrc.archer_orte
2016-07-12 13:38:25,890: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations from /home/solejar/ve/lib/python2.7/site-packages/radical/pilot/configs/resource_iu.json
2016-07-12 13:38:25,900: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations for iu.bigred2
2016-07-12 13:38:25,901: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations for iu.bigred2_ccm
2016-07-12 13:38:25,902: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations from /home/solejar/ve/lib/python2.7/site-packages/radical/pilot/configs/resource_local.json
2016-07-12 13:38:25,916: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations for local.localhost_anaconda
2016-07-12 13:38:25,917: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations for local.localhost_yarn
2016-07-12 13:38:25,924: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations for lrz.supermuc
2016-07-12 13:38:25,926: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations from /home/solejar/ve/lib/python2.7/site-packages/radical/pilot/configs/resource_ncar.json
2016-07-12 13:38:25,930: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations for ncar.yellowstone
2016-07-12 13:38:25,931: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations from /home/solejar/ve/lib/python2.7/site-packages/radical/pilot/configs/resource_nersc.json
2016-07-12 13:38:25,962: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations for nersc.edison
2016-07-12 13:38:25,964: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations for nersc.edison_aprun
2016-07-12 13:38:25,965: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations for nersc.hopper_ccm
2016-07-12 13:38:25,966: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations for nersc.hopper_aprun
2016-07-12 13:38:25,967: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations for nersc.edison_ccm
2016-07-12 13:38:25,969: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations for nersc.hopper
2016-07-12 13:38:25,970: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations from /home/solejar/ve/lib/python2.7/site-packages/radical/pilot/configs/resource_ornl.json
2016-07-12 13:38:25,976: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations for ornl.titan
2016-07-12 13:38:25,978: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations from /home/solejar/ve/lib/python2.7/site-packages/radical/pilot/configs/resource_osg.json
2016-07-12 13:38:25,987: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations for osg.xsede-virt-clust
2016-07-12 13:38:25,988: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations for osg.connect
2016-07-12 13:38:25,989: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations from /home/solejar/ve/lib/python2.7/site-packages/radical/pilot/configs/resource_radical.json
2016-07-12 13:38:25,994: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations for radical.tutorial
2016-07-12 13:38:25,996: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations from /home/solejar/ve/lib/python2.7/site-packages/radical/pilot/configs/resource_rice.json
2016-07-12 13:38:26,004: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations for rice.biou
2016-07-12 13:38:26,006: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations for rice.davinci
2016-07-12 13:38:26,007: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations from /home/solejar/ve/lib/python2.7/site-packages/radical/pilot/configs/resource_stfc.json
2016-07-12 13:38:26,012: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations for stfc.joule
2016-07-12 13:38:26,013: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations from /home/solejar/ve/lib/python2.7/site-packages/radical/pilot/configs/resource_chameleon.json
2016-07-12 13:38:26,017: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations for chameleon.cloud_vm_yarn
2016-07-12 13:38:26,018: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations from /home/solejar/ve/lib/python2.7/site-packages/radical/pilot/configs/resource_futuregrid.json
2016-07-12 13:38:26,044: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations for futuregrid.xray
2016-07-12 13:38:26,045: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations for futuregrid.echo
2016-07-12 13:38:26,046: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations for futuregrid.india
2016-07-12 13:38:26,047: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations for futuregrid.bravo
2016-07-12 13:38:26,048: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations for futuregrid.xray_ccm
2016-07-12 13:38:26,049: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations for futuregrid.delta
2016-07-12 13:38:26,050: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations from /home/solejar/ve/lib/python2.7/site-packages/radical/pilot/configs/resource_ncsa.json
2016-07-12 13:38:26,066: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations for ncsa.bw_ccm
2016-07-12 13:38:26,067: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations for ncsa.bw_aprun
2016-07-12 13:38:26,068: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations for ncsa.bw
2016-07-12 13:38:26,070: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations from /home/solejar/ve/lib/python2.7/site-packages/radical/pilot/configs/resource_xsede.json
2016-07-12 13:38:26,114: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations for xsede.blacklight
2016-07-12 13:38:26,116: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations for xsede.stampede_yarn
2016-07-12 13:38:26,117: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations for xsede.supermic
2016-07-12 13:38:26,118: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations for xsede.lonestar
2016-07-12 13:38:26,119: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations for xsede.comet
2016-07-12 13:38:26,120: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations for xsede.trestles
2016-07-12 13:38:26,122: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations for xsede.stampede
2016-07-12 13:38:26,123: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations for xsede.comet_orte
2016-07-12 13:38:26,124: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations for xsede.gordon
2016-07-12 13:38:26,125: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations from /home/solejar/ve/lib/python2.7/site-packages/radical/pilot/configs/resource_yale.json
2016-07-12 13:38:26,130: radical.pilot       : MainProcess                     : MainThread     : INFO    : Load resource configurations for yale.grace
2016-07-12 13:38:26,161: radical.pilot       : MainProcess                     : Thread-1       : DEBUG   : Worker thread (ID: Thread-1[140390410880768]) for PilotManager pmgr.0000 started.
2016-07-12 13:38:26,162: radical.pilot       : MainProcess                     : PilotLauncherWorker-1: DEBUG   : Connected to MongoDB. Serving requests for PilotManager pmgr.0000.
2016-07-12 13:38:26,171: radical.pilot       : MainProcess                     : MainThread     : DEBUG   : saga.utils.PTYShell ('gsissh://stampede.tacc.utexas.edu:2222/')
2016-07-12 13:38:26,636: radical.saga.pty    : MainProcess                     : MainThread     : ERROR   : read from process failed '[Errno 5] Input/output error' : (Warning: Permanently added the ECDSA host key for IP address '[129.114.62.13]:2222' to the list of known hosts.
Permission denied (publickey,gssapi-keyex,gssapi-with-mic,keyboard-interactive).
) ((Warning: Permanently added the ECDSA host key for IP address '[129.114.62.13]:2222' to the list of known hosts.
Permission denied (publickey,gssapi-keyex,gssapi-with-mic,keyboard-interactive).
)) (/home/solejar/ve/lib/python2.7/site-packages/saga/utils/pty_exceptions.py +61 (translate_exception)  :  e = se.PermissionDenied (cmsg))
Traceback (most recent call last):
  File "/home/solejar/ve/lib/python2.7/site-packages/saga/utils/pty_shell_factory.py", line 263, in _initialize_pty
    n, match = pty_shell.find (prompt_patterns, delay)
  File "/home/solejar/ve/lib/python2.7/site-packages/saga/utils/pty_process.py", line 790, in find
    raise ptye.translate_exception (e, "(%s)" % data)
PermissionDenied: read from process failed '[Errno 5] Input/output error' : (Warning: Permanently added the ECDSA host key for IP address '[129.114.62.13]:2222' to the list of known hosts.
Permission denied (publickey,gssapi-keyex,gssapi-with-mic,keyboard-interactive).
) ((Warning: Permanently added the ECDSA host key for IP address '[129.114.62.13]:2222' to the list of known hosts.
Permission denied (publickey,gssapi-keyex,gssapi-with-mic,keyboard-interactive).
)) (/home/solejar/ve/lib/python2.7/site-packages/saga/utils/pty_exceptions.py +61 (translate_exception)  :  e = se.PermissionDenied (cmsg))
2016-07-12 13:38:26,637: radical.pilot       : MainProcess                     : MainThread     : DEBUG   : session rp.session.workflow.iu.xsede.org.solejar.016994.0003 closing
2016-07-12 13:38:26,638: radical.pilot       : MainProcess                     : MainThread     : DEBUG   : session rp.session.workflow.iu.xsede.org.solejar.016994.0003 closes   pmgr   pmgr.0000
2016-07-12 13:38:26,638: radical.pilot       : MainProcess                     : MainThread     : DEBUG   : pmgr    pmgr.0000 closing
2016-07-12 13:38:26,638: radical.pilot       : MainProcess                     : MainThread     : DEBUG   : pmgr    pmgr.0000 cancel   launcher Thread-1
2016-07-12 13:38:26,638: radical.pilot       : MainProcess                     : MainThread     : DEBUG   : pworker Thread-1 disables launcher PilotLauncherWorker-1
2016-07-12 13:38:26,638: radical.pilot       : MainProcess                     : MainThread     : DEBUG   : launcher PilotLauncherWorker-1 disabling
2016-07-12 13:38:26,639: radical.pilot       : MainProcess                     : MainThread     : DEBUG   : launcher PilotLauncherWorker-1 disabled
2016-07-12 13:38:26,639: radical.pilot       : MainProcess                     : MainThread     : DEBUG   : pworker Thread-1 disabled launcher PilotLauncherWorker-1
2016-07-12 13:38:26,639: radical.pilot       : MainProcess                     : MainThread     : DEBUG   : pmgr    pmgr.0000 canceled launcher Thread-1
2016-07-12 13:38:26,664: radical.pilot       : MainProcess                     : MainThread     : INFO    : Sent 'COMMAND_CANCEL_PILOT' command to pilots [].
2016-07-12 13:38:26,665: radical.pilot       : MainProcess                     : MainThread     : DEBUG   : pmgr    pmgr.0000 cancel   worker Thread-1
2016-07-12 13:38:26,665: radical.pilot       : MainProcess                     : MainThread     : DEBUG   : pworker Thread-1 stops   launcher PilotLauncherWorker-1
2016-07-12 13:38:26,665: radical.pilot       : MainProcess                     : MainThread     : DEBUG   : launcher PilotLauncherWorker-1 stopping
2016-07-12 13:38:27,330: radical.pilot       : MainProcess                     : MainThread     : DEBUG   : launcher PilotLauncherWorker-1 stopped
2016-07-12 13:38:27,331: radical.pilot       : MainProcess                     : MainThread     : DEBUG   : pworker Thread-1 stopped launcher PilotLauncherWorker-1
2016-07-12 13:38:27,331: radical.pilot       : MainProcess                     : MainThread     : DEBUG   : pmgr    pmgr.0000 canceled worker Thread-1
2016-07-12 13:38:27,331: radical.pilot       : MainProcess                     : MainThread     : DEBUG   : pmgr    pmgr.0000 stops    worker Thread-1
2016-07-12 13:38:27,331: radical.pilot       : MainProcess                     : MainThread     : DEBUG   : pworker Thread-1 stopping
2016-07-12 13:38:28,219: radical.pilot       : MainProcess                     : Thread-1       : DEBUG   : pworker Thread-1 stops   launcher PilotLauncherWorker-1
2016-07-12 13:38:28,219: radical.pilot       : MainProcess                     : Thread-1       : DEBUG   : launcher PilotLauncherWorker-1 stopping
2016-07-12 13:38:28,220: radical.pilot       : MainProcess                     : Thread-1       : DEBUG   : launcher PilotLauncherWorker-1 stopped
2016-07-12 13:38:28,220: radical.pilot       : MainProcess                     : Thread-1       : DEBUG   : pworker Thread-1 stopped launcher PilotLauncherWorker-1
2016-07-12 13:38:28,220: radical.pilot       : MainProcess                     : MainThread     : DEBUG   : pworker Thread-1 stopped
2016-07-12 13:38:28,221: radical.pilot       : MainProcess                     : MainThread     : DEBUG   : pmgr    pmgr.0000 stopped  worker Thread-1
2016-07-12 13:38:28,221: radical.pilot       : MainProcess                     : MainThread     : DEBUG   : pmgr    pmgr.0000 closed
2016-07-12 13:38:28,221: radical.pilot       : MainProcess                     : MainThread     : DEBUG   : session rp.session.workflow.iu.xsede.org.solejar.016994.0003 closed   pmgr   pmgr.0000
2016-07-12 13:38:28,221: radical.pilot       : MainProcess                     : MainThread     : DEBUG   : session rp.session.workflow.iu.xsede.org.solejar.016994.0003 closed
Traceback (most recent call last):
  File "00_getting_started.py", line 70, in <module>
    pilot = pmgr.submit_pilots(pdesc)
  File "/home/solejar/ve/lib/python2.7/site-packages/radical/pilot/pilot_manager.py", line 373, in submit_pilots
    resource_config=resource_cfg)
  File "/home/solejar/ve/lib/python2.7/site-packages/radical/pilot/controller/pilot_manager_controller.py", line 433, in register_start_pilot_request
    shell = sup.PTYShell(url, self._session)
  File "/home/solejar/ve/lib/python2.7/site-packages/saga/utils/pty_shell.py", line 247, in __init__
    interactive=self.interactive)
  File "/home/solejar/ve/lib/python2.7/site-packages/saga/utils/pty_shell_factory.py", line 198, in initialize
    self._initialize_pty (info['pty'], info)
  File "/home/solejar/ve/lib/python2.7/site-packages/saga/utils/pty_shell_factory.py", line 413, in _initialize_pty
    raise ptye.translate_exception (e)
saga.exceptions.PermissionDenied: read from process failed '[Errno 5] Input/output error' : (Warning: Permanently added the ECDSA host key for IP address '[129.114.62.13]:2222' to the list of known hosts.
Permission denied (publickey,gssapi-keyex,gssapi-with-mic,keyboard-interactive).
) ((Warning: Permanently added the ECDSA host key for IP address '[129.114.62.13]:2222' to the list of known hosts.
Permission denied (publickey,gssapi-keyex,gssapi-with-mic,keyboard-interactive).
)) (/home/solejar/ve/lib/python2.7/site-packages/saga/utils/pty_exceptions.py +61 (translate_exception)  :  e = se.PermissionDenied (cmsg))

vivek-bala commented 8 years ago

Hey Sean,

Just to make sure: to log in interactively from the workflow machine to Stampede or Comet, you need to use gsissh (gsissh stampede or gsissh comet), not ssh. The RP examples (along with config.json) should already be doing that.

I ran into this issue as well. It might not be the same cause (in my case, I needed a new certificate). Could you try:

$ myproxy-logon
<enter xsede password when prompted>
$ gsissh stampede

solejar commented 8 years ago

That fixed it! Not sure why it was working fine originally and then stopped working today, but running myproxy-logon got it working.

vivek-bala commented 8 years ago

Great!

andre-merzky commented 8 years ago

The default proxy lifetime on the workflow machine is, I believe, 24 hours or so. If you keep your session alive longer than that, you'll end up with an invalid proxy. Or indeed, as happened to Vivek, if you create a proxy for BlueWaters, that new proxy will overwrite (and thus invalidate) the old XSEDE proxy...

FWIW, passing the option -t 48 to myproxy-logon will give you a proxy that is valid for two days. I think the max is usually 10 days.
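
For scripted setups, the command above can be assembled programmatically. This is only a minimal sketch: it assumes myproxy-logon is on the PATH and accepts -t <hours> as described above. The helper name myproxy_logon_cmd and the optional max_hours cap are hypothetical, since the real upper limit is set by the MyProxy server.

```python
def myproxy_logon_cmd(hours, max_hours=None):
    """Build the argument list for `myproxy-logon -t <hours>`.

    max_hours is a hypothetical client-side cap; the actual maximum
    lifetime is decided by the MyProxy server, not by this helper.
    """
    hours = int(hours)
    if hours <= 0:
        raise ValueError("proxy lifetime must be a positive number of hours")
    if max_hours is not None:
        hours = min(hours, int(max_hours))
    return ["myproxy-logon", "-t", str(hours)]


if __name__ == "__main__":
    # Only print the command: myproxy-logon prompts for a password,
    # so it is left to the user to run it interactively.
    print(" ".join(myproxy_logon_cmd(48)))
```

Passing the returned list to subprocess.call would run the command in the current terminal, where the password prompt still appears.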

vivek-bala commented 8 years ago

I actually thought a new certificate was obtained every time I log in, since I don't think I faced this issue before yesterday (during the previous tutorial setups, etc.). Maybe they changed it?

Currently, the default is 12 hours and the max is 11 days.

andre-merzky commented 8 years ago

Yes, it's obtained every time you log in. But if you keep your session alive for longer than those 12 hours, the proxy will be dead...
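
Since an expired proxy only shows up as a "Permission denied" at pilot submission, a pre-flight check before running an RP script may help. The sketch below assumes the Globus tool grid-proxy-info is installed and that grid-proxy-info -timeleft prints the remaining lifetime in seconds. The function names and the one-hour threshold are illustrative and not part of RADICAL-Pilot.

```python
import subprocess


def proxy_seconds_left(run=subprocess.check_output):
    """Return the remaining proxy lifetime in seconds, or 0 if unusable.

    Assumes `grid-proxy-info -timeleft` prints a single integer.
    `run` is injectable so the parsing can be tested without the tool.
    """
    try:
        out = run(["grid-proxy-info", "-timeleft"])
    except (OSError, subprocess.CalledProcessError):
        return 0  # tool not installed, or no proxy found
    if isinstance(out, bytes):
        out = out.decode()
    try:
        return max(int(out.strip()), 0)
    except ValueError:
        return 0  # unexpected output: treat as no valid proxy


def needs_renewal(seconds_left, min_hours=1):
    """True if less than min_hours of proxy lifetime remains."""
    return seconds_left < min_hours * 3600


if __name__ == "__main__":
    left = proxy_seconds_left()
    if needs_renewal(left):
        print("proxy expired or expiring soon: run myproxy-logon")
    else:
        print("proxy valid for another %d minutes" % (left // 60))
```

For long-running sessions, min_hours should probably be at least the expected runtime of the workload, so the proxy does not expire midway.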

vivek-bala commented 8 years ago

Ah ok, that makes sense. Thanks!