jtriley / StarCluster

StarCluster is an open source cluster-computing toolkit for Amazon's Elastic Compute Cloud (EC2).
http://star.mit.edu/cluster
GNU Lesser General Public License v3.0

ssh timeouts during start/restart (Mac OS X Lion) #112

Open stephenhow opened 12 years ago

stephenhow commented 12 years ago

On my Mac (OS X Lion) I'm getting ssh timeouts when starting a new cluster, or trying to re-start it. This happens almost 100% of the time on a start/restart. I've been able to create one or two clusters from my Mac, but otherwise starcluster start almost always fails.

I work around the problem by running starcluster from a Linux instance.

stephenhow commented 12 years ago

2012-05-25 07:49:25,277 PID: 94680 threadpool.py:123 - INFO - Shutting down threads...
2012-05-25 07:49:25,278 PID: 94680 threadpool.py:135 - DEBUG - unfinished_tasks = 20
2012-05-25 07:49:26,626 PID: 94680 cli.py:287 - DEBUG - Traceback (most recent call last):
  File "/Library/Python/2.7/site-packages/StarCluster-0.93.3-py2.7.egg/starcluster/cli.py", line 255, in main
    sc.execute(args)
  File "/Library/Python/2.7/site-packages/StarCluster-0.93.3-py2.7.egg/starcluster/commands/restart.py", line 26, in execute
    self.cm.restart_cluster(arg)
  File "/Library/Python/2.7/site-packages/StarCluster-0.93.3-py2.7.egg/starcluster/cluster.py", line 173, in restart_cluster
    cl.restart_cluster()
  File "", line 2, in restart_cluster
  File "/Library/Python/2.7/site-packages/StarCluster-0.93.3-py2.7.egg/starcluster/utils.py", line 87, in wrap_f
    res = func(_arg, _kargs)
  File "/Library/Python/2.7/site-packages/StarCluster-0.93.3-py2.7.egg/starcluster/cluster.py", line 1316, in restart_cluster
    self.setup_cluster()
  File "/Library/Python/2.7/site-packages/StarCluster-0.93.3-py2.7.egg/starcluster/cluster.py", line 1446, in setup_cluster
    self._setup_cluster()
  File "", line 2, in _setup_cluster
  File "/Library/Python/2.7/site-packages/StarCluster-0.93.3-py2.7.egg/starcluster/utils.py", line 87, in wrap_f
    res = func(_arg, _kargs)
  File "/Library/Python/2.7/site-packages/StarCluster-0.93.3-py2.7.egg/starcluster/cluster.py", line 1460, in _setup_cluster
    self.cluster_shell, self.volumes)
  File "/Library/Python/2.7/site-packages/StarCluster-0.93.3-py2.7.egg/starcluster/clustersetup.py", line 350, in run
    self._setup_passwordless_ssh()
  File "/Library/Python/2.7/site-packages/StarCluster-0.93.3-py2.7.egg/starcluster/clustersetup.py", line 226, in _setup_passwordless_ssh
    master.enable_passwordless_ssh('root', nodes)
  File "/Library/Python/2.7/site-packages/StarCluster-0.93.3-py2.7.egg/starcluster/node.py", line 523, in enable_passwordless_ssh
    self.copy_remote_file_to_nodes(known_hosts_file, nodes)
  File "/Library/Python/2.7/site-packages/StarCluster-0.93.3-py2.7.egg/starcluster/node.py", line 550, in copy_remote_file_to_nodes
    nrf.write(contents)
  File "build/bdist.macosx-10.7-intel/egg/ssh/file.py", line 314, in write
    self._write_all(data)
  File "build/bdist.macosx-10.7-intel/egg/ssh/file.py", line 435, in _write_all
    count = self._write(data)
  File "build/bdist.macosx-10.7-intel/egg/ssh/sftp_file.py", line 165, in _write
    t, msg = self.sftp._read_response(req)
  File "build/bdist.macosx-10.7-intel/egg/ssh/sftp_client.py", line 667, in _read_response
    raise SSHException('Server connection dropped: %s' % (str(e),))
SSHException: Server connection dropped:

---------- SYSTEM INFO ----------
StarCluster: 0.93.3
Python: 2.7.1 (r271:86832, Jul 31 2011, 19:30:53) [GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)]
Platform: Darwin-11.3.0-x86_64-i386-64bit
boto: 2.3.0
ssh: 1.7.13
Crypto: 2.5
jinja2: 2.6
decorator: 3.3.1

stephenhow commented 12 years ago

Mounting all NFS export path(s) on 47 worker node(s)
47/47 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
Setting up NFS took 0.221 mins
Configuring passwordless ssh for root
No handlers could be found for logger "ssh.transport"
Shutting down threads...
20/20 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
Traceback (most recent call last):
  File "/Library/Python/2.7/site-packages/StarCluster-0.93.3-py2.7.egg/starcluster/cli.py", line 255, in main
    sc.execute(args)
  File "/Library/Python/2.7/site-packages/StarCluster-0.93.3-py2.7.egg/starcluster/commands/restart.py", line 26, in execute
    self.cm.restart_cluster(arg)
  File "/Library/Python/2.7/site-packages/StarCluster-0.93.3-py2.7.egg/starcluster/cluster.py", line 173, in restart_cluster
    cl.restart_cluster()
  File "", line 2, in restart_cluster
  File "/Library/Python/2.7/site-packages/StarCluster-0.93.3-py2.7.egg/starcluster/utils.py", line 87, in wrap_f
    res = func(_arg, _kargs)
  File "/Library/Python/2.7/site-packages/StarCluster-0.93.3-py2.7.egg/starcluster/cluster.py", line 1316, in restart_cluster
    self.setup_cluster()
  File "/Library/Python/2.7/site-packages/StarCluster-0.93.3-py2.7.egg/starcluster/cluster.py", line 1446, in setup_cluster
    self._setup_cluster()
  File "", line 2, in _setup_cluster
  File "/Library/Python/2.7/site-packages/StarCluster-0.93.3-py2.7.egg/starcluster/utils.py", line 87, in wrap_f
    res = func(_arg, _kargs)
  File "/Library/Python/2.7/site-packages/StarCluster-0.93.3-py2.7.egg/starcluster/cluster.py", line 1460, in _setup_cluster
    self.cluster_shell, self.volumes)
  File "/Library/Python/2.7/site-packages/StarCluster-0.93.3-py2.7.egg/starcluster/clustersetup.py", line 350, in run
    self._setup_passwordless_ssh()
  File "/Library/Python/2.7/site-packages/StarCluster-0.93.3-py2.7.egg/starcluster/clustersetup.py", line 226, in _setup_passwordless_ssh
    master.enable_passwordless_ssh('root', nodes)
  File "/Library/Python/2.7/site-packages/StarCluster-0.93.3-py2.7.egg/starcluster/node.py", line 519, in enable_passwordless_ssh
    self.copy_remote_file_to_nodes(priv_key_file, nodes)
  File "/Library/Python/2.7/site-packages/StarCluster-0.93.3-py2.7.egg/starcluster/node.py", line 550, in copy_remote_file_to_nodes
    nrf.write(contents)
  File "build/bdist.macosx-10.7-intel/egg/ssh/file.py", line 314, in write
    self._write_all(data)
  File "build/bdist.macosx-10.7-intel/egg/ssh/file.py", line 435, in _write_all
    count = self._write(data)
  File "build/bdist.macosx-10.7-intel/egg/ssh/sftp_file.py", line 165, in _write
    t, msg = self.sftp._read_response(req)
  File "build/bdist.macosx-10.7-intel/egg/ssh/sftp_client.py", line 667, in _read_response
    raise SSHException('Server connection dropped: %s' % (str(e),))
SSHException: Server connection dropped:

!!! ERROR - Oops! Looks like you've found a bug in StarCluster
!!! ERROR - Crash report written to: /Users/show/.starcluster/logs/crash-report-94732.txt
!!! ERROR - Please remove any sensitive data from the crash report
!!! ERROR - and submit it to starcluster@mit.edu
Macintosh:otr_trips show$

jtriley commented 12 years ago

@stephenhow Would you mind installing the latest development version from GitHub and checking whether you still have the same issues? Given the exception message "Server connection dropped", this looks like a faulty network connection, but it could also be due to an issue with PyCrypto. The latest git version pulls in a newer version of the Python ssh library, and consequently PyCrypto, which might fix this issue.
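
After upgrading, it may help to confirm which library versions Python actually imports, so they can be compared with the SYSTEM INFO block in the crash report. The following is only a rough sketch: the helper name `report_versions` is made up for illustration, and it assumes each module exposes a `__version__` attribute, which may not hold for every package (in that case it reports "unknown").

```python
import importlib


def report_versions(module_names):
    """Return a dict mapping each module name to its reported version.

    Hypothetical helper, not part of StarCluster. Assumes modules expose
    a __version__ attribute; falls back to "unknown" when they do not.
    """
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # The module is not importable from this interpreter.
            versions[name] = "not installed"
            continue
        versions[name] = str(getattr(module, "__version__", "unknown"))
    return versions


if __name__ == "__main__":
    # Module names taken from the SYSTEM INFO block in the crash report.
    for name, version in sorted(report_versions(
            ["starcluster", "boto", "ssh", "Crypto"]).items()):
        print("%s: %s" % (name, version))
```

Run it with the same interpreter that runs starcluster; if the ssh and Crypto versions still match the ones in the crash report, the older installation is probably still the one being picked up.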