Open liffiton opened 10 years ago
Hm, and another crash with a different stack trace. This happened while I was running an scp transfer from the cluster in another terminal (i.e., loading the network, I suppose, though the network was up the whole time):
>>> Sleeping...(looping again in 60 secs)
!!! ERROR - Connection error:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/starcluster/cli.py", line 274, in main
    sc.execute(args)
  File "/usr/local/lib/python2.7/dist-packages/starcluster/commands/loadbalance.py", line 115, in execute
    lb.run(cluster)
  File "/usr/local/lib/python2.7/dist-packages/starcluster/balancers/sge/__init__.py", line 592, in run
    if not cluster.is_cluster_up():
  File "/usr/local/lib/python2.7/dist-packages/starcluster/cluster.py", line 1185, in is_cluster_up
    spots = self.spot_requests
  File "/usr/local/lib/python2.7/dist-packages/starcluster/cluster.py", line 771, in spot_requests
    return self.ec2.get_all_spot_requests(filters=filters)
  File "/usr/local/lib/python2.7/dist-packages/starcluster/awsutils.py", line 662, in get_all_spot_requests
    filters=filters)
  File "/usr/local/lib/python2.7/dist-packages/boto/ec2/connection.py", line 1260, in get_all_spot_instance_requests
    [('item', SpotInstanceRequest)], verb='POST')
  File "/usr/local/lib/python2.7/dist-packages/boto/connection.py", line 1106, in get_list
    body = response.read()
  File "/usr/local/lib/python2.7/dist-packages/boto/connection.py", line 412, in read
    self._cached_response = httplib.HTTPResponse.read(self)
  File "/usr/lib/python2.7/httplib.py", line 543, in read
    return self._read_chunked(amt)
  File "/usr/lib/python2.7/httplib.py", line 601, in _read_chunked
    value.append(self._safe_read(chunk_left))
  File "/usr/lib/python2.7/httplib.py", line 658, in _safe_read
    chunk = self.fp.read(min(amt, MAXAMOUNT))
  File "/usr/lib/python2.7/socket.py", line 380, in read
    data = self._sock.recv(left)
  File "/usr/lib/python2.7/ssl.py", line 305, in recv
    return self.read(buflen)
  File "/usr/lib/python2.7/ssl.py", line 224, in read
    return self._sslobj.read(len)
SSLError: The read operation timed out
!!! ERROR - Check your internet connection?
Would this be a reasonable solution?
If that sounds alright, I can go ahead and test it and submit a pull request. I don't want to waste your time if there's a better solution, though.
@liffiton Thanks for reporting. In general your fix seems reasonable, except that these errors can occur on any EC2 API call and could leave "dangling" (or unconfigured) nodes.
In the tracebacks you linked, the error is harmless, and simply restarting the loop would be fine, since all that's happening is a fetch of spot requests and instance requests. My concern is with more mission-critical API calls, such as running instances or submitting new spot requests.
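To make that distinction concrete, here is a rough sketch (not StarCluster code) of a retry helper meant only for idempotent, read-only calls. The name `retry_read_only`, its parameters, and the choice of which exceptions count as transient are all assumptions made for illustration.

```python
import socket
import ssl
import time

# Exceptions treated as transient in this sketch. This tuple is an
# assumption, not a list taken from StarCluster or boto.
TRANSIENT_ERRORS = (ssl.SSLError, socket.timeout, socket.error)


def retry_read_only(fetch, attempts=3, delay=5.0, sleep=time.sleep):
    """Call fetch() and retry it when a transient network error occurs.

    Intended only for idempotent, read-only calls (for example, listing
    spot requests), where repeating the call cannot create resources.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except TRANSIENT_ERRORS:
            if attempt == attempts:
                raise  # out of attempts: let the caller decide what to do
            sleep(delay)
```

A read such as `retry_read_only(lambda: cluster.spot_requests)` could be wrapped this way, while calls that launch instances or submit spot requests would stay unwrapped, since a blind retry there might request capacity twice.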
I talked to @FinchPowers about this, and it seems he has some load balancer improvements that can help. His improvements check for the more serious cases of dangling nodes, which should let us implement this fix with more confidence that it's not leaving things in a bad state.
@FinchPowers agreed to submit some PRs for this later today. I'm going to try to merge these changes first and then we can discuss implementing this fix. Thanks!!
The load balancer was running with '-K' to terminate a cluster once it was done processing, but one connection attempt failed, and the entire process crashed. This left the cluster running until I found and manually terminated it several hours after it could have been shut down automatically. Ideally, loadbalance would keep attempting to connect if one connection failed, allowing it to manage clusters even after transient network failures.
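The behaviour asked for here could look roughly like the loop below. This is only a sketch under stated assumptions: `poll_until_done` and `poll_once` are hypothetical names standing in for the balancer loop and one of its iterations, and the 60-second interval simply mirrors the "looping again in 60 secs" message in the log.

```python
import logging
import socket
import ssl
import time

log = logging.getLogger("loadbalance-sketch")


def poll_until_done(poll_once, interval=60, sleep=time.sleep):
    """Call poll_once() every `interval` seconds until it returns True.

    poll_once is a hypothetical callable representing one balancer
    iteration; it returns True once the cluster's work is finished.
    A transient network error skips that iteration instead of ending
    the whole process.
    """
    while True:
        try:
            if poll_once():
                return
        except (ssl.SSLError, socket.error) as exc:
            log.warning("Connection error, will retry: %s", exc)
        sleep(interval)
```

With a loop like this, a single timed-out request costs one polling interval rather than leaving the cluster unmanaged.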
The exact error encountered: