jtriley / StarCluster

StarCluster is an open source cluster-computing toolkit for Amazon's Elastic Compute Cloud (EC2).
http://star.mit.edu/cluster
GNU Lesser General Public License v3.0

loadbalance crashes if network connection fails #337

Open liffiton opened 10 years ago

liffiton commented 10 years ago

The load balancer was running with '-K' to terminate a cluster once it was done processing, but one connection attempt failed, and the entire process crashed. This left the cluster running until I found and manually terminated it several hours after it could have been shut down automatically. Ideally, loadbalance would keep attempting to connect if one connection failed, allowing it to manage clusters even after transient network failures.

The exact error encountered:

>>> Looking for nodes to remove...
!!! ERROR - Connection error:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/starcluster/cli.py", line 274, in main
    sc.execute(args)
  File "/usr/local/lib/python2.7/dist-packages/starcluster/commands/loadbalance.py", line 115, in execute
    lb.run(cluster)
  File "/usr/local/lib/python2.7/dist-packages/starcluster/balancers/sge/__init__.py", line 614, in run
    self._eval_remove_node()
  File "/usr/local/lib/python2.7/dist-packages/starcluster/balancers/sge/__init__.py", line 715, in _eval_remove_node
    remove_nodes = self._find_nodes_for_removal(max_remove=max_remove)
  File "/usr/local/lib/python2.7/dist-packages/starcluster/balancers/sge/__init__.py", line 775, in _find_nodes_for_removal
    for node in self._cluster.running_nodes:
  File "/usr/local/lib/python2.7/dist-packages/starcluster/cluster.py", line 761, in running_nodes
    return self._nodes_in_states(['running'])
  File "/usr/local/lib/python2.7/dist-packages/starcluster/cluster.py", line 757, in _nodes_in_states
    return filter(lambda x: x.state in states, self.nodes)
  File "/usr/local/lib/python2.7/dist-packages/starcluster/cluster.py", line 703, in nodes
    nodes = self.ec2.get_all_instances(filters=filters)
  File "/usr/local/lib/python2.7/dist-packages/starcluster/awsutils.py", line 629, in get_all_instances
    filters=filters)
  File "/usr/local/lib/python2.7/dist-packages/boto/ec2/connection.py", line 558, in get_all_instances
    filters=filters, dry_run=dry_run)
  File "/usr/local/lib/python2.7/dist-packages/boto/ec2/connection.py", line 628, in get_all_reservations
    [('item', Reservation)], verb='POST')
  File "/usr/local/lib/python2.7/dist-packages/boto/connection.py", line 1106, in get_list
    body = response.read()
  File "/usr/local/lib/python2.7/dist-packages/boto/connection.py", line 412, in read
    self._cached_response = httplib.HTTPResponse.read(self)
  File "/usr/lib/python2.7/httplib.py", line 543, in read
    return self._read_chunked(amt)
  File "/usr/lib/python2.7/httplib.py", line 585, in _read_chunked
    line = self.fp.readline(_MAXLINE + 1)
  File "/usr/lib/python2.7/socket.py", line 476, in readline
    data = self._sock.recv(self._rbufsize)
  File "/usr/lib/python2.7/ssl.py", line 305, in recv
    return self.read(buflen)
  File "/usr/lib/python2.7/ssl.py", line 224, in read
    return self._sslobj.read(len)
SSLError: The read operation timed out
!!! ERROR - Check your internet connection?

liffiton commented 10 years ago

Hm, and another crash with a different stack trace. This was while I was performing an scp copy from the cluster (i.e., loading the network, I suppose, though it was up the whole time) in another terminal:

>>> Sleeping...(looping again in 60 secs)

!!! ERROR - Connection error:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/starcluster/cli.py", line 274, in main
    sc.execute(args)
  File "/usr/local/lib/python2.7/dist-packages/starcluster/commands/loadbalance.py", line 115, in execute
    lb.run(cluster)
  File "/usr/local/lib/python2.7/dist-packages/starcluster/balancers/sge/__init__.py", line 592, in run
    if not cluster.is_cluster_up():
  File "/usr/local/lib/python2.7/dist-packages/starcluster/cluster.py", line 1185, in is_cluster_up
    spots = self.spot_requests
  File "/usr/local/lib/python2.7/dist-packages/starcluster/cluster.py", line 771, in spot_requests
    return self.ec2.get_all_spot_requests(filters=filters)
  File "/usr/local/lib/python2.7/dist-packages/starcluster/awsutils.py", line 662, in get_all_spot_requests
    filters=filters)
  File "/usr/local/lib/python2.7/dist-packages/boto/ec2/connection.py", line 1260, in get_all_spot_instance_requests
    [('item', SpotInstanceRequest)], verb='POST')
  File "/usr/local/lib/python2.7/dist-packages/boto/connection.py", line 1106, in get_list
    body = response.read()
  File "/usr/local/lib/python2.7/dist-packages/boto/connection.py", line 412, in read
    self._cached_response = httplib.HTTPResponse.read(self)
  File "/usr/lib/python2.7/httplib.py", line 543, in read
    return self._read_chunked(amt)
  File "/usr/lib/python2.7/httplib.py", line 601, in _read_chunked
    value.append(self._safe_read(chunk_left))
  File "/usr/lib/python2.7/httplib.py", line 658, in _safe_read
    chunk = self.fp.read(min(amt, MAXAMOUNT))
  File "/usr/lib/python2.7/socket.py", line 380, in read
    data = self._sock.recv(left)
  File "/usr/lib/python2.7/ssl.py", line 305, in recv
    return self.read(buflen)
  File "/usr/lib/python2.7/ssl.py", line 224, in read
    return self._sslobj.read(len)
SSLError: The read operation timed out
!!! ERROR - Check your internet connection?

liffiton commented 10 years ago

Would this be a reasonable solution?

If that sounds alright, I can go ahead and test it and submit a pull request. I don't want to waste your time if there's a better solution, though.

jtriley commented 10 years ago

@liffiton Thanks for reporting. In general your fix seems reasonable, except that these errors can occur on any EC2 API call and could leave "dangling" (or unconfigured) nodes.

In the tracebacks you linked, the error is harmless, and simply restarting the loop would be fine, since all that's happening is a fetch of spot requests and instances. My concern is with more mission-critical API calls, such as running instances or submitting new spot requests.
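
To make that distinction concrete, here is a minimal sketch of retrying only idempotent, read-only calls. It is not StarCluster code: `retry_read_only`, `TRANSIENT_ERRORS`, and `flaky_list_nodes` are hypothetical names, and treating `ssl.SSLError` and other socket errors as transient is an assumption based on the tracebacks in this issue. Calls that change state (launching instances, submitting spot requests) are deliberately left out, since blindly repeating them could create duplicate or dangling nodes.

```python
import socket
import ssl
import time

# Assumption: these exception types indicate a transient network failure.
# The tracebacks in this issue end in ssl.SSLError on a read timeout.
TRANSIENT_ERRORS = (ssl.SSLError, socket.timeout, socket.error)


def retry_read_only(func, attempts=3, delay=5.0, sleep=time.sleep):
    """Call func() and retry it when a transient network error occurs.

    Hypothetical helper intended only for idempotent, read-only calls
    (for example, listing instances or spot requests).
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except TRANSIENT_ERRORS:
            if attempt == attempts:
                raise  # out of retries: let the caller see the error
            sleep(delay)  # wait before trying again


# Usage sketch: a stand-in for a read-only query that fails twice.
calls = {"count": 0}


def flaky_list_nodes():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ssl.SSLError("The read operation timed out")
    return ["master", "node001"]


nodes = retry_read_only(flaky_list_nodes, attempts=5, delay=0)
```

With a wrapper along these lines, a timeout while listing nodes would cost a few retries rather than ending the whole load balancer process, while state-changing calls would still fail loudly.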

I talked to @FinchPowers about this, and it seems he has some load balancer improvements that can help. His improvements check for the more serious cases of dangling nodes, which should let us implement this fix with more confidence that it's not leaving things in a bad state.

@FinchPowers agreed to submit some PRs for this later today. I'm going to try to merge these changes first and then we can discuss implementing this fix. Thanks!!