Starting c3.4xlarge cluster with HVM image gives cryptic error message

cdoersch commented 10 years ago

I just attempted to start a 1-node spot cluster with c3.4xlarge nodes (with a fresh starcluster install from git). I assumed that the HVM image was the right image to use with this cluster, but apparently I was wrong. However, rather than giving me a proper error message, I just get this:

StarCluster - (http://star.mit.edu/cluster) (v. 0.9999)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster@mit.edu

>>> Validating cluster template settings...
>>> Cluster template settings are valid
>>> Starting cluster...
>>> Launching a 1-node cluster...
>>> Launching master node (ami: ami-52a0c53b, type: c3.4xlarge)...
>>> Creating security group @sc-zoneccluster2...
>>> Creating placement group @sc-zoneccluster2...
SpotInstanceRequest:sir-094a5a32
>>> Waiting for cluster to come up... (updating every 30s)
>>> Waiting for all nodes to be in a 'running' state...
1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%  
!!! ERROR - No active cluster nodes found!

No useful info appears in the error log either. Using the standard 64-bit image works fine.

cdoersch commented 10 years ago

After playing around a bit more, this seems to be a repeat of #105; I just got confused since it worked when I switched the image, but it appears that I sometimes get this behavior even when I'm using the right image. That bug's supposed to be fixed in git, though...

jtriley commented 10 years ago

@cdoersch What does listclusters show after this happens? Do the nodes actually come up?

cdoersch commented 10 years ago

Note that I have one other cluster (zoneccluster) running in the background--hopefully that doesn't affect anything

ubuntu@ip-10-29-251-74:~$ starcluster start --force-spot-master -c smallcluster newcluster
StarCluster - (http://star.mit.edu/cluster) (v. 0.9999)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster@mit.edu

>>> Validating cluster template settings...
>>> Cluster template settings are valid
>>> Starting cluster...
>>> Launching a 1-node cluster...
>>> Launching master node (ami: ami-765b3e1f, type: m2.xlarge)...
>>> Creating security group @sc-newcluster...
SpotInstanceRequest:sir-d11d5434
>>> Waiting for cluster to come up... (updating every 30s)
>>> Waiting for all nodes to be in a 'running' state...
1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100% 
!!! ERROR - No active cluster nodes found!
ubuntu@ip-10-29-251-74:~$ starcluster listclusters
StarCluster - (http://star.mit.edu/cluster) (v. 0.9999)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster@mit.edu

-------------------------------------------
newcluster (security group: @sc-newcluster)
-------------------------------------------
Launch time: N/A
Uptime: N/A
Zone: N/A
Keypair: N/A
EBS volumes: N/A
Cluster nodes: N/A

-------------------------------------------------
zoneccluster3 (security group: @sc-zoneccluster3)
-------------------------------------------------
Launch time: 2014-01-03 00:02:49
Uptime: 2 days, 21:04:06
Zone: us-east-1c
Keypair: cdoerschkey
EBS volumes:
    vol-59248b2a on master:/dev/sdz (status: attached)
    vol-b650afc6 on master:/dev/sdy (status: attached)
Spot requests: 1 active
Cluster nodes:
     master running i-9f8479b1 ec2-23-22-83-117.compute-1.amazonaws.com (spot sir-c243da32)
Total nodes: 1

cdoersch commented 10 years ago

Interestingly, starcluster seems to launch the cluster correctly if I change the name of the cluster I'm trying to launch, even if all the configs stay the same. Something must not be getting completely purged when I do a starcluster terminate -f.

jtriley commented 10 years ago

@cdoersch Sorry for the delay. Looking at the listclusters output immediately after this happens it's clear that the spot requests are being closed/cancelled for one reason or another. If this weren't the case any spot requests/instances associated with the cluster would be listed.

Unfortunately this is not always fixable - StarCluster warns you when creating spot clusters that spot requests are not guaranteed to come up based on supply/demand or other factors. If you're repeatedly hitting this issue please paste the output of the following command immediately after this happens:

$ starcluster listspots -c

In general this will show you all spot requests in your account and give you details about each request including the status code and message. You may also want to check the status of your spot requests in the AWS console.

cdoersch commented 10 years ago

I just had this issue again this morning; this seems to happen almost completely at random. I'm now starting to notice similarities with issue #274 --I do seem to end up with the insufficient capacity error for my spot instances, and it seems to happen at times when us-east is under relatively heavy load. However, usually starcluster crashes with "ERROR - No active cluster nodes found!" before they've even reached the insufficient-capacity state; starcluster exits without waiting at all for them to come up, and they'll eventually show up in the EC2 management console in the pending-evaluation state after starcluster has exited. Usually it seems that if I randomly change my settings for a while, eventually the cluster will eventually launch. Changing the cluster name to something I haven't used before, and changing the instance type to one I haven't used recently seem to do the most to increase the likelihood of getting a successful launch. For instance, I just launched a 1-node m3.2xlarge cluster successfully, terminated it after a minute, and then tried to launch an identical cluster about 5 minutes later, which failed. However, these observations are based on few samples and could be a red herring.

Here's an example from the logs:

2014-05-29 20:36:10,486 PID: 28311 config.py:138 - DEBUG - Loading file: /home/ubuntu/.starcluster/config
2014-05-29 20:36:10,494 PID: 28311 awsutils.py:75 - DEBUG - creating self._conn w/ connection_authenticator kwargs = {'proxy_user': None, 'proxy_pass': None, 'proxy_port': None, 'proxy': None, 'is_secure': True, 'path': '/', 'region': None, 'validate_certs': True, 'port': None}
2014-05-29 20:36:10,592 PID: 28311 cluster.py:1771 - INFO - Validating cluster template settings...
2014-05-29 20:36:10,760 PID: 28311 sshutils.py:859 - DEBUG - rsa private key fingerprint (/home/ubuntu/.ssh/cdoerschkey.pem): 14:f8:38:87:71:7e:52:16:d6:24:e4:b2:87:16:25:9c:4a:b8:fe:84
2014-05-29 20:36:11,513 PID: 28311 cluster.py:888 - DEBUG - Userdata size in KB: 0.65
2014-05-29 20:36:11,513 PID: 28311 cluster.py:1788 - INFO - Cluster template settings are valid
2014-05-29 20:36:11,513 PID: 28311 cluster.py:1609 - INFO - Starting cluster...
2014-05-29 20:36:11,513 PID: 28311 cluster.py:1119 - INFO - Launching a 1-node cluster...
2014-05-29 20:36:11,514 PID: 28311 cluster.py:1178 - INFO - Launching master node (ami: ami-52a0c53b, type: cc1.4xlarge)...
2014-05-29 20:36:11,579 PID: 28311 awsutils.py:272 - INFO - Creating security group @sc-zonebcluster2...
2014-05-29 20:36:12,884 PID: 28311 awsutils.py:402 - INFO - Creating placement group @sc-zonebcluster2...
2014-05-29 20:36:13,033 PID: 28311 cluster.py:888 - DEBUG - Userdata size in KB: 0.65
2014-05-29 20:36:13,036 PID: 28311 cluster.py:888 - DEBUG - Userdata size in KB: 0.65
2014-05-29 20:36:13,152 PID: 28311 awsutils.py:482 - DEBUG - Forcing delete_on_termination for AMI: ami-52a0c53b
2014-05-29 20:36:13,777 PID: 28311 cluster.py:937 - INFO - SpotInstanceRequest:sir-64b5de4f
2014-05-29 20:36:13,777 PID: 28311 cluster.py:1410 - INFO - Waiting for cluster to come up... (updating every 30s)
2014-05-29 20:36:14,018 PID: 28311 cluster.py:1367 - INFO - Waiting for all nodes to be in a 'running' state...
2014-05-29 20:36:14,076 PID: 28311 cluster.py:744 - DEBUG - existing nodes: {}
2014-05-29 20:36:14,076 PID: 28311 cluster.py:760 - DEBUG - returning self._nodes = []
2014-05-29 20:36:14,160 PID: 28311 cli.py:300 - ERROR - No active cluster nodes found!

Below is a list of terminated instances:

i-33f40b61 (terminated)

Interestingly, i-33f40b61 is an instance from a previous run of the cluster; it's not even the same instance type as the cluster I'm trying to launch.

jtriley / StarCluster

Starting c3.4xlarge cluster with HVM image gives cryptic error message #353