nchammas / flintrock

A command-line tool for launching Apache Spark clusters.
Apache License 2.0

Flintrock hangs on "instances granted" #167

bakztfuture closed this issue 7 years ago

bakztfuture commented 7 years ago

Hi everyone,

I am requesting 3 c4.8xlarge spot instances in us-east-1a at a bid of $1.0, but for some reason Flintrock hangs at "All 3 instances granted." I have created a VPC, subnet, placement group, and security group, all in us-east-1a, and am trying to launch this project.

When I try to SSH into each of the instances afterwards, the connection times out. I know my pem file is in working order because I used the same one earlier today to access an independently created EC2 instance.

In my security group, I have a rule set as SSH, TCP, port 22, 0.0.0.0/0 for both inbound and outbound.

In my EC2 dashboard, I can see that all 3 instances do get created, end up in the running state, and pass 2/2 status checks.

Eventually, however, Flintrock says "There was a problem with the launch. Cleaning up..." and terminates the instances. While it is terminating them, I sometimes get this message:

Terminating instances...
[54.164.67.42] Could not connect via SSH.
Makefile:117: recipe for target 'aws_spark_flintrock_create' failed
make: *** [aws_spark_flintrock_create] Error 1

Any ideas what's going on? Any way that I could troubleshoot better? Access any logs? Thank you!
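
One way to narrow down an SSH timeout like this is to probe port 22 directly and look at how the connection fails. The following is only a sketch and not part of Flintrock; `probe_port` is a hypothetical helper name. A timeout usually points at the network path (security group, route table, or Internet gateway), while a refusal means the host was reached but nothing is listening on the port yet.

```python
import socket


def probe_port(host, port=22, timeout=5.0):
    """Try a plain TCP connection and report how it ended."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No answer at all: packets are likely being dropped on the way.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered, but no service accepts connections on this port.
        return "refused"
    except OSError as exc:
        # Anything else, e.g. no route to host or a DNS failure.
        return "error: {}".format(exc)


# Example, using the address from the log above:
# print(probe_port("54.164.67.42"))
```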

nchammas commented 7 years ago

Makefile:117: recipe for target 'aws_spark_flintrock_create' failed
make: *** [aws_spark_flintrock_create] Error 1

It looks like you're running Flintrock from within another project.

Some initial questions for you:

  1. Does this happen consistently?
  2. Does this happen when you run Flintrock alone?
  3. Does this happen with Flintrock 0.7.0?
  4. Are you sure you're running Python 3? Flintrock only supports Python 3.4+.

bakztfuture commented 7 years ago

@nchammas

  1. The project (with Flintrock) was working very well just a month or two ago, with the same keys/config, code base, and machine. The only difference is I used to point all of the configuration/credentials to us-east-1b and am now pointing it to us-east-1a

     This is a consistent problem. I have tried to launch a cluster over a dozen times now.

  2. Yes, I just ran:

    flintrock launch test-cluster \
        --num-slaves 1 \
        --spark-version 2.0.2 \
        --ec2-key-name xxx \
        --ec2-identity-file xxx.pem \
        --ec2-ami ami-b73b63a0 \
        --ec2-user ec2-user

    but it's still getting hung up here:

    Requesting 2 spot instances at a max price of $1.0...
    0 of 2 instances granted. Waiting...
    All 2 instances granted.

  3. I tried installing Flintrock 0.7.0 on the machine but ran into errors. I can explore this in more detail if necessary.

  4. Not exactly sure how I can tell. The Dockerfile for the project installs both python and python3. I ran python --version inside the Docker container and it says it's 2.7, but when I call the flintrock command it works just fine (and, again, it launched the cluster successfully in the past).
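
A possible way to answer the Python version question is sketched below. It assumes Flintrock was installed with pip, in which case the `flintrock` command is normally a small script whose first line (the shebang) names the interpreter it runs under; the helper names are hypothetical. Running `python --version` only shows what the `python` command resolves to, which may not be the interpreter the script uses.

```python
import shutil
import sys


def meets_flintrock_minimum(version_info=sys.version_info):
    """Return True if the given version is at least Python 3.4."""
    return tuple(version_info[:2]) >= (3, 4)


def script_interpreter(command="flintrock"):
    """Return the shebang interpreter of an installed script, or None."""
    path = shutil.which(command)
    if path is None:
        return None  # command is not on PATH
    with open(path, "rb") as script:
        first_line = script.readline(256).decode("utf-8", "replace").strip()
    # A standalone binary has no shebang line, so None is returned for it.
    return first_line[2:].strip() if first_line.startswith("#!") else None


if __name__ == "__main__":
    print("This interpreter:", sys.version.split()[0])
    print("Meets the 3.4+ minimum:", meets_flintrock_minimum())
    print("flintrock runs under:", script_interpreter())
```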

nchammas commented 7 years ago

The only difference is I used to point all of the configuration/credentials to us-east-1b and am now pointing it to us-east-1a

Hmm... Seems strange! Are the VPCs configured identically across these two different zones? Does the VPC in us-east-1a have an attached Internet gateway? That would explain why you can't SSH into the launched nodes.
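
A rough way to check both conditions with boto3 is sketched below. It assumes boto3 is installed and AWS credentials are configured; the function names and the default region are illustrative only. The check is coarse: it looks at every route table in the VPC rather than the one associated with the specific subnet the cluster launches into.

```python
def has_internet_route(route_tables):
    """Return True if any route table sends 0.0.0.0/0 to an Internet gateway."""
    for table in route_tables:
        for route in table.get("Routes", []):
            if (
                route.get("DestinationCidrBlock") == "0.0.0.0/0"
                and route.get("GatewayId", "").startswith("igw-")
                and route.get("State", "active") == "active"
            ):
                return True
    return False


def check_vpc(vpc_id, region="us-east-1"):
    """Return (gateway_attached, default_route_present) for a VPC."""
    import boto3  # third-party; imported here so the helper above has no dependencies

    ec2 = boto3.client("ec2", region_name=region)
    gateways = ec2.describe_internet_gateways(
        Filters=[{"Name": "attachment.vpc-id", "Values": [vpc_id]}]
    )["InternetGateways"]
    tables = ec2.describe_route_tables(
        Filters=[{"Name": "vpc-id", "Values": [vpc_id]}]
    )["RouteTables"]
    return bool(gateways), has_internet_route(tables)
```

If either value comes back False, instances in that VPC may be unreachable from outside even when the security group allows SSH.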

I tried installing Flintrock 0.7.0 on the machine but ran into errors. I can explore this in more detail if necessary.

What errors are you seeing?

bakztfuture commented 7 years ago

thanks for your help @nchammas!

Creating an internet gateway was not in the set of instructions I was following, so I thought I had my AWS configuration set up correctly. This was clearly not an issue with Flintrock.

For anyone else wondering: later, I also had to follow this guide to create a route that points the VPC to the internet gateway. After both of those steps, the connection no longer timed out and I was able to SSH into the newly created instances.
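
For reference, the two steps described above (attach an Internet gateway, then add a default route to it) could be scripted roughly as follows. This is a sketch that assumes boto3 and sufficient IAM permissions, not the contents of the guide mentioned above; `add_internet_access` is a hypothetical name, and the IDs in the usage comment are placeholders.

```python
def add_internet_access(ec2, vpc_id, route_table_id):
    """Create an Internet gateway, attach it to the VPC, and add a default route.

    `ec2` is expected to be a boto3 EC2 client.
    """
    gateway = ec2.create_internet_gateway()["InternetGateway"]
    igw_id = gateway["InternetGatewayId"]
    ec2.attach_internet_gateway(InternetGatewayId=igw_id, VpcId=vpc_id)
    # Send all non-local traffic from this route table through the gateway.
    ec2.create_route(
        RouteTableId=route_table_id,
        DestinationCidrBlock="0.0.0.0/0",
        GatewayId=igw_id,
    )
    return igw_id


# Example usage (placeholder IDs, not real resources):
# import boto3
# ec2 = boto3.client("ec2", region_name="us-east-1")
# add_internet_access(ec2, "vpc-xxxxxxxx", "rtb-xxxxxxxx")
```

The route table passed in should be the one associated with the subnet that the cluster launches into.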

nchammas commented 7 years ago

Glad that resolved the issue. 👍