nchammas / flintrock

A command-line tool for launching Apache Spark clusters.
Apache License 2.0

failed cluster creation #107

Closed asafcombo closed 8 years ago

asafcombo commented 8 years ago

Whatever I do, I always get:

Configuring Spark master...
Spark health check failed.
There was a problem with the launch. Cleaning up...
Do you want to terminate the 2 instances created by this operation? [Y/n]

I tried entering n and was able to log in to the master, but I am pretty sure not all of the configuration is OK. I repeated this process several times.

I tried installing Flintrock on both the Amazon Linux AMI and Ubuntu. The launch command is:

    ./flintrock launch c1 --num-slaves 1 --spark-version 1.6.1 --ec2-key-name name --ec2-identity-file /home/ec2-user/flint/k.pem --ec2-ami ami-08111162 --ec2-user ec2-user

nchammas commented 8 years ago

Hmm, sorry you're not getting a more specific error message to debug with.

  1. What version of Flintrock are you running?

    flintrock --version
  2. What do you see in the master log when you log in to the cluster?

    For example:

    less spark/logs/spark-ec2-user-org.apache.spark.deploy.master.Master-....out

    That should give you some clues as to what's going wrong.

asafcombo commented 8 years ago

I'm using the new Amazon AMI ami-08111162 to run Flintrock. Also, I use the standalone build, version 0.4.0.

Nothing in the logs seems wrong.

Master:

Spark Command: /usr/lib/jvm/jre/bin/java -cp /home/ec2-user/spark/conf/:/home/ec2-user/spark/lib/spark-assembly-1.6.1-hadoop2.6.0.jar:/home/ec2-user/spark/lib/datanucleus-rdbms-3.2.9.jar:/home/ec2-user/spark/lib/datanucleus-api-jdo-3.2.6.jar:/home/ec2-user/spark/lib/datanucleus-core-3.2.10.jar:/home/ec2-user/hadoop/conf -Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.master.Master --ip ec2-54-165-104-183.compute-1.amazonaws.com --port 7077 --webui-port 8080
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/04/10 11:08:01 INFO Master: Registered signal handlers for [TERM, HUP, INT]
16/04/10 11:08:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/04/10 11:08:03 INFO SecurityManager: Changing view acls to: ec2-user
16/04/10 11:08:03 INFO SecurityManager: Changing modify acls to: ec2-user
16/04/10 11:08:03 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ec2-user); users with modify permissions: Set(ec2-user)
16/04/10 11:08:04 INFO Utils: Successfully started service 'sparkMaster' on port 7077.
16/04/10 11:08:04 INFO Master: Starting Spark master at spark://ec2-54-165-104-183.compute-1.amazonaws.com:7077
16/04/10 11:08:04 INFO Master: Running Spark version 1.6.1
16/04/10 11:08:05 INFO Utils: Successfully started service 'MasterUI' on port 8080.
16/04/10 11:08:05 INFO MasterWebUI: Started MasterWebUI at http://ec2-54-165-104-183.compute-1.amazonaws.com:8080
16/04/10 11:08:05 INFO Utils: Successfully started service on port 6066.
16/04/10 11:08:05 INFO StandaloneRestServer: Started REST server for submitting applications on port 6066
16/04/10 11:08:05 INFO Master: I have been elected leader! New state: ALIVE
16/04/10 11:08:12 INFO Master: Registering worker 172.31.52.220:40284 with 1 cores, 2.7 GB RAM

Worker:

Spark Command: /usr/lib/jvm/jre/bin/java -cp /home/ec2-user/spark/conf/:/home/ec2-user/spark/lib/spark-assembly-1.6.1-hadoop2.6.0.jar:/home/ec2-user/spark/lib/datanucleus-rdbms-3.2.9.jar:/home/ec2-user/spark/lib/datanucleus-api-jdo-3.2.6.jar:/home/ec2-user/spark/lib/datanucleus-core-3.2.10.jar:/home/ec2-user/hadoop/conf -Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://ec2-54-165-104-183.compute-1.amazonaws.com:7077
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/04/10 11:08:08 INFO Worker: Registered signal handlers for [TERM, HUP, INT]
16/04/10 11:08:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/04/10 11:08:10 INFO SecurityManager: Changing view acls to: ec2-user
16/04/10 11:08:10 INFO SecurityManager: Changing modify acls to: ec2-user
16/04/10 11:08:10 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ec2-user); users with modify permissions: Set(ec2-user)
16/04/10 11:08:11 INFO Utils: Successfully started service 'sparkWorker' on port 40284.
16/04/10 11:08:11 INFO Worker: Starting Spark worker 172.31.52.220:40284 with 1 cores, 2.7 GB RAM
16/04/10 11:08:11 INFO Worker: Running Spark version 1.6.1
16/04/10 11:08:11 INFO Worker: Spark home: /home/ec2-user/spark
16/04/10 11:08:11 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
16/04/10 11:08:11 INFO WorkerWebUI: Started WorkerWebUI at http://ec2-52-90-115-54.compute-1.amazonaws.com:8081
16/04/10 11:08:11 INFO Worker: Connecting to master ec2-54-165-104-183.compute-1.amazonaws.com:7077...
16/04/10 11:08:12 INFO Worker: Successfully registered with master spark://ec2-54-165-104-183.compute-1.amazonaws.com:7077

However, when pressing 'n' when asked if I want to terminate, I get this:

Traceback (most recent call last):
  File "urllib/request.py", line 1240, in do_open
  File "http/client.py", line 1083, in request
  File "http/client.py", line 1128, in _send_request
  File "http/client.py", line 1079, in endheaders
  File "http/client.py", line 911, in _send_output
  File "http/client.py", line 854, in send
  File "http/client.py", line 826, in connect
  File "socket.py", line 707, in create_connection
  File "socket.py", line 698, in create_connection
TimeoutError: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 11, in <module>
  File "flintrock/flintrock.py", line 871, in main
  File "site-packages/click/core.py", line 716, in __call__
  File "site-packages/click/core.py", line 696, in main
  File "site-packages/click/core.py", line 1060, in invoke
  File "site-packages/click/core.py", line 889, in invoke
  File "site-packages/click/core.py", line 534, in invoke
  File "site-packages/click/decorators.py", line 17, in new_func
  File "flintrock/flintrock.py", line 322, in launch
  File "flintrock/ec2.py", line 46, in wrapper
  File "flintrock/ec2.py", line 629, in launch
  File "flintrock/core.py", line 431, in provision_cluster
  File "flintrock/services.py", line 333, in health_check
  File "urllib/request.py", line 162, in urlopen
  File "urllib/request.py", line 465, in open
  File "urllib/request.py", line 483, in _open
  File "urllib/request.py", line 443, in _call_chain
  File "urllib/request.py", line 1268, in http_open
  File "urllib/request.py", line 1242, in do_open
urllib.error.URLError: <urlopen error [Errno 110] Connection timed out>
standalone returned -1

nchammas commented 8 years ago

Hmm, so as you pointed out the log indicates that Spark is starting up fine. If you are on the master and run curl localhost:8080 do you get something back?

Flintrock is having trouble connecting to that same address (i.e. spark.master.ip:8080) from the machine where you are running Flintrock.

Next debugging steps:
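Judging from the traceback, the failing call is just urllib.request.urlopen against the master's web UI, so one thing to try is reproducing that exact check by hand from the machine where you run Flintrock. A rough sketch (this is a reconstruction from the traceback, not Flintrock's actual health_check code; substitute your master's public DNS name):

```python
import urllib.error
import urllib.request


def spark_master_is_up(host: str, port: int = 8080, timeout: float = 30.0) -> bool:
    """Return True if the Spark master web UI answers an HTTP GET.

    Mirrors what the traceback shows the health check doing: a plain
    urllib request against http://<master>:8080.
    """
    url = "http://{host}:{port}".format(host=host, port=port)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Covers both "connection refused" and "connection timed out",
        # which is the Errno 110 error in the traceback above.
        return False


if __name__ == "__main__":
    # The hostname below is the master from the logs above;
    # replace it with your own master's public DNS name.
    print(spark_master_is_up("ec2-54-165-104-183.compute-1.amazonaws.com"))
```

If this returns False from your laptop but curl works from the master itself, the problem is network access between the two machines, not Spark.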

asafcombo commented 8 years ago

Well, curl localhost:8080 returns an answer, and so does curl masterip:8080. The answer is:

<!DOCTYPE html><html>
      <head>
        <meta http-equiv="Content-type" content="text/html; charset=utf-8"/><link rel="stylesheet" href="/static/bootstrap.min.css" type="text/css"/><link rel="stylesheet" href="/static/vis.min.css" type="text/css"/><link rel="stylesheet" href="/static/webui.css" type="text/css"/><link rel="stylesheet" href="/static/timeline-view.css" type="text/css"/><script src="/static/sorttable.js"></script><script src="/static/jquery-1.11.1.min.js"></script><script src="/static/vis.min.js"></script><script src="/static/bootstrap-tooltip.js"></script><script src="/static/initialize-tooltips.js"></script><script src="/static/table.js"></script><script src="/static/additional-metrics.js"></script><script src="/static/timeline-view.js"></script>
        <title>Spark Master at spark://ec2-54-85-153-79.compute-1.amazonaws.com:7077</title>
      </head>
      <body>
        <div class="container-fluid">
          <div class="row-fluid">
            <div class="span12">
              <h3 style="vertical-align: middle; display: inline-block;">
                <a style="text-decoration: none" href="/">
                  <img src="/static/spark-logo-77x50px-hd.png"/>
                  <span class="version" style="margin-right: 15px;">1.6.1</span>
                </a>
                Spark Master at spark://ec2-54-85-153-79.compute-1.amazonaws.com:7077
              </h3>
            </div>
          </div>
          <div class="row-fluid">
          <div class="span12">
            <ul class="unstyled">
              <li><strong>URL:</strong> spark://ec2-54-85-153-79.compute-1.amazonaws.com:7077</li>
              <li>
                    <strong>REST URL:</strong> spark://ec2-54-85-153-79.compute-1.amazonaws.com:6066
                    <span class="rest-uri"> (cluster mode)</span>
                  </li>
              <li><strong>Alive Workers:</strong> 1</li>
              <li><strong>Cores in use:</strong> 1 Total,
                0 Used</li>
              <li><strong>Memory in use:</strong>
                2.7 GB Total,
...

Also, the security group is open to my IP.

Besides the original issue, is there a way for me to verify that the installation of Spark and the cluster went OK (assuming the only thing that went wrong is the verification)?

Update: I finally succeeded in launching a cluster. What I did was, the moment I sent the launch command, go to AWS and open the security group to all IPs on all ports. This is strange behavior, since the earlier curl to master:8080 showed that the security configuration was OK.
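To rule the security group in or out next time without opening everything, the group's ingress rules can be inspected programmatically. Below is a hedged sketch: the helper checks the rule structures that boto3's describe_security_groups returns, and the group name "flintrock" in the commented usage is a guess at what Flintrock creates, so adjust it to whatever appears in the console:

```python
def port_open_to(ip_permissions, port, cidr="0.0.0.0/0"):
    """Check whether any ingress rule in `ip_permissions` (the list of
    dicts that boto3's describe_security_groups returns per group)
    covers `port` for exactly the given `cidr`.

    Note: this does an exact CIDR string match, not subnet containment,
    so it's only a rough diagnostic.
    """
    for rule in ip_permissions:
        # "-1" means "all protocols" in EC2 rule structures.
        if rule.get("IpProtocol") not in ("tcp", "-1"):
            continue
        from_port = rule.get("FromPort", 0)
        to_port = rule.get("ToPort", 65535)
        if not (from_port <= port <= to_port):
            continue
        for ip_range in rule.get("IpRanges", []):
            if ip_range.get("CidrIp") == cidr:
                return True
    return False


# Hypothetical usage against a real account (needs boto3 and credentials):
#
#   import boto3
#   ec2 = boto3.client("ec2")
#   groups = ec2.describe_security_groups(
#       Filters=[{"Name": "group-name", "Values": ["flintrock"]}])
#   for group in groups["SecurityGroups"]:
#       print(group["GroupName"],
#             port_open_to(group["IpPermissions"], 8080, "1.2.3.4/32"))
```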

nchammas commented 8 years ago

That's weird.

The original error, "Spark health check failed.", gets raised when Flintrock can't query the Spark web UI. Doing curl master.ip:8080 manually is basically the same thing.

The only other explanation for what happened that I can think of is that Spark, for some reason, took a really long time to start up. So the health check failed when it ran, but if you just wait an extra minute or two and check manually, it's fine. I'm not sure why/when that would happen, though.
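If slow startup is the culprit, one workaround to experiment with is polling the UI for a while instead of checking it once. A minimal sketch of that idea (this is not Flintrock's actual API, just an illustration; the timeouts are arbitrary):

```python
import time
import urllib.error
import urllib.request


def wait_for_master(host: str, port: int = 8080,
                    deadline: float = 120.0, interval: float = 5.0) -> bool:
    """Poll http://<host>:<port> until it answers or `deadline` seconds pass."""
    stop_at = time.monotonic() + deadline
    url = "http://{host}:{port}".format(host=host, port=port)
    while time.monotonic() < stop_at:
        try:
            with urllib.request.urlopen(url, timeout=interval) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # Master not reachable yet; sleep and retry.
        time.sleep(interval)
    return False
```

With something like this, a master that takes an extra minute to come up would still pass, while a genuinely blocked port would fail only after the full deadline.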

Anyway, glad you got things working. I'm going to close this issue as "Can't reproduce", but feel free to reopen it with more details if you find something.