nchammas / flintrock

A command-line tool for launching Apache Spark clusters.
Apache License 2.0

Exception: Spark health check failed #288

Closed Fuzzy-sh closed 4 years ago

Fuzzy-sh commented 5 years ago

Flintrock version: 11.0
Python version: 3.5.5 (Anaconda)
OS: Amazon Linux 2 AMI 2.0.20181114 x86_64 HVM EBS (ami-0b8d0d6ac70e5750c)
Region: N. Virginia (us-east-1)

config.yaml:

services:
  spark:
    version: 2.3.3
    # git-commit: latest  # if not 'latest', provide a full commit SHA; e.g. d6dc12ef0146ae409834c78737c116050961f350
    # git-repository:  # optional; defaults to https://github.com/apache/spark
    # optional; defaults to download from the official Spark S3 bucket
    #   - must contain a {v} template corresponding to the version
    #   - Spark must be pre-built
    #   - must be a tar.gz file
    # download-source: "https://www.example.com/files/spark/{v}/spark-{v}.tar.gz"
    download-source: "https://www-us.apache.org/dist/spark/spark-{v}/spark-{v}-bin-hadoop2.7.tgz"
    # executor-instances: 1
  hdfs:
    version: 2.8.5
    # optional; defaults to download from a dynamically selected Apache mirror
    #   - must contain a {v} template corresponding to the version
    #   - must be a .tar.gz file
    # download-source: "https://www.example.com/files/hadoop/{v}/hadoop-{v}.tar.gz"
    # download-source: "http://www-us.apache.org/dist/hadoop/common/hadoop-{v}/hadoop-{v}.tar.gz"
provider: ec2

providers:
  ec2:
    key-name: Key-flint
    identity-file: /home/ec2-user/.ssh/Key-flint.pem
    instance-type: t2.micro
    region: us-east-1
    # availability-zone: <name>
    ami: ami-0b8d0d6ac70e5750c  # Amazon Linux 2, us-east-1
    user: ec2-user
    # ami: ami-61bbf104  # CentOS 7, us-east-1
    # user: centos
    # spot-price: <price>
    # vpc-id: <id>
    # subnet-id: <id>
    # placement-group: <name>
    # security-groups:
    #   - group-name1
    #   - group-name2
    # instance-profile-name:
    # tags:
    #   - key1,value1
    #   - key2, value2  # leading/trailing spaces are trimmed
    #   - key3,  # value will be empty
    # min-root-ebs-size-gb: <size-gb>
    tenancy: default  # default | dedicated
    ebs-optimized: no  # yes | no
    instance-initiated-shutdown-behavior: terminate  # terminate | stop
    # user-data: /path/to/userdata/script

launch:
  num-slaves: 1
  # install-hdfs: True
  install-spark: True

debug: false
Fuzzy-sh commented 5 years ago

Hello dear @nchammas, I get this error (see attached screenshot).

After pressing y: (see attached screenshot)

I have seen in the config.yaml that the source should be in .tar.gz format (# download-source: "https://www.example.com/files/spark/{v}/spark-{v}.tar.gz"), but I could not find any spark-2.3.3.tar.gz.

Fuzzy-sh commented 5 years ago

Dear @nchammas, here is the output when install-spark is set to False in config.yaml. It works fine.

[ec2-user@ip-172-31-39-241 ~]$ flintrock launch flint-test
2019-06-06 04:17:34,407 - flintrock.ec2       - INFO  - Launching 2 instances...
2019-06-06 04:17:46,361 - flintrock.ec2       - DEBUG - 2 instances not in state 'running': 'i-0c1a896d36dac451b', 'i-0f615176da68db3bc', ...
2019-06-06 04:17:49,499 - flintrock.ec2       - DEBUG - 2 instances not in state 'running': 'i-0c1a896d36dac451b', 'i-0f615176da68db3bc', ...
2019-06-06 04:17:52,604 - flintrock.ec2       - DEBUG - 2 instances not in state 'running': 'i-0c1a896d36dac451b', 'i-0f615176da68db3bc', ...
2019-06-06 04:17:55,754 - flintrock.ec2       - DEBUG - 2 instances not in state 'running': 'i-0c1a896d36dac451b', 'i-0f615176da68db3bc', ...
2019-06-06 04:17:58,885 - flintrock.ec2       - DEBUG - 2 instances not in state 'running': 'i-0c1a896d36dac451b', 'i-0f615176da68db3bc', ...
2019-06-06 04:18:02,051 - flintrock.ec2       - DEBUG - 1 instances not in state 'running': 'i-0c1a896d36dac451b', ...
2019-06-06 04:18:05,154 - flintrock.ec2       - DEBUG - 1 instances not in state 'running': 'i-0c1a896d36dac451b', ...
2019-06-06 04:18:11,352 - flintrock.ssh       - DEBUG - [54.196.73.100] SSH timeout.
2019-06-06 04:18:11,352 - flintrock.ssh       - DEBUG - [54.211.211.220] SSH timeout.
2019-06-06 04:18:16,358 - flintrock.ssh       - DEBUG - [54.196.73.100] SSH exception: [Errno None] Unable to connect to port 22 on 54.196.73.100
2019-06-06 04:18:16,639 - flintrock.ssh       - INFO  - [54.211.211.220] SSH online.
2019-06-06 04:18:16,944 - flintrock.core      - INFO  - [54.211.211.220] Configuring ephemeral storage...
2019-06-06 04:18:17,384 - flintrock.core      - INFO  - [54.211.211.220] Installing Java 1.8...
2019-06-06 04:18:21,555 - flintrock.ssh       - INFO  - [54.196.73.100] SSH online.
2019-06-06 04:18:21,838 - flintrock.core      - INFO  - [54.196.73.100] Configuring ephemeral storage...
2019-06-06 04:18:22,481 - flintrock.core      - INFO  - [54.196.73.100] Installing Java 1.8...
2019-06-06 04:18:49,758 - flintrock.ec2       - INFO  - launch finished in 0:01:19.
Cluster master: ec2-54-196-73-100.compute-1.amazonaws.com
Login with: flintrock login flint-test

However, when I change it to install-spark: True, I get the error below (the IP addresses of the master and the slave change between runs).

[ec2-user@ip-172-31-39-241 ~]$ flintrock launch flint-test
2019-06-06 04:20:49,442 - flintrock.ec2       - INFO  - Launching 2 instances...
2019-06-06 04:21:01,329 - flintrock.ec2       - DEBUG - 2 instances not in state 'running': 'i-0184d608d100651fe', 'i-06d89fac19734310e', ...
2019-06-06 04:21:04,462 - flintrock.ec2       - DEBUG - 2 instances not in state 'running': 'i-0184d608d100651fe', 'i-06d89fac19734310e', ...
2019-06-06 04:21:07,559 - flintrock.ec2       - DEBUG - 2 instances not in state 'running': 'i-0184d608d100651fe', 'i-06d89fac19734310e', ...
2019-06-06 04:21:10,671 - flintrock.ec2       - DEBUG - 2 instances not in state 'running': 'i-0184d608d100651fe', 'i-06d89fac19734310e', ...
2019-06-06 04:21:13,775 - flintrock.ec2       - DEBUG - 2 instances not in state 'running': 'i-0184d608d100651fe', 'i-06d89fac19734310e', ...
2019-06-06 04:21:19,964 - flintrock.ssh       - DEBUG - [52.70.94.72] SSH timeout.
2019-06-06 04:21:19,964 - flintrock.ssh       - DEBUG - [3.83.155.41] SSH timeout.
2019-06-06 04:21:24,968 - flintrock.ssh       - DEBUG - [52.70.94.72] SSH exception: [Errno None] Unable to connect to port 22 on 52.70.94.72
2019-06-06 04:21:24,969 - flintrock.ssh       - DEBUG - [3.83.155.41] SSH exception: [Errno None] Unable to connect to port 22 on 3.83.155.41
2019-06-06 04:21:30,200 - flintrock.ssh       - INFO  - [52.70.94.72] SSH online.
2019-06-06 04:21:30,256 - flintrock.ssh       - INFO  - [3.83.155.41] SSH online.
2019-06-06 04:21:30,440 - flintrock.core      - INFO  - [52.70.94.72] Configuring ephemeral storage...
2019-06-06 04:21:30,609 - flintrock.core      - INFO  - [3.83.155.41] Configuring ephemeral storage...
2019-06-06 04:21:30,872 - flintrock.core      - INFO  - [52.70.94.72] Installing Java 1.8...
2019-06-06 04:21:31,046 - flintrock.core      - INFO  - [3.83.155.41] Installing Java 1.8...
2019-06-06 04:21:50,228 - flintrock.services  - INFO  - [52.70.94.72] Installing Spark...
2019-06-06 04:22:00,064 - flintrock.services  - INFO  - [3.83.155.41] Installing Spark...
2019-06-06 04:22:07,991 - flintrock.services  - INFO  - [52.70.94.72] Configuring Spark master...
2019-06-06 04:23:38,149 - flintrock.services  - DEBUG - Timed out waiting for Spark master to come up. Trying again...
2019-06-06 04:25:08,243 - flintrock.services  - DEBUG - Timed out waiting for Spark master to come up. Trying again...
Do you want to terminate the 2 instances created by this operation? [Y/n]: y
Terminating instances...
Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/lib/python3.5/urllib/request.py", line 1254, in do_open
    h.request(req.get_method(), req.selector, req.data, headers)
  File "/home/ec2-user/anaconda3/lib/python3.5/http/client.py", line 1106, in request
    self._send_request(method, url, body, headers)
  File "/home/ec2-user/anaconda3/lib/python3.5/http/client.py", line 1151, in _send_request
    self.endheaders(body)
  File "/home/ec2-user/anaconda3/lib/python3.5/http/client.py", line 1102, in endheaders
    self._send_output(message_body)
  File "/home/ec2-user/anaconda3/lib/python3.5/http/client.py", line 934, in _send_output
    self.send(msg)
  File "/home/ec2-user/anaconda3/lib/python3.5/http/client.py", line 877, in send
    self.connect()
  File "/home/ec2-user/anaconda3/lib/python3.5/http/client.py", line 849, in connect
    (self.host,self.port), self.timeout, self.source_address)
  File "/home/ec2-user/anaconda3/lib/python3.5/socket.py", line 711, in create_connection
    raise err
  File "/home/ec2-user/anaconda3/lib/python3.5/socket.py", line 702, in create_connection
    sock.connect(sa)
TimeoutError: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/lib/python3.5/site-packages/flintrock/services.py", line 415, in health_check
    .urlopen(spark_master_ui)
  File "/home/ec2-user/anaconda3/lib/python3.5/urllib/request.py", line 163, in urlopen
    return opener.open(url, data, timeout)
  File "/home/ec2-user/anaconda3/lib/python3.5/urllib/request.py", line 466, in open
    response = self._open(req, data)
  File "/home/ec2-user/anaconda3/lib/python3.5/urllib/request.py", line 484, in _open
    '_open', req)
  File "/home/ec2-user/anaconda3/lib/python3.5/urllib/request.py", line 444, in _call_chain
    result = func(*args)
  File "/home/ec2-user/anaconda3/lib/python3.5/urllib/request.py", line 1282, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/home/ec2-user/anaconda3/lib/python3.5/urllib/request.py", line 1256, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 110] Connection timed out>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/bin/flintrock", line 11, in <module>
    sys.exit(main())
  File "/home/ec2-user/anaconda3/lib/python3.5/site-packages/flintrock/flintrock.py", line 1187, in main
    cli(obj={})
  File "/home/ec2-user/anaconda3/lib/python3.5/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/ec2-user/anaconda3/lib/python3.5/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/ec2-user/anaconda3/lib/python3.5/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ec2-user/anaconda3/lib/python3.5/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ec2-user/anaconda3/lib/python3.5/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/ec2-user/anaconda3/lib/python3.5/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/ec2-user/anaconda3/lib/python3.5/site-packages/flintrock/flintrock.py", line 456, in launch
    tags=ec2_tags)
  File "/home/ec2-user/anaconda3/lib/python3.5/site-packages/flintrock/ec2.py", line 53, in wrapper
    res = func(*args, **kwargs)
  File "/home/ec2-user/anaconda3/lib/python3.5/site-packages/flintrock/ec2.py", line 955, in launch
    identity_file=identity_file)
  File "/home/ec2-user/anaconda3/lib/python3.5/site-packages/flintrock/core.py", line 654, in provision_cluster
    service.health_check(master_host=cluster.master_host)
  File "/home/ec2-user/anaconda3/lib/python3.5/site-packages/flintrock/services.py", line 425, in health_check
    raise Exception("Spark health check failed.") from e
Exception: Spark health check failed.
nchammas commented 5 years ago

Sorry about the delay here! Will take a look at this Monday.

Fuzzy-sh commented 5 years ago

Thanks for your consideration. I look forward to hearing from you.

nchammas commented 5 years ago

Have seen in the config.yaml that the source should be in .tar.gz format

.tar.gz and .tgz are interchangeable. Either should work. Just make sure that the URL you configure exists once the version has been substituted in.
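If it helps to sanity-check the configured URL, you can substitute the version into the {v} template by hand and probe the result before launching. This is a minimal illustrative sketch, not Flintrock's internal code; the HEAD-request check is optional and my own suggestion:

```python
# Substitute the configured Spark version into the download-source template,
# mimicking the {v} substitution described above, then print the final URL.
template = "https://www-us.apache.org/dist/spark/spark-{v}/spark-{v}-bin-hadoop2.7.tgz"
version = "2.3.3"

url = template.format(v=version)
print(url)

# Optionally verify the URL actually exists before launching a cluster:
# import urllib.request
# req = urllib.request.Request(url, method="HEAD")
# urllib.request.urlopen(req, timeout=10)  # raises URLError/HTTPError if unreachable
```

If the printed URL 404s in a browser, the launch will fail at the "Installing Spark" step regardless of the .tar.gz/.tgz extension.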

Exception: Spark health check failed.

When you see this error, choose not to terminate the instances and instead log in to the cluster master and take a look at the logs under ~/spark/. They should give you more specific details about why the Spark health check failed.
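For context, the traceback shows the health check failing inside urllib while fetching the Spark master web UI, so the check is essentially an HTTP probe of the master. Here is a simplified sketch of that kind of probe; port 8080 is the Spark standalone master's default UI port, but the function name and timeout are my own illustrative choices, not Flintrock's exact code:

```python
import urllib.request
import urllib.error


def probe_spark_master(master_host: str, port: int = 8080, timeout: int = 10) -> bool:
    """Return True if the Spark master web UI responds over HTTP."""
    spark_master_ui = "http://{m}:{p}".format(m=master_host, p=port)
    try:
        urllib.request.urlopen(spark_master_ui, timeout=timeout)
        return True
    except (urllib.error.URLError, OSError):
        # Errno 110 (connection timed out), as in the traceback above, lands
        # here when the master never came up or the port is unreachable.
        return False


# A host/port that refuses connections yields False rather than a traceback:
print(probe_spark_master("127.0.0.1", port=1, timeout=2))
```

In other words, "Spark health check failed" means the master UI never answered; the master logs under ~/spark/ will say why the daemon did not start.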

nchammas commented 4 years ago

Any update here, @fazish? Did you look into the logs, per my previous message?

Fuzzy-sh commented 4 years ago

Hello dear @nchammas. Sorry, I have not had a chance yet; I am working on something else. If you want, you can close the issue and I will reopen it if I run into difficulties again.

nchammas commented 4 years ago

OK, sounds good to me.

ethanywang commented 4 years ago

Any updates now?