nchammas / flintrock

A command-line tool for launching Apache Spark clusters.
Apache License 2.0

cannot launch cluster - ConnectionResetError causes HDFS health check failed #250

Closed by ghost 6 years ago

ghost commented 6 years ago

Hello, as the title says, I am unable to launch a cluster using Flintrock. I am using:

My configuration file is:

services:
  spark:
    version: 2.2.0
  hdfs:
    version: 2.7.6
provider: ec2
providers:
  ec2:
    key-name: spark_cluster
    identity-file: /home/manuel/python3_env/spark_cluster.pem
    instance-type: t2.micro
    region: us-east-1
    ami: ami-97785bed  
    user: ec2-user
    tenancy: default  
    ebs-optimized: no 
    instance-initiated-shutdown-behavior: terminate  
launch:
  num-slaves: 1
  install-hdfs: True
  install-spark: True
debug: false

When I try to launch the cluster using flintrock launch my-cluster, the output is:


Launching 2 instances...
/usr/local/lib/python3.5/dist-packages/paramiko/rsakey.py:119: CryptographyDeprecationWarning: signer and verifier have been deprecated. Please use sign and verify instead.
  algorithm=hashes.SHA1(),
/usr/local/lib/python3.5/dist-packages/paramiko/rsakey.py:99: CryptographyDeprecationWarning: signer and verifier have been deprecated. Please use sign and verify instead.
  algorithm=hashes.SHA1(),
[54.172.86.177] SSH online.
[54.172.86.177] Configuring ephemeral storage...
[54.172.86.177] Installing Java 1.8...
[54.158.28.87] SSH online.
[54.158.28.87] Configuring ephemeral storage...
[54.158.28.87] Installing Java 1.8...
[54.172.86.177] Installing HDFS...
[54.158.28.87] Installing HDFS...
[54.172.86.177] Installing Spark...
[54.158.28.87] Installing Spark...
[54.172.86.177] Configuring HDFS master...
[54.172.86.177] Configuring Spark master...
Do you want to terminate the 2 instances created by this operation? [Y/n]: y
Terminating instances...
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/flintrock/services.py", line 233, in health_check
    .urlopen(hdfs_master_ui)
  File "/usr/lib/python3.5/urllib/request.py", line 163, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.5/urllib/request.py", line 466, in open
    response = self._open(req, data)
  File "/usr/lib/python3.5/urllib/request.py", line 484, in _open
    '_open', req)
  File "/usr/lib/python3.5/urllib/request.py", line 444, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.5/urllib/request.py", line 1282, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/usr/lib/python3.5/urllib/request.py", line 1257, in do_open
    r = h.getresponse()
  File "/usr/lib/python3.5/http/client.py", line 1197, in getresponse
    response.begin()
  File "/usr/lib/python3.5/http/client.py", line 297, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.5/http/client.py", line 258, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.5/socket.py", line 575, in readinto
    return self._sock.recv_into(b)
ConnectionResetError: [Errno 104] Connection reset by peer

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/flintrock", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.5/dist-packages/flintrock/flintrock.py", line 1132, in main
    cli(obj={})
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/flintrock/flintrock.py", line 403, in launch
    tags=ec2_tags)
  File "/usr/local/lib/python3.5/dist-packages/flintrock/ec2.py", line 53, in wrapper
    res = func(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/flintrock/ec2.py", line 954, in launch
    identity_file=identity_file)
  File "/usr/local/lib/python3.5/dist-packages/flintrock/core.py", line 647, in provision_cluster
    service.health_check(master_host=cluster.master_host)
  File "/usr/local/lib/python3.5/dist-packages/flintrock/services.py", line 238, in health_check
    raise Exception("HDFS health check failed.") from e
Exception: HDFS health check failed.

I am having a hard time understanding what the problem is here. Any ideas? Thank you!

nchammas commented 6 years ago

It is indeed hard to tell what went wrong here from just that error. What I usually do in cases like this is instruct Flintrock not to terminate the instances after a failed launch. Then I log in to the master and look at the relevant logs; in this case they'd be under hadoop/logs or something like that.
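For example (answering "n" at the termination prompt keeps the broken instances around; the exact log file names below are a guess, so list the directory first):

    flintrock login my-cluster                  # SSH into the master
    ls hadoop/logs/                             # see which log files exist
    less hadoop/logs/hadoop-*-namenode-*.log    # hypothetical filename pattern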

Do you have the same issues if you run Flintrock from master?

kavika-1 commented 6 years ago

In my case, the "HDFS health check failed" error was due to the admin box that launched Flintrock inside an EC2 VPC being unable to reach the master on port 50070. To work around this, I first edited the Flintrock cluster security group in the AWS console and added an inbound rule for TCP port 50070 from "Anywhere". This had the added bonus of allowing my own browser to reach the :50070 DFS health page. Note that you will need to add port 8080 as well for the Spark health check.
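For reference, the same rules can be added from the AWS CLI instead of the console. This is only a sketch: the group name below is hypothetical, so check the EC2 console (or flintrock describe) for the name Flintrock actually assigned.

    # Open the HDFS NameNode UI port to anywhere (use --group-id instead
    # of --group-name if the group lives in a non-default VPC):
    aws ec2 authorize-security-group-ingress --group-name flintrock-my-cluster \
        --protocol tcp --port 50070 --cidr 0.0.0.0/0
    # Same for the Spark master UI:
    aws ec2 authorize-security-group-ingress --group-name flintrock-my-cluster \
        --protocol tcp --port 8080 --cidr 0.0.0.0/0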

Alternatively, attaching the Flintrock cluster security group to the admin box also works.

The Flintrock base group does indeed add an entry opening these ports to the public IP address of the admin box, but because of how EC2 VPC routing works, the request arrives from the admin box's private (internal) IP address, so that rule never matches.

ghost commented 6 years ago

Thanks for the answers! I was able to identify and solve the problem: it was exactly what @kavika-1 said. After adding those rules to the Flintrock security group in the AWS console, the cluster now sets up properly. Cheers!

nchammas commented 6 years ago

Thanks for chiming in @kavika-1 and glad you figured it out @DalcaTN.

Do either of you know how the IP address of the inbound request appears in your case? I wouldn't want to add new rules to Flintrock to allow traffic from anywhere, but perhaps adding rules to allow traffic from 10.* might be appropriate here.

I'm not sure why a private IP address shows up in the first place for you (if you have any insight on this, I'd appreciate it, since others have reported similar issues in the past), but as a compromise solution, allowing traffic from private addresses should be fine.
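Something along these lines, scoped to the RFC 1918 10.0.0.0/8 block instead of 0.0.0.0/0 (again just a sketch; the group name is hypothetical):

    aws ec2 authorize-security-group-ingress --group-name flintrock-my-cluster \
        --protocol tcp --port 50070 --cidr 10.0.0.0/8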

kavika-1 commented 6 years ago

I can confirm that the inbound request IP (from the POV of the master) is the private IP of the admin box. The rules get created with the public IP.

My guess is that in our case, @DalcaTN and I probably both have "Auto-assign Public IP: yes" on our VPC subnet. Thus, the instances get two addresses: one private and one public.
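A quick way to confirm this from the admin box is the EC2 instance metadata service, which reports both addresses:

    curl http://169.254.169.254/latest/meta-data/local-ipv4     # private IP: what the master sees as the source
    curl http://169.254.169.254/latest/meta-data/public-ipv4    # public IP: what ends up in the security group rules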

I think spark-ec2 attempted to address such issues with the --private-ips flag?

  --private-ips         Use private IPs for instances rather than public if
                        VPC/subnet requires that.

Maybe a hint here: https://stackoverflow.com/questions/42654336/how-do-i-resolve-failed-to-determine-hostname-of-instance-error-using-spark-ec

ghost commented 6 years ago

Unfortunately I am not as experienced as @kavika-1, but I can tell you that I did not set Auto-assign Public IP to anything. So if "yes" is the default value, then it is the same for me.

nchammas commented 6 years ago

Ah OK. I think the key here is what you pointed out earlier @kavika-1:

the admin box that launched flint in an ec2 vpc

@DalcaTN - Are you also running Flintrock from a box running in EC2?

In this case I wonder if it's better for Flintrock to avoid these kinds of networking issues by querying the health check HTTP endpoints locally from the cluster master, rather than remotely from the Flintrock client.
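Roughly, instead of the client making the HTTP request itself, it would run the check over SSH, something like this (a sketch of the idea, not Flintrock's actual code):

    ssh -i spark_cluster.pem ec2-user@<master-ip> \
        'curl -s -o /dev/null -w "%{http_code}\n" http://localhost:50070'   # expect 200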

ghost commented 6 years ago

@nchammas I am not 100% sure what you mean by "box": I am just using Flintrock locally on my laptop to launch the remote cluster on EC2 instances.

nchammas commented 6 years ago

OK. I thought maybe you were running Flintrock from EC2.

(By "box running in EC2", I just mean an instance running in EC2.)