nchammas / flintrock

A command-line tool for launching Apache Spark clusters.
Apache License 2.0

Flintrock on EC2 server fails to launch cluster (EMR works) #235

Closed steve-drew-strong-bridge closed 6 years ago

steve-drew-strong-bridge commented 6 years ago

We have success starting clusters from an EMR instance. However, our EC2 instances do not have the same success. We have the same versions installed on both, but EC2 consistently returns the following.

flintrock --config 1Node4xlarge.yaml launch sdfrtest21
2018-02-16 16:44:57,382 - flintrock.ec2 - INFO - Requesting 2 spot instances at a max price of $0.2...
2018-02-16 16:44:57,966 - flintrock.ec2 - INFO - 0 of 2 instances granted. Waiting...
2018-02-16 16:45:28,348 - flintrock.ec2 - INFO - All 2 instances granted.
2018-02-16 16:45:40,682 - flintrock.ssh - INFO - [54.202.97.216] SSH online.
2018-02-16 16:45:40,776 - flintrock.core - INFO - [54.202.97.216] Configuring ephemeral storage...
2018-02-16 16:45:43,586 - flintrock.ssh - DEBUG - [34.212.225.154] SSH timeout.
2018-02-16 16:45:48,592 - flintrock.ssh - DEBUG - [34.212.225.154] SSH exception: [Errno None] Unable to connect to port 22 on 34.212.225.154
2018-02-16 16:45:53,599 - flintrock.ssh - DEBUG - [34.212.225.154] SSH exception: [Errno None] Unable to connect to port 22 on 34.212.225.154
2018-02-16 16:45:58,605 - flintrock.ssh - DEBUG - [34.212.225.154] SSH exception: [Errno None] Unable to connect to port 22 on 34.212.225.154
2018-02-16 16:46:03,611 - flintrock.ssh - DEBUG - [34.212.225.154] SSH exception: [Errno None] Unable to connect to port 22 on 34.212.225.154
2018-02-16 16:46:08,718 - flintrock.ssh - INFO - [34.212.225.154] SSH online.
2018-02-16 16:46:08,821 - flintrock.core - INFO - [34.212.225.154] Configuring ephemeral storage...
2018-02-16 16:53:20,240 - flintrock.core - INFO - [54.202.97.216] Installing Java 1.8...
2018-02-16 16:53:31,120 - flintrock.services - INFO - [54.202.97.216] Installing HDFS...
2018-02-16 16:53:46,661 - flintrock.services - INFO - [54.202.97.216] Installing Spark...
2018-02-16 16:56:35,341 - flintrock.core - INFO - [34.212.225.154] Installing Java 1.8...
2018-02-16 16:56:52,146 - flintrock.services - INFO - [34.212.225.154] Installing HDFS...
2018-02-16 16:57:04,619 - flintrock.services - INFO - [34.212.225.154] Installing Spark...
2018-02-16 16:57:29,366 - flintrock.services - INFO - [172.31.8.120] Configuring HDFS master...
2018-02-16 16:57:53,220 - flintrock.services - INFO - [172.31.8.120] Configuring Spark master...
Do you want to terminate the 2 instances created by this operation? [Y/n]: y
Terminating instances...
Traceback (most recent call last):
  File "/usr/lib64/python3.4/urllib/request.py", line 1183, in do_open
    h.request(req.get_method(), req.selector, req.data, headers)
  File "/usr/lib64/python3.4/http/client.py", line 1137, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib64/python3.4/http/client.py", line 1182, in _send_request
    self.endheaders(body)
  File "/usr/lib64/python3.4/http/client.py", line 1133, in endheaders
    self._send_output(message_body)
  File "/usr/lib64/python3.4/http/client.py", line 963, in _send_output
    self.send(msg)
  File "/usr/lib64/python3.4/http/client.py", line 898, in send
    self.connect()
  File "/usr/lib64/python3.4/http/client.py", line 871, in connect
    self.timeout, self.source_address)
  File "/usr/lib64/python3.4/socket.py", line 516, in create_connection
    raise err
  File "/usr/lib64/python3.4/socket.py", line 507, in create_connection
    sock.connect(sa)
TimeoutError: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib64/python3.4/site-packages/flintrock/services.py", line 233, in health_check
    .urlopen(hdfs_master_ui)
  File "/usr/lib64/python3.4/urllib/request.py", line 161, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib64/python3.4/urllib/request.py", line 464, in open
    response = self._open(req, data)
  File "/usr/lib64/python3.4/urllib/request.py", line 482, in _open
    '_open', req)
  File "/usr/lib64/python3.4/urllib/request.py", line 442, in _call_chain
    result = func(*args)
  File "/usr/lib64/python3.4/urllib/request.py", line 1211, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/usr/lib64/python3.4/urllib/request.py", line 1185, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 110] Connection timed out>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/flintrock", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib64/python3.4/site-packages/flintrock/flintrock.py", line 1132, in main
    cli(obj={})
  File "/usr/local/lib/python3.4/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.4/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.4/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.4/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.4/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.4/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/local/lib64/python3.4/site-packages/flintrock/flintrock.py", line 403, in launch
    tags=ec2_tags)
  File "/usr/local/lib64/python3.4/site-packages/flintrock/ec2.py", line 53, in wrapper
    res = func(*args, **kwargs)
  File "/usr/local/lib64/python3.4/site-packages/flintrock/ec2.py", line 954, in launch
    identity_file=identity_file)
  File "/usr/local/lib64/python3.4/site-packages/flintrock/core.py", line 647, in provision_cluster
    service.health_check(master_host=cluster.master_host)
  File "/usr/local/lib64/python3.4/site-packages/flintrock/services.py", line 238, in health_check
    raise Exception("HDFS health check failed.") from e
Exception: HDFS health check failed.

nchammas commented 6 years ago

Seems like you have some weird things going on with networking. When you run Flintrock from EC2 vs. from EMR, are they running in the same VPC and subnet? What if you tried from the same VPC and subnet?
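One way to answer the VPC and subnet question is to compare the placement fields that EC2 reports for the two client machines. The following is only a sketch: it assumes the input is a dictionary shaped like the response of boto3's `describe_instances` call, and the helper names, instance IDs, and sample values are all made up for illustration.

```python
def network_placement(response):
    """Map instance ID -> (VPC ID, subnet ID, sorted security group IDs).

    `response` is assumed to follow the describe_instances response shape:
    {"Reservations": [{"Instances": [{...}, ...]}, ...]}.
    """
    placement = {}
    for reservation in response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            groups = sorted(
                group["GroupId"] for group in instance.get("SecurityGroups", [])
            )
            placement[instance["InstanceId"]] = (
                instance.get("VpcId"),
                instance.get("SubnetId"),
                groups,
            )
    return placement


def placement_differences(placement, first_id, second_id):
    """Return the names of the fields on which the two instances differ."""
    names = ("VpcId", "SubnetId", "SecurityGroups")
    pairs = zip(names, placement[first_id], placement[second_id])
    return [name for name, first, second in pairs if first != second]


# Hypothetical sample data, not taken from the reporter's account.
sample_response = {
    "Reservations": [
        {
            "Instances": [
                {
                    "InstanceId": "i-emr-client",
                    "VpcId": "vpc-1",
                    "SubnetId": "subnet-a",
                    "SecurityGroups": [{"GroupId": "sg-1"}],
                },
                {
                    "InstanceId": "i-ec2-client",
                    "VpcId": "vpc-1",
                    "SubnetId": "subnet-b",
                    "SecurityGroups": [{"GroupId": "sg-2"}, {"GroupId": "sg-1"}],
                },
            ]
        }
    ]
}

sample_placement = network_placement(sample_response)
print(placement_differences(sample_placement, "i-emr-client", "i-ec2-client"))
# prints ['SubnetId', 'SecurityGroups']
```

With real data, the dictionary would come from something like `boto3.client("ec2").describe_instances(InstanceIds=[...])`, passing the IDs of the EMR node and the EC2 instance that Flintrock is run from.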

steve-drew-strong-bridge commented 6 years ago

Yes, they are in the same VPC. Oddly, it feels like it's an AMI issue with whatever AMI created these EC2 instances. Other than this, they seem to perform identically using the AWS CLI, Spark, Hadoop, etc.

However, if I go to one of the EC2 instances created by Flintrock, I am able to install Flintrock and start other clusters. So, it's odd to spin up EMR once to get the ball rolling, but after that we seem to be functional. :-)

nchammas commented 6 years ago

That's really strange. It looks like an issue with networking, which would most likely be caused by the VPC, the subnet, or some security group rules. Is the subnet the same, too?
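A quick way to test the networking theory is to check, from the machine running Flintrock, whether a plain TCP connection to the cluster master opens at all. This is a minimal sketch using only the standard library; `can_connect` and `check_ports` are hypothetical helper names, not part of Flintrock.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # OSError covers timeouts, refused connections, and unreachable hosts.
        return False


def check_ports(host, ports, timeout=5.0):
    """Map each port to whether a TCP connection to it succeeded."""
    return {port: can_connect(host, port, timeout) for port in ports}
```

For example, `check_ports("172.31.8.120", [22, 50070])` could be run once from the EMR node and once from the failing EC2 instance, using the master address from the log in the original report. Port 50070 is an assumption here: it is the default NameNode web UI port in Hadoop 2.x, and the traceback does not show which port the health check actually requested. If a port is reachable from one client and not the other, security group or subnet rules are a likely place to look.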

Also, does Flintrock work fine when running from a user's local workstation against EC2?

nchammas commented 6 years ago

@steve-drew-strong-bridge - Do you have more information to share on this issue, specifically regarding the 2 questions from my latest comment? If not, I'll have to close this since there isn't enough information here to help me figure out what's going on.

steve-drew-strong-bridge commented 6 years ago

@nchammas, you can close this one. I've got a workaround: we use a minimal-cost EMR instance to spin up our clusters. When I have a minute, I'll set up a Docker container on my workstation and try launching a cluster from there as well.

nchammas commented 6 years ago

OK. I'll close this issue. If you have additional details to share about what's going on here, I'm happy to reopen it and continue debugging with you.