nchammas / flintrock

A command-line tool for launching Apache Spark clusters.
Apache License 2.0

Flintrock on EC2 SSH Timeout Errors #317

Status: Closed (rlaabs closed 3 years ago)

rlaabs commented 3 years ago

When launching Flintrock from an EC2 host, I always get SSH timeout errors. From the debug logs, it looks like it is having trouble connecting to the master, but I'm not sure why I still got the 'SSH online' messages.

Launching on a local machine with the exact same key and all other settings works correctly.

Hopefully someone can point out where I should dig deeper?

Debug output:

2020-12-04 16:09:40,405 - flintrock.ec2 - INFO - Requesting 3 spot instances at a max price of $0.65...
2020-12-04 16:09:41,330 - flintrock.ec2 - INFO - 0 of 3 instances granted. Waiting...
2020-12-04 16:10:11,552 - flintrock.ec2 - INFO - All 3 instances granted.
2020-12-04 16:10:22,454 - flintrock.ssh - INFO - [13.52.61.215] SSH online.
2020-12-04 16:10:22,460 - flintrock.ssh - INFO - [3.101.16.23] SSH online.
2020-12-04 16:10:22,462 - flintrock.ssh - INFO - [13.57.224.239] SSH online.
2020-12-04 16:10:22,592 - flintrock.core - INFO - [13.52.61.215] Configuring ephemeral storage...
2020-12-04 16:10:22,602 - flintrock.core - INFO - [3.101.16.23] Configuring ephemeral storage...
2020-12-04 16:10:22,605 - flintrock.core - INFO - [13.57.224.239] Configuring ephemeral storage...
2020-12-04 16:10:22,811 - flintrock.core - INFO - [13.52.61.215] Installing Java 1.8...
2020-12-04 16:10:22,835 - flintrock.core - INFO - [13.57.224.239] Installing Java 1.8...
2020-12-04 16:10:22,873 - flintrock.core - INFO - [3.101.16.23] Installing Java 1.8...
2020-12-04 16:10:40,607 - flintrock.services - INFO - [13.52.61.215] Installing HDFS...
2020-12-04 16:10:40,770 - flintrock.services - INFO - [3.101.16.23] Installing HDFS...
2020-12-04 16:10:40,925 - flintrock.services - INFO - [13.57.224.239] Installing HDFS...
2020-12-04 16:10:54,909 - flintrock.services - INFO - [13.52.61.215] Installing Spark...
2020-12-04 16:10:55,262 - flintrock.services - INFO - [3.101.16.23] Installing Spark...
2020-12-04 16:10:55,394 - flintrock.services - INFO - [13.57.224.239] Installing Spark...
2020-12-04 16:11:05,131 - flintrock.ssh - DEBUG - [ec2-13-52-61-215.us-west-1.compute.amazonaws.com] SSH timeout.
2020-12-04 16:11:13,140 - flintrock.ssh - DEBUG - [ec2-13-52-61-215.us-west-1.compute.amazonaws.com] SSH timeout.
2020-12-04 16:11:21,150 - flintrock.ssh - DEBUG - [ec2-13-52-61-215.us-west-1.compute.amazonaws.com] SSH timeout.
Terminating instances...
[ec2-13-52-61-215.us-west-1.compute.amazonaws.com] Could not connect via SSH.

nchammas commented 3 years ago

Hmm, it's weird that it connects to 13.52.61.215 successfully and starts installing stuff, but only later starts failing.

I see that it connects via IP at the start and then by DNS name towards the end. Maybe that's the issue?
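One way to test the IP-versus-DNS theory from the launching host is to compare what the DNS name resolves to with whether port 22 is reachable at each address. The sketch below is only a diagnostic aid and is not part of Flintrock; the function names are made up for illustration. A possible explanation worth checking: from inside EC2, a public DNS name can resolve to the instance's private address, and if the cluster's security group only admits the client's public address, that path may be blocked.

```python
import socket


def resolve(host, port=22):
    """Return the sorted IPv4 addresses `host` resolves to from this machine."""
    try:
        infos = socket.getaddrinfo(
            host, port, family=socket.AF_INET, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})


def ssh_port_open(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and resolution failures.
        return False


def compare(ip, dns_name, port=22, timeout=5.0):
    """Summarize reachability by IP and by DNS name (hypothetical helper)."""
    return {
        "ip_reachable": ssh_port_open(ip, port, timeout),
        "dns_resolves_to": resolve(dns_name, port),
        "dns_reachable": ssh_port_open(dns_name, port, timeout),
    }
```

Calling `compare("13.52.61.215", "ec2-13-52-61-215.us-west-1.compute.amazonaws.com")` on the EC2 host and on the local machine, while the cluster is up, would show whether the name resolves to a different address in each place.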

Can you try using Flintrock off the master branch? There are some potentially related changes that went in as part of #285, but they may not impact what you're seeing here. It's worth a test, though.

rlaabs commented 3 years ago

Tried again with 1.1.0.dev0, still seeing the same issue. Still digging, hopefully I'm just missing something simple.

nchammas commented 3 years ago

It looks like the switch from IP address to DNS name happens when Flintrock starts installing Spark.

Are you able to get a cluster to successfully launch with --no-install-spark? How about with that as well as --no-install-hadoop?

Obviously, that's not a fix, but the exercise might provide some interesting information.

rlaabs commented 3 years ago

When I tried launching with both --no-install-spark and --no-install-hadoop I got the same errors.

So in core.provision_cluster I changed the host passed to the get_ssh_client call to cluster.master_ip. I also changed the health check to use the IP instead of the host name.
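The shape of that change can be illustrated roughly as follows. This is a self-contained sketch, not Flintrock's actual code: `ClusterAddresses` and `ssh_target` are hypothetical names, and only `master_ip` and the idea of a host-name counterpart come from the thread.

```python
from dataclasses import dataclass


@dataclass
class ClusterAddresses:
    """Hypothetical stand-in for the master's two addresses."""
    master_ip: str    # e.g. the address that logged "SSH online"
    master_host: str  # e.g. the public DNS name that later timed out


def ssh_target(cluster, use_ip=True):
    """Pick the address to hand to the SSH client (assumed behavior)."""
    return cluster.master_ip if use_ip else cluster.master_host


cluster = ClusterAddresses(
    master_ip="13.52.61.215",
    master_host="ec2-13-52-61-215.us-west-1.compute.amazonaws.com",
)
```

With `use_ip=True`, every SSH connection targets the same address that succeeded at the start of the launch, which is the effect the change above aims for.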

It looks like it is able to provision the cluster and install HDFS and Spark with those changes. Or at least I didn't get any errors (I haven't tried running any jobs with it yet).

I'm not familiar enough with the code base yet to know if this is a viable fix. Do you anticipate any issues with this?

nchammas commented 3 years ago

That's a very good lead! I hit some weirdness in the past configuring Spark to use IP addresses vs. host names. You can see some of that history over on #43. I never got to the bottom of it.

I think your fix is worth submitting as a PR so we can discuss and test it in more detail. Would you be interested in doing that?

rlaabs commented 3 years ago

Happy to!

(Sorry for the delay, I got caught up in another possible issue with the Java version and the AWS SDK.)