Closed pragnesh closed 6 years ago
Since I am launching the Flintrock cluster from an EC2 instance in the same region, I have noticed that the public DNS name switches from resolving to the public IP address to resolving to the private IP address while the cluster is launching. This looks like the reason why, even though Flintrock has already finished some steps, it then fails to connect: the IP address has switched. Looking at the code in the get_ssh_client function, the number of tries is set to 1 if the wait parameter is false, which seems quite low to me.
def get_ssh_client(
        *,
        user: str,
        host: str,
        identity_file: str,
        wait: bool=False,
        print_status: bool=None) -> paramiko.client.SSHClient:
    """
    Get an SSH client for the provided host, waiting as necessary for SSH to become
    available.
    """
    if print_status is None:
        print_status = wait
    client = paramiko.client.SSHClient()
    client.load_system_host_keys()
    client.set_missing_host_key_policy(paramiko.client.AutoAddPolicy())
    if wait:
        tries = 100
    else:
        tries = 1
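The quoted snippet stops before the connection loop, so the following is only a minimal sketch of how a `tries` count like the one above typically bounds a retry loop. It is not Flintrock's actual code: `connect_with_retries`, its parameters, and the 5-second default delay are all assumptions made for illustration.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def connect_with_retries(
        connect: Callable[[], T],
        *,
        tries: int,
        delay: float = 5.0,
        retry_on: Tuple[Type[BaseException], ...] = (OSError,),
        sleep: Callable[[float], None] = time.sleep) -> T:
    """
    Hypothetical helper: call `connect` up to `tries` times, sleeping
    `delay` seconds after each failed attempt except the last one.
    The last error is re-raised if every attempt fails.
    """
    if tries < 1:
        raise ValueError("tries must be at least 1")
    for attempt in range(1, tries + 1):
        try:
            return connect()
        except retry_on:
            # Refused connections, socket timeouts, and DNS lookup errors
            # are all OSError subclasses in Python 3.
            if attempt == tries:
                raise
            sleep(delay)
    raise AssertionError("unreachable")
```

With `tries=1` a single transient failure is fatal; with a higher count the same failure is retried after a short pause. A caller using paramiko would probably also want to include `paramiko.ssh_exception.SSHException` in `retry_on`.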
How is your VPC setup? Flintrock will definitely get confused if the reported address flip-flops between public and private.
Also, what do you see when this happens with --debug enabled?
We use the VPC created by an EMR job, so I believe the VPC itself doesn't have an issue, since we run EMR jobs daily alongside Flintrock jobs. I am not sure what exactly you want to know about the VPC setup.
I haven't tried the --debug flag yet. I will try it out and post the log.
Here is the log. It did not fail this time, but I can see that it consistently switches to the private IP address when it tries to start the HDFS and Spark cluster. We did not have this issue earlier, but after a recent update we started seeing it.
ubuntu@ip-172-30-0-42:/flintrock_config$ flintrock --debug --config config.yaml launch test
2017-05-06 04:06:31,159 - flintrock.ec2 - INFO - Requesting 2 spot instances at a max price of $0.5...
2017-05-06 04:06:31,586 - flintrock.ec2 - INFO - 0 of 2 instances granted. Waiting...
2017-05-06 04:07:01,755 - flintrock.ec2 - INFO - All 2 instances granted.
2017-05-06 04:07:12,780 - flintrock.ssh - DEBUG - [54.254.157.196] SSH exception: [Errno None] Unable to connect to port 22 on 54.254.157.196
2017-05-06 04:07:12,780 - flintrock.ssh - DEBUG - [54.169.232.150] SSH exception: [Errno None] Unable to connect to port 22 on 54.169.232.150
2017-05-06 04:07:17,873 - flintrock.ssh - INFO - [54.254.157.196] SSH online.
2017-05-06 04:07:17,935 - flintrock.ssh - INFO - [54.169.232.150] SSH online.
2017-05-06 04:07:18,002 - flintrock.core - INFO - [54.254.157.196] Configuring ephemeral storage...
2017-05-06 04:07:18,134 - flintrock.core - INFO - [54.169.232.150] Configuring ephemeral storage...
2017-05-06 04:07:18,217 - flintrock.core - INFO - [54.254.157.196] Installing Java 1.8...
2017-05-06 04:07:18,367 - flintrock.core - INFO - [54.169.232.150] Installing Java 1.8...
2017-05-06 04:07:22,517 - flintrock.services - INFO - [54.254.157.196] Installing HDFS...
2017-05-06 04:07:24,193 - flintrock.services - INFO - [54.169.232.150] Installing HDFS...
2017-05-06 04:07:31,739 - flintrock.services - INFO - [54.254.157.196] Installing Spark...
2017-05-06 04:07:32,771 - flintrock.services - INFO - [54.169.232.150] Installing Spark...
2017-05-06 04:08:00,302 - flintrock.services - INFO - [172.30.0.177] Configuring HDFS master...
2017-05-06 04:08:17,865 - flintrock.services - INFO - [172.30.0.177] Configuring Spark master...
2017-05-06 04:08:45,937 - flintrock.services - INFO - HDFS online.
2017-05-06 04:08:45,997 - flintrock.services - INFO - Spark Health Report:
* Master: ALIVE
* Workers: 2
* Cores: 8
* Memory: 57.9 GB
2017-05-06 04:08:46,001 - flintrock.ec2 - INFO - launch finished in 0:02:19.
Here is the debug log from a launch that failed. I tried again immediately and it succeeded.
ubuntu@ip-172-30-0-42:/flintrock_config$ flintrock --debug --config config.yaml launch test
2017-05-09 04:02:23,129 - flintrock.ec2 - INFO - Requesting 2 spot instances at a max price of $0.5...
2017-05-09 04:02:23,461 - flintrock.ec2 - INFO - 0 of 2 instances granted. Waiting...
2017-05-09 04:02:53,618 - flintrock.ec2 - INFO - All 2 instances granted.
2017-05-09 04:03:04,470 - flintrock.ssh - DEBUG - [54.254.195.178] SSH exception: [Errno None] Unable to connect to port 22 on 54.254.195.178
2017-05-09 04:03:04,470 - flintrock.ssh - DEBUG - [54.255.229.239] SSH exception: [Errno None] Unable to connect to port 22 on 54.255.229.239
2017-05-09 04:03:09,476 - flintrock.ssh - DEBUG - [54.255.229.239] SSH exception: [Errno None] Unable to connect to port 22 on 54.255.229.239
2017-05-09 04:03:09,476 - flintrock.ssh - DEBUG - [54.254.195.178] SSH exception: [Errno None] Unable to connect to port 22 on 54.254.195.178
2017-05-09 04:03:14,482 - flintrock.ssh - DEBUG - [54.254.195.178] SSH exception: [Errno None] Unable to connect to port 22 on 54.254.195.178
2017-05-09 04:03:14,587 - flintrock.ssh - INFO - [54.255.229.239] SSH online.
2017-05-09 04:03:14,811 - flintrock.core - INFO - [54.255.229.239] Configuring ephemeral storage...
2017-05-09 04:03:15,063 - flintrock.core - INFO - [54.255.229.239] Installing Java 1.8...
2017-05-09 04:03:19,594 - flintrock.ssh - INFO - [54.254.195.178] SSH online.
2017-05-09 04:03:19,862 - flintrock.core - INFO - [54.254.195.178] Configuring ephemeral storage...
2017-05-09 04:03:20,124 - flintrock.core - INFO - [54.254.195.178] Installing Java 1.8...
2017-05-09 04:03:21,299 - flintrock.services - INFO - [54.255.229.239] Installing HDFS...
2017-05-09 04:03:26,365 - flintrock.services - INFO - [54.254.195.178] Installing HDFS...
2017-05-09 04:03:30,574 - flintrock.services - INFO - [54.255.229.239] Installing Spark...
2017-05-09 04:03:35,975 - flintrock.services - INFO - [54.254.195.178] Installing Spark...
2017-05-09 04:04:06,200 - flintrock.ssh - DEBUG - [ec2-54-254-195-178.ap-southeast-1.compute.amazonaws.com] SSH timeout.
Do you want to terminate the 2 instances created by this operation? [Y/n]: y
Terminating instances...
[ec2-54-254-195-178.ap-southeast-1.compute.amazonaws.com] Could not connect via SSH.
ubuntu@ip-172-30-0-42:/flintrock_config$ flintrock --debug --config config.yaml launch test
2017-05-09 04:04:45,036 - flintrock.ec2 - INFO - Requesting 2 spot instances at a max price of $0.5...
2017-05-09 04:04:45,358 - flintrock.ec2 - INFO - 0 of 2 instances granted. Waiting...
2017-05-09 04:05:15,502 - flintrock.ec2 - INFO - All 2 instances granted.
2017-05-09 04:05:26,338 - flintrock.ssh - DEBUG - [13.228.27.57] SSH exception: [Errno None] Unable to connect to port 22 on 13.228.27.57
2017-05-09 04:05:26,338 - flintrock.ssh - DEBUG - [13.228.25.123] SSH exception: [Errno None] Unable to connect to port 22 on 13.228.25.123
2017-05-09 04:05:31,344 - flintrock.ssh - DEBUG - [13.228.25.123] SSH exception: [Errno None] Unable to connect to port 22 on 13.228.25.123
2017-05-09 04:05:31,427 - flintrock.ssh - INFO - [13.228.27.57] SSH online.
2017-05-09 04:05:31,541 - flintrock.core - INFO - [13.228.27.57] Configuring ephemeral storage...
2017-05-09 04:05:31,723 - flintrock.core - INFO - [13.228.27.57] Installing Java 1.8...
2017-05-09 04:05:36,436 - flintrock.ssh - INFO - [13.228.25.123] SSH online.
2017-05-09 04:05:36,552 - flintrock.core - INFO - [13.228.25.123] Configuring ephemeral storage...
2017-05-09 04:05:36,767 - flintrock.core - INFO - [13.228.25.123] Installing Java 1.8...
2017-05-09 04:05:38,253 - flintrock.services - INFO - [13.228.27.57] Installing HDFS...
2017-05-09 04:05:40,975 - flintrock.services - INFO - [13.228.25.123] Installing HDFS...
2017-05-09 04:05:47,262 - flintrock.services - INFO - [13.228.27.57] Installing Spark...
2017-05-09 04:05:49,504 - flintrock.services - INFO - [13.228.25.123] Installing Spark...
2017-05-09 04:06:16,923 - flintrock.services - INFO - [172.30.0.152] Configuring HDFS master...
2017-05-09 04:06:35,650 - flintrock.services - INFO - [172.30.0.152] Configuring Spark master...
2017-05-09 04:07:03,315 - flintrock.services - INFO - HDFS online.
2017-05-09 04:07:03,389 - flintrock.services - INFO - Spark Health Report:
* Master: ALIVE
* Workers: 2
* Cores: 8
* Memory: 57.9 GB
2017-05-09 04:07:03,393 - flintrock.ec2 - INFO - launch finished in 0:02:22.
Cluster master: ec2-13-228-27-57.ap-southeast-1.compute.amazonaws.com
Login with: flintrock login test
ubuntu@ip-172-30-0-42:/flintrock_config$
Hmm, this is strange and I am not sure why it would happen. Is anything about the EMR VPC changing while Flintrock is doing its work? For some reason when Flintrock queries the master IP here it occasionally gets a private IP address.
No, nothing is changing in the EMR VPC while Flintrock is launching the cluster.
I think that when someone launches an EC2 instance with a public IP inside a VPC and then resolves its public DNS name from within the same VPC, the name initially resolves to the public IP address, but after a minute or so it switches to the private IP address.
I have increased the default number of tries from 1 to 5 in flintrock/ssh.py. Since this change I haven't seen a failed launch.
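One way to check the DNS behaviour described above is to resolve the instance's public DNS name from the launching machine a few times during a launch and see whether the answer is a public or a private address. The sketch below uses only the Python standard library; the function names are made up for illustration and are not part of Flintrock.

```python
import ipaddress
import socket
from typing import Tuple


def classify_address(address: str) -> str:
    """
    Return 'private' if `address` falls in a range the `ipaddress` module
    treats as private (for example 172.16.0.0/12), otherwise 'public'.
    """
    return "private" if ipaddress.ip_address(address).is_private else "public"


def resolve_and_classify(hostname: str) -> Tuple[str, str]:
    """Resolve `hostname` to an IPv4 address and classify the result."""
    address = socket.gethostbyname(hostname)
    return address, classify_address(address)


# Addresses taken from the logs earlier in this thread:
print(classify_address("54.254.195.178"))  # public
print(classify_address("172.30.0.152"))    # private
```

Calling `resolve_and_classify` on the master's public DNS name at intervals during a launch should show whether, and when, the answer flips from a public to a private address.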
Closing this issue since @pragnesh has a workaround and since I couldn't get to a root cause.
Can you make this a configuration option within the yaml file so we don't have to find and change the ssh.py file each time we install flintrock on a new instance?
@steve-drew-strong-bridge - Not sure what option specifically you're asking for. Can you clarify?
You shouldn't need to do anything when launching a cluster if your VPC is set up correctly, has an Internet gateway attached, and assigns public IPs.
Sure @nchammas - apologies for the lack of clarity. We still seem to battle with the ssh connection issues. For most cases, if we locate the "tries = 1" section of ssh.py and set it to 5 as suggested in this thread we are able to launch clusters.
The ask here was to make the 'tries' an option in the config file so that we could just update the YAML files we deploy instead of locating the ssh.py script after each install of flintrock. It was a lazy request. :-)
FYI - About 1 in 5 clusters we spin up to use Flintrock continues to have the ssh connection errors while trying to install Java during execution of flintrock launch. We're still trying to figure out exactly what's happening there, but we do note (as pointed out here) that the IP address switches from the internal IP address to the external IP address when it fails.
@steve-drew-strong-bridge - Thanks for elaborating. I suppose until we have more clarity on why Flintrock sometimes sees these private IPs, perhaps it's easiest to just set the default to 3 tries. Or does it really need to be 5? I'd prefer a lower count so that when there is a real issue, the user doesn't have to wait long to find out.
@nchammas, I know it's been a few days, but I'm still trying to track down these odd failures. Regarding the default number of tries, I have a different suggestion. While it's slightly more work, I'd suggest making the number of retries a config setting. That way, you can ship it as 1 which solves most of the deployments. Then, for those of us that are trouble-shooting, we can set it incrementally higher to see when the problem goes away.
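To make the suggestion concrete, here is a minimal sketch of a configurable retry count. It is purely illustrative: Flintrock has no such option in this thread, and the `ssh_tries` key and the `resolve_ssh_tries` function are hypothetical names. The values 100 and 1 come from the get_ssh_client snippet quoted at the top of the issue.

```python
DEFAULT_SSH_TRIES = 1   # the default that would ship, as in the quoted code
WAIT_SSH_TRIES = 100    # value used when wait=True in the quoted code


def resolve_ssh_tries(config: dict, *, wait: bool) -> int:
    """
    Hypothetical: pick the number of SSH connection attempts.

    `config` stands in for the parsed YAML config as a plain dict, and
    'ssh_tries' is an assumed key name, not an existing Flintrock option.
    """
    if wait:
        return WAIT_SSH_TRIES
    tries = config.get("ssh_tries", DEFAULT_SSH_TRIES)
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(tries, bool) or not isinstance(tries, int) or tries < 1:
        raise ValueError("ssh_tries must be a positive integer")
    return tries
```

Leaving the key out keeps the current behaviour, while anyone troubleshooting can raise it step by step (2, 3, 5, ...) to find the point where launches stop failing.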
That said, I have further oddities that may change your mind on even doing it. :-)
At the risk of falling into the TMI category, I just want to prefix this with the knowledge that we typically spin up a single server to use as our flintrock server. From there we create the clusters. This impacts both discoveries below. (If you'd like a separate ticket for these, let me know.)
I doubt that helps much... But, it's a long explanation of why I wouldn't just change the default setting for everyone.
Hey Steve, thank you for elaborating.
I'd prefer to avoid adding new configs wherever possible, because it adds complexity to the UI and adds backwards-compatibility requirements. There are some places where I've been resisting adding new options where I should probably give way (like allowing users to specify different instance types and spot price settings for the master vs. workers), but in this case I don't see the harm in just upping the default.
If 3 tries (or 5 tries) works for y'all, I'd rather just bump the default and see how that works.
For your second problem with EC2 vs. EMR, please open a new issue here with some technical details so I can help y'all figure out what's going on. It's kinda funny that y'all are using EMR with Flintrock, since one of the reasons someone might use Flintrock is to not have to use EMR! 😄