nchammas / flintrock

A command-line tool for launching Apache Spark clusters.
Apache License 2.0
637 stars 116 forks

cluster launch fail randomly #198

Closed pragnesh closed 6 years ago

pragnesh commented 7 years ago
ubuntu@ip-172-30-0-42:/flintrock_config$ flintrock --config config.yaml launch test                                                                           
Requesting 2 spot instances at a max price of $0.5...
0 of 2 instances granted. Waiting...
All 2 instances granted.
[54.179.153.219] SSH online.
[54.179.153.219] Configuring ephemeral storage...
[54.179.153.219] Installing Java 1.8...
[54.179.153.219] Installing HDFS...
[52.77.209.225] SSH online.
[52.77.209.225] Configuring ephemeral storage...
[52.77.209.225] Installing Java 1.8...
[54.179.153.219] Installing Spark...
[52.77.209.225] Installing HDFS...
[52.77.209.225] Installing Spark...
Do you want to terminate the 2 instances created by this operation? [Y/n]: Y
Terminating instances...
[ec2-54-179-153-219.ap-southeast-1.compute.amazonaws.com] Could not connect via SSH.
ubuntu@ip-172-30-0-42:/flintrock_config$ flintrock --config config.yaml launch test
Requesting 2 spot instances at a max price of $0.5...
0 of 2 instances granted. Waiting...
All 2 instances granted.
[52.221.209.187] SSH online.
[52.221.209.187] Configuring ephemeral storage...
[52.221.209.187] Installing Java 1.8...
[54.255.211.120] SSH online.
[54.255.211.120] Configuring ephemeral storage...
[54.255.211.120] Installing Java 1.8...
[52.221.209.187] Installing HDFS...
[54.255.211.120] Installing HDFS...
[52.221.209.187] Installing Spark...
[54.255.211.120] Installing Spark...
[172.30.0.17] Configuring HDFS master...
[172.30.0.17] Configuring Spark master...
HDFS online.
Spark Health Report:
  * Master: ALIVE
  * Workers: 2
  * Cores: 8
  * Memory: 57.9 GB            
launch finished in 0:02:22.
Cluster master: ec2-54-255-211-120.ap-southeast-1.compute.amazonaws.com
Login with: flintrock login test
ubuntu@ip-172-30-0-42:/flintrock_config/$ 
pragnesh commented 7 years ago

Since I am launching the Flintrock cluster from an EC2 instance in the same region, I have noticed that the public DNS name switches from resolving to the public IP address to the private IP address while the cluster is launching. This looks like the reason it fails to connect even after Flintrock has already finished some steps: the IP address switches mid-launch. Looking at the code, in the get_ssh_client function the number of tries is set to 1 when the wait parameter is false, which seems quite low to me.

def get_ssh_client(
        *,
        user: str,
        host: str,
        identity_file: str,
        wait: bool=False,
        print_status: bool=None) -> paramiko.client.SSHClient:
    """
    Get an SSH client for the provided host, waiting as necessary for SSH to become
    available.
    """
    if print_status is None:
        print_status = wait

    client = paramiko.client.SSHClient()

    client.load_system_host_keys()
    client.set_missing_host_key_policy(paramiko.client.AutoAddPolicy())

    if wait:
        tries = 100
    else:
        tries = 1
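To illustrate why `tries = 1` is fragile, here is a simplified sketch of a bounded SSH retry loop. This is an illustration, not flintrock's actual implementation; `connect_fn` is a hypothetical stand-in for the paramiko connect call:

```python
import time

def connect_with_retries(connect_fn, tries: int, delay: float = 5.0):
    """Call connect_fn() until it succeeds or the retry budget runs out.

    connect_fn stands in for something like paramiko's
    client.connect(hostname=..., username=..., key_filename=...).
    """
    last_error = None
    for _ in range(tries):
        try:
            return connect_fn()
        except OSError as e:  # covers ConnectionError, TimeoutError, socket errors
            last_error = e
            time.sleep(delay)
    raise last_error
```

With `tries=1`, a single refused connection during a brief DNS or networking hiccup is fatal; a higher budget rides out the transient window.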
nchammas commented 7 years ago

How is your VPC setup? Flintrock will definitely get confused if the reported address flip-flops between public and private.

Also, what do you see when this happens with --debug enabled?

pragnesh commented 7 years ago

We use a VPC created by an EMR job, so I don't believe the VPC has an issue, since we run an EMR job daily along with the Flintrock job. I am not sure exactly what you want to know about the VPC setup.

I haven't tried the --debug flag; I will try it out and post the log.

pragnesh commented 7 years ago

Here is the log. It did not fail this time, but I can see that it consistently switches to the private IP address when it goes to configure the HDFS and Spark masters. We did not have this issue earlier; we started seeing it after a recent update.

ubuntu@ip-172-30-0-42:/flintrock_config$ flintrock --debug --config config.yaml launch test                                        
2017-05-06 04:06:31,159 - flintrock.ec2       - INFO  - Requesting 2 spot instances at a max price of $0.5...
2017-05-06 04:06:31,586 - flintrock.ec2       - INFO  - 0 of 2 instances granted. Waiting...
2017-05-06 04:07:01,755 - flintrock.ec2       - INFO  - All 2 instances granted.
2017-05-06 04:07:12,780 - flintrock.ssh       - DEBUG - [54.254.157.196] SSH exception: [Errno None] Unable to connect to port 22 on 54.254.157.196
2017-05-06 04:07:12,780 - flintrock.ssh       - DEBUG - [54.169.232.150] SSH exception: [Errno None] Unable to connect to port 22 on 54.169.232.150
2017-05-06 04:07:17,873 - flintrock.ssh       - INFO  - [54.254.157.196] SSH online.
2017-05-06 04:07:17,935 - flintrock.ssh       - INFO  - [54.169.232.150] SSH online.
2017-05-06 04:07:18,002 - flintrock.core      - INFO  - [54.254.157.196] Configuring ephemeral storage...
2017-05-06 04:07:18,134 - flintrock.core      - INFO  - [54.169.232.150] Configuring ephemeral storage...
2017-05-06 04:07:18,217 - flintrock.core      - INFO  - [54.254.157.196] Installing Java 1.8...
2017-05-06 04:07:18,367 - flintrock.core      - INFO  - [54.169.232.150] Installing Java 1.8...
2017-05-06 04:07:22,517 - flintrock.services  - INFO  - [54.254.157.196] Installing HDFS...
2017-05-06 04:07:24,193 - flintrock.services  - INFO  - [54.169.232.150] Installing HDFS...
2017-05-06 04:07:31,739 - flintrock.services  - INFO  - [54.254.157.196] Installing Spark...
2017-05-06 04:07:32,771 - flintrock.services  - INFO  - [54.169.232.150] Installing Spark...
2017-05-06 04:08:00,302 - flintrock.services  - INFO  - [172.30.0.177] Configuring HDFS master...
2017-05-06 04:08:17,865 - flintrock.services  - INFO  - [172.30.0.177] Configuring Spark master...
2017-05-06 04:08:45,937 - flintrock.services  - INFO  - HDFS online.
2017-05-06 04:08:45,997 - flintrock.services  - INFO  - Spark Health Report:
  * Master: ALIVE
  * Workers: 2
  * Cores: 8
  * Memory: 57.9 GB            
2017-05-06 04:08:46,001 - flintrock.ec2       - INFO  - launch finished in 0:02:19.
pragnesh commented 7 years ago

Here is the debug log from a run where it failed; when I immediately tried again, it succeeded:

ubuntu@ip-172-30-0-42:/flintrock_config$ flintrock --debug --config config.yaml launch test
2017-05-09 04:02:23,129 - flintrock.ec2       - INFO  - Requesting 2 spot instances at a max price of $0.5...
2017-05-09 04:02:23,461 - flintrock.ec2       - INFO  - 0 of 2 instances granted. Waiting...
2017-05-09 04:02:53,618 - flintrock.ec2       - INFO  - All 2 instances granted.
2017-05-09 04:03:04,470 - flintrock.ssh       - DEBUG - [54.254.195.178] SSH exception: [Errno None] Unable to connect to port 22 on 54.254.195.178
2017-05-09 04:03:04,470 - flintrock.ssh       - DEBUG - [54.255.229.239] SSH exception: [Errno None] Unable to connect to port 22 on 54.255.229.239
2017-05-09 04:03:09,476 - flintrock.ssh       - DEBUG - [54.255.229.239] SSH exception: [Errno None] Unable to connect to port 22 on 54.255.229.239
2017-05-09 04:03:09,476 - flintrock.ssh       - DEBUG - [54.254.195.178] SSH exception: [Errno None] Unable to connect to port 22 on 54.254.195.178
2017-05-09 04:03:14,482 - flintrock.ssh       - DEBUG - [54.254.195.178] SSH exception: [Errno None] Unable to connect to port 22 on 54.254.195.178
2017-05-09 04:03:14,587 - flintrock.ssh       - INFO  - [54.255.229.239] SSH online.
2017-05-09 04:03:14,811 - flintrock.core      - INFO  - [54.255.229.239] Configuring ephemeral storage...
2017-05-09 04:03:15,063 - flintrock.core      - INFO  - [54.255.229.239] Installing Java 1.8...
2017-05-09 04:03:19,594 - flintrock.ssh       - INFO  - [54.254.195.178] SSH online.
2017-05-09 04:03:19,862 - flintrock.core      - INFO  - [54.254.195.178] Configuring ephemeral storage...
2017-05-09 04:03:20,124 - flintrock.core      - INFO  - [54.254.195.178] Installing Java 1.8...
2017-05-09 04:03:21,299 - flintrock.services  - INFO  - [54.255.229.239] Installing HDFS...
2017-05-09 04:03:26,365 - flintrock.services  - INFO  - [54.254.195.178] Installing HDFS...
2017-05-09 04:03:30,574 - flintrock.services  - INFO  - [54.255.229.239] Installing Spark...
2017-05-09 04:03:35,975 - flintrock.services  - INFO  - [54.254.195.178] Installing Spark...
2017-05-09 04:04:06,200 - flintrock.ssh       - DEBUG - [ec2-54-254-195-178.ap-southeast-1.compute.amazonaws.com] SSH timeout.
Do you want to terminate the 2 instances created by this operation? [Y/n]: y
Terminating instances...
[ec2-54-254-195-178.ap-southeast-1.compute.amazonaws.com] Could not connect via SSH.
ubuntu@ip-172-30-0-42:/flintrock_config$ flintrock --debug --config config.yaml launch test
2017-05-09 04:04:45,036 - flintrock.ec2       - INFO  - Requesting 2 spot instances at a max price of $0.5...
2017-05-09 04:04:45,358 - flintrock.ec2       - INFO  - 0 of 2 instances granted. Waiting...
2017-05-09 04:05:15,502 - flintrock.ec2       - INFO  - All 2 instances granted.
2017-05-09 04:05:26,338 - flintrock.ssh       - DEBUG - [13.228.27.57] SSH exception: [Errno None] Unable to connect to port 22 on 13.228.27.57
2017-05-09 04:05:26,338 - flintrock.ssh       - DEBUG - [13.228.25.123] SSH exception: [Errno None] Unable to connect to port 22 on 13.228.25.123
2017-05-09 04:05:31,344 - flintrock.ssh       - DEBUG - [13.228.25.123] SSH exception: [Errno None] Unable to connect to port 22 on 13.228.25.123
2017-05-09 04:05:31,427 - flintrock.ssh       - INFO  - [13.228.27.57] SSH online.
2017-05-09 04:05:31,541 - flintrock.core      - INFO  - [13.228.27.57] Configuring ephemeral storage...
2017-05-09 04:05:31,723 - flintrock.core      - INFO  - [13.228.27.57] Installing Java 1.8...
2017-05-09 04:05:36,436 - flintrock.ssh       - INFO  - [13.228.25.123] SSH online.
2017-05-09 04:05:36,552 - flintrock.core      - INFO  - [13.228.25.123] Configuring ephemeral storage...
2017-05-09 04:05:36,767 - flintrock.core      - INFO  - [13.228.25.123] Installing Java 1.8...
2017-05-09 04:05:38,253 - flintrock.services  - INFO  - [13.228.27.57] Installing HDFS...
2017-05-09 04:05:40,975 - flintrock.services  - INFO  - [13.228.25.123] Installing HDFS...
2017-05-09 04:05:47,262 - flintrock.services  - INFO  - [13.228.27.57] Installing Spark...
2017-05-09 04:05:49,504 - flintrock.services  - INFO  - [13.228.25.123] Installing Spark...
2017-05-09 04:06:16,923 - flintrock.services  - INFO  - [172.30.0.152] Configuring HDFS master...
2017-05-09 04:06:35,650 - flintrock.services  - INFO  - [172.30.0.152] Configuring Spark master...
2017-05-09 04:07:03,315 - flintrock.services  - INFO  - HDFS online.
2017-05-09 04:07:03,389 - flintrock.services  - INFO  - Spark Health Report:
  * Master: ALIVE
  * Workers: 2
  * Cores: 8
  * Memory: 57.9 GB            
2017-05-09 04:07:03,393 - flintrock.ec2       - INFO  - launch finished in 0:02:22.
Cluster master: ec2-13-228-27-57.ap-southeast-1.compute.amazonaws.com
Login with: flintrock login test
ubuntu@ip-172-30-0-42:/flintrock_config$
nchammas commented 7 years ago

Hmm, this is strange and I am not sure why it would happen. Is anything about the EMR VPC changing while Flintrock is doing its work? For some reason when Flintrock queries the master IP here it occasionally gets a private IP address.

pragnesh commented 7 years ago

No, nothing is changing with the EMR VPC while Flintrock is launching the cluster.

I think that when someone launches an EC2 instance inside a VPC with a public IP, and you resolve its public DNS name from within the same VPC, initially it resolves to the public IP address, but after a minute or so it switches to the private IP address.

I have increased the default number of tries from 1 to 5 in flintrock/ssh.py. After this change I haven't seen a failed launch.
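This theory can be probed by resolving the instance's public DNS name from inside the VPC and classifying the answer. A small sketch (the resolution step is commented out because it needs a live hostname; the RFC 1918 check below it is standard):

```python
import ipaddress
import socket

def is_private(ip: str) -> bool:
    """True for RFC 1918 / link-local addresses (e.g. 172.30.x.x)."""
    return ipaddress.ip_address(ip).is_private

# Usage from inside the VPC -- substitute a live public DNS name:
# ip = socket.gethostbyname("ec2-54-254-195-178.ap-southeast-1.compute.amazonaws.com")
# print(ip, "private" if is_private(ip) else "public")
```

For what it's worth, resolving to the private IP from inside the VPC is documented AWS behavior when the VPC has DNS hostnames enabled: Amazon's resolver returns the private IP for an instance's public DNS name when queried from within the same VPC, and the public IP from outside.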

nchammas commented 6 years ago

Closing this issue since @pragnesh has a workaround and since I couldn't get to a root cause.

steve-drew-strong-bridge commented 6 years ago

Can you make this a configuration option within the yaml file so we don't have to find and change the ssh.py file each time we install flintrock on a new instance?

nchammas commented 6 years ago

@steve-drew-strong-bridge - Not sure what option specifically you're asking for. Can you clarify?

You shouldn't need to do anything when launching a cluster if your VPC is setup correctly, has an Internet gateway attached, and assigns public IPs.
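Those VPC preconditions (attached Internet gateway, subnet that assigns public IPs) can be checked programmatically. A sketch over boto3-style response dicts; the helper name is illustrative, and in practice the inputs would come from `ec2.describe_internet_gateways` (filtered on `attachment.vpc-id`) and `ec2.describe_subnets`:

```python
def vpc_ready_for_flintrock(igw_response: dict, subnet: dict) -> bool:
    """Given boto3-style describe_* responses, check that the VPC has an
    attached Internet gateway and the subnet auto-assigns public IPs."""
    has_igw = bool(igw_response.get("InternetGateways"))
    assigns_public_ip = subnet.get("MapPublicIpOnLaunch", False)
    return has_igw and assigns_public_ip
```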

steve-drew-strong-bridge commented 6 years ago

Sure @nchammas, apologies for the lack of clarity. We still seem to battle with the SSH connection issues. In most cases, if we locate the "tries = 1" section of ssh.py and set it to 5 as suggested in this thread, we are able to launch clusters.

The ask here was to make 'tries' an option in the config file, so that we could just update the YAML files we deploy instead of locating the ssh.py script after each install of Flintrock. It was a lazy request. :-)

FYI: about 1 in 5 of the clusters we spin up with Flintrock continues to hit the SSH connection errors while trying to install Java during flintrock launch. We're still trying to figure out exactly what's happening there, but we do note (as pointed out here) that the IP address switches from the internal IP address to the external IP address when it fails.

nchammas commented 6 years ago

@steve-drew-strong-bridge - Thanks for elaborating. I suppose until we have more clarity on why Flintrock sometimes sees these private IPs, perhaps it's easiest to just set the default to 3 tries. Or does it really need to be 5? I'd prefer a lower count so that when there is a real issue, the user doesn't have to wait long to find out.

steve-drew-strong-bridge commented 6 years ago

@nchammas, I know it's been a few days, but I'm still trying to track down these odd failures. Regarding the default number of tries, I have a different suggestion. While it's slightly more work, I'd suggest making the number of retries a config setting. That way, you can ship it as 1, which covers most deployments; then those of us who are troubleshooting can set it incrementally higher to see when the problem goes away.

That said, I have further oddities that may change your mind on even doing it. :-)

At the risk of falling into the TMI category, I just want to prefix this with the knowledge that we typically spin up a single server to use as our flintrock server. From there we create the clusters. This impacts both discoveries below. (If you'd like a separate ticket for these, let me know.)

1 - The longer the server is up and running, the lower we can set the retries. We may have it around 5 for the first day the server has been running, but after a week it is 1 and never seems to fail. (This is why I'd make it a config setting.)

2 - Over the past two days, I've been working with another dev who just can't seem to get a cluster to work. We've run multiple scenarios, but what it seems to come down to is that his single server was created using plain EC2, while I created mine via EMR. We've run multiple tests with him connecting to my EMR instance and spinning up clusters (successfully) and then using the same YAML file on his EC2 server (unsuccessfully). We have spun up multiple EMR servers and generated clusters successfully, but not a single EC2 server has worked, regardless of how high we set the retry count. So, we're proceeding with the EMR server to step around the issue, though I understand his complaint that it costs slightly more.

I doubt that helps much... But, it's a long explanation of why I wouldn't just change the default setting for everyone.

nchammas commented 6 years ago

Hey Steve, thank you for elaborating.

I'd prefer to avoid adding new configs wherever possible, because it adds complexity to the UI and adds backwards-compatibility requirements. There are some places where I've been resisting adding new options where I should probably give way (like allowing users to specify different instance types and spot price settings for the master vs. workers), but in this case I don't see the harm in just upping the default.

If 3 tries (or 5 tries) works for y'all, I'd rather just bump the default and see how that works.

For your second problem with EC2 vs. EMR, please open a new issue here with some technical details so I can help y'all figure out what's going on. It's kinda funny that y'all are using EMR with Flintrock, since one of the reasons someone might use Flintrock is to not have to use EMR! 😄