nchammas / flintrock

A command-line tool for launching Apache Spark clusters.
Apache License 2.0

Various errors when launching a cluster #283

Closed: nchammas closed this issue 5 years ago

nchammas commented 5 years ago

I have constant failures to launch a cluster. With 2 slaves, I get an SSH timeout on one or more machines. Unfortunately, the machines are actually created and so end up costing me, even if I tell Flintrock to destroy them.

Does this intermittent timeout have something to do with the speed of Apache mirrors? That is the only issue surfaced by the --debug switch. BTW, I use the Ubuntu AMI and the "ubuntu" user.

Originally posted by @pferrel in https://github.com/nchammas/flintrock/issues/238#issuecomment-451031832
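For anyone hit by the same leaked-instance cost problem, a sketch like the one below can find and terminate stragglers with boto3. Note that the Name-tag pattern used in the filter (<cluster>-master / <cluster>-slave) is an assumption about how Flintrock names instances, so verify the filter matches your cluster in the EC2 console before terminating anything.

import boto3

# Find instances whose Name tag matches the assumed cluster naming
# pattern and terminate any that are still pending or running.
ec2 = boto3.resource("ec2", region_name="us-east-1")
stragglers = ec2.instances.filter(
    Filters=[
        {"Name": "tag:Name", "Values": ["project-poc-*"]},
        {"Name": "instance-state-name", "Values": ["pending", "running"]},
    ]
)
for instance in stragglers:
    print(f"Terminating {instance.id} ({instance.public_ip_address})")
    instance.terminate()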

nchammas commented 5 years ago

@pferrel - I took the liberty of creating a new issue for you. Please post the details of your issue here and not on #238, per my request:

Please try launching clusters with Amazon Linux. Flintrock doesn't support Ubuntu (#95). If you still have problems, please open a separate issue with the details of what you're seeing.

Please post your config, as well as the output you are seeing. Once you do, I will delete the off-topic comments on #238.

pferrel commented 5 years ago

Switched to Amazon Linux; I still get the errors. At least once, the master came up and I saw 1 of the 2 slave workers in the GUI.

Maclaurin: pat$ flintrock --config project-poc.yaml --debug launch project-poc
2019-01-02 17:06:14,954 - flintrock.flintrock - WARNING - Warning: Downloading Spark from an Apache mirror. Apache mirrors are often slow and unreliable, and typically only serve the most recent releases. We strongly recommend you specify a custom download source. For more background on this issue, please see: https://github.com/nchammas/flintrock/issues/238
2019-01-02 17:06:24,773 - flintrock.ec2       - INFO  - Launching 3 instances...
2019-01-02 17:06:38,275 - flintrock.ec2       - DEBUG - 3 instances not in state 'running': 'i-021a99dfe4b144639', 'i-097f5ce577f2bd1d3', 'i-05924e8c036bf6046', ...
2019-01-02 17:06:44,184 - flintrock.ssh       - DEBUG - [35.173.36.84] SSH exception: [Errno None] Unable to connect to port 22 on 35.173.36.84
2019-01-02 17:06:45,083 - flintrock.ssh       - DEBUG - [34.229.64.159] SSH timeout.
2019-01-02 17:06:45,084 - flintrock.ssh       - DEBUG - [18.206.177.120] SSH timeout.
2019-01-02 17:06:49,948 - flintrock.ssh       - INFO  - [35.173.36.84] SSH online.
2019-01-02 17:06:50,182 - flintrock.ssh       - DEBUG - [18.206.177.120] SSH exception: [Errno None] Unable to connect to port 22 on 18.206.177.120
2019-01-02 17:06:51,007 - flintrock.ssh       - INFO  - [34.229.64.159] SSH online.
2019-01-02 17:06:51,229 - flintrock.core      - INFO  - [35.173.36.84] Configuring ephemeral storage...
2019-01-02 17:06:52,208 - flintrock.core      - INFO  - [34.229.64.159] Configuring ephemeral storage...
2019-01-02 17:06:52,536 - flintrock.core      - INFO  - [35.173.36.84] Installing Java 1.8...
2019-01-02 17:06:53,559 - flintrock.core      - INFO  - [34.229.64.159] Installing Java 1.8...
2019-01-02 17:06:55,993 - flintrock.ssh       - INFO  - [18.206.177.120] SSH online.
2019-01-02 17:06:57,484 - flintrock.core      - INFO  - [18.206.177.120] Configuring ephemeral storage...
2019-01-02 17:06:58,788 - flintrock.core      - INFO  - [18.206.177.120] Installing Java 1.8...
2019-01-02 17:07:05,087 - flintrock.services  - INFO  - [35.173.36.84] Installing Spark...
2019-01-02 17:07:08,206 - flintrock.services  - INFO  - [34.229.64.159] Installing Spark...
2019-01-02 17:07:09,574 - flintrock.services  - INFO  - [18.206.177.120] Installing Spark...
2019-01-02 17:07:22,939 - flintrock.services  - INFO  - [18.206.177.120] Configuring Spark master...
2019-01-02 17:08:55,996 - flintrock.services  - DEBUG - Timed out waiting for Spark master to come up. Trying again...
2019-01-02 17:10:28,973 - flintrock.services  - DEBUG - Timed out waiting for Spark master to come up. Trying again...
2019-01-02 17:12:02,133 - flintrock.services  - DEBUG - Timed out waiting for Spark master to come up.
Do you want to terminate the 3 instances created by this operation? [Y/n]: Y
Terminating instances...
Traceback (most recent call last):
  File "/usr/local/bin/flintrock", line 11, in <module>
    load_entry_point('Flintrock==0.10.0', 'console_scripts', 'flintrock')()
  File "/usr/local/Cellar/flintrock/0.10.0/libexec/lib/python3.7/site-packages/flintrock/flintrock.py", line 1185, in main
    cli(obj={})
  File "/usr/local/Cellar/flintrock/0.10.0/libexec/lib/python3.7/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/Cellar/flintrock/0.10.0/libexec/lib/python3.7/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/local/Cellar/flintrock/0.10.0/libexec/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/Cellar/flintrock/0.10.0/libexec/lib/python3.7/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/Cellar/flintrock/0.10.0/libexec/lib/python3.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/Cellar/flintrock/0.10.0/libexec/lib/python3.7/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/local/Cellar/flintrock/0.10.0/libexec/lib/python3.7/site-packages/flintrock/flintrock.py", line 456, in launch
    tags=ec2_tags)
  File "/usr/local/Cellar/flintrock/0.10.0/libexec/lib/python3.7/site-packages/flintrock/ec2.py", line 53, in wrapper
    res = func(*args, **kwargs)
  File "/usr/local/Cellar/flintrock/0.10.0/libexec/lib/python3.7/site-packages/flintrock/ec2.py", line 955, in launch
    identity_file=identity_file)
  File "/usr/local/Cellar/flintrock/0.10.0/libexec/lib/python3.7/site-packages/flintrock/core.py", line 651, in provision_cluster
    cluster=cluster)
  File "/usr/local/Cellar/flintrock/0.10.0/libexec/lib/python3.7/site-packages/flintrock/services.py", line 407, in configure_master
    raise Exception("Timed out waiting for Spark master to come up.")
Exception: Timed out waiting for Spark master to come up.
Maclaurin: pat$ 
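For context on the final error: Flintrock polls the new master until the Spark standalone master responds, and gives up after a few tries. The sketch below illustrates that kind of health check (polling the master web UI on its default port, 8080); it is not Flintrock's actual code, and the host is just the master IP from the log above.

import time
import urllib.error
import urllib.request

def wait_for_spark_master(host, timeout=90):
    # Poll the standalone master's web UI (default port 8080) until it
    # responds, giving up after `timeout` seconds.
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(f"http://{host}:8080", timeout=5) as r:
                if r.getcode() == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # master not up yet; keep polling
        time.sleep(5)
    return False

if not wait_for_spark_master("18.206.177.120"):
    raise Exception("Timed out waiting for Spark master to come up.")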
pferrel commented 5 years ago

Here is the latest project-poc config file, with only one slave; the run above used the same config but with 2 slaves:

services:
  spark:
    version: 2.3.2
    # git-commit: latest  # if not 'latest', provide a full commit SHA; e.g. d6dc12ef0146ae409834c78737c116050961f350
    # git-repository:  # optional; defaults to https://github.com/apache/spark
    # optional; defaults to download from the official Spark S3 bucket
    #   - must contain a {v} template corresponding to the version
    #   - Spark must be pre-built
    #   - must be a tar.gz file
    # download-source: "https://www.example.com/files/spark/{v}/spark-{v}.tar.gz"
    # executor-instances: 1
  hdfs:
    version: 2.8.4
    # optional; defaults to download from a dynamically selected Apache mirror
    #   - must contain a {v} template corresponding to the version
    #   - must be a .tar.gz file
    # download-source: "https://www.example.com/files/hadoop/{v}/hadoop-{v}.tar.gz"
    # download-source: "http://www-us.apache.org/dist/hadoop/common/hadoop-{v}/hadoop-{v}.tar.gz"

provider: ec2

providers:
  ec2:

    key-name: project-poc
    identity-file: /Users/pat/project/project-poc.pem
    instance-type: r5.xlarge
    region: us-east-1
    availability-zone: us-east-1f
    ami: ami-009d6802948d06e52   # Amazon Linux, us-east-1
    user: ec2-user
    # ami: ami-61bbf104   # CentOS 7, us-east-1
    # user: centos
    # spot-price: <price>
    vpc-id: vpc-43218f39
    subnet-id: subnet-c23760cd
    # placement-group: <name>
    #security-groups:
    #   - ssh
    #   - group-name2
    # instance-profile-name:
    # tags:
    #   - key1,value1
    #   - key2, value2  # leading/trailing spaces are trimmed
    #   - key3,  # value will be empty
    # min-root-ebs-size-gb: <size-gb>
    tenancy: default  # default | dedicated
    ebs-optimized: no  # yes | no
    instance-initiated-shutdown-behavior: terminate  # terminate | stop
    # user-data: /path/to/userdata/script

launch:
  num-slaves: 1
  # install-hdfs: True
  install-spark: True

debug: false
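One note on the commented-out download-source lines above: the {v} placeholder is filled in with the configured version using ordinary string formatting, roughly like this trivial sketch (not Flintrock's exact code):

# Every {v} in the URL template is replaced with the configured version.
template = "https://www.example.com/files/spark/{v}/spark-{v}.tar.gz"
print(template.format(v="2.3.2"))
# -> https://www.example.com/files/spark/2.3.2/spark-2.3.2.tar.gz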
nchammas commented 5 years ago

Are you using the latest release of Flintrock? Please post the version of Flintrock you are using.

Please also try your launch with vanilla Amazon Linux 2. The AMI IDs are listed at the bottom of the Amazon Linux 2 product page.

If you are still having problems with the latest versions of Flintrock + Amazon Linux 2, then on the next launch failure, don't destroy the cluster; instead, SSH into the master and take a look at the Spark logs to see why the master is failing to come up. That should give you some clues as to what's going on.
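If it helps, here is a minimal paramiko sketch for pulling the master log without destroying the cluster. The IP and key file are taken from your output and config above, and the log path assumes Flintrock's default layout (Spark installed under the login user's home directory); adjust both if yours differ.

import paramiko

# Connect to the still-running master and tail the Spark master log.
client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect(
    "18.206.177.120",  # master IP from the launch output above
    username="ec2-user",
    key_filename="/Users/pat/project/project-poc.pem",
)
_, stdout, _ = client.exec_command(
    "tail -n 50 spark/logs/spark-*org.apache.spark.deploy.master.Master*.out"
)
print(stdout.read().decode())
client.close()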

pferrel commented 5 years ago

Sorry, I changed the comment above; yes, I was using the Amazon Linux AMI.

The most recent try brought up one slave successfully, trying 2 now.

pferrel commented 5 years ago

yes, latest 0.10.0

pferrel commented 5 years ago

Adding slaves seems to have worked. I created a cluster with 1 slave, and the GUI at the time (yes, refreshed) showed none working.

I added 2, and now I have 3 slaves running.

If I can believe the GUI, adding the 2 slaves also fixed the previously non-connected slave.

Again, from a newb perspective (I have run Spark often for at least 4 years, but am new to Flintrock), there seem to be timeout issues here.

nchammas commented 5 years ago

Please help me help you by following my debugging instructions.

The latest version of Flintrock is 0.11.0, not 0.10.0. Please try launching a fresh cluster with the latest versions of Flintrock and Amazon Linux 2. Specifically, I recommend ami-0b8d0d6ac70e5750c, which is the latest EBS-backed Amazon Linux 2 AMI (and which Flintrock 0.11.0 configures for you by default).

If you still have Spark master timeout issues with this configuration, please post the contents of the Spark master log as that will give us some clues as to why the Spark master is not coming up.

nchammas commented 5 years ago

Looking back at the logs you posted, it looks like you installed Flintrock via Homebrew, which is a community-supported distribution (i.e. I don't maintain it) and which is unfortunately out of date.

If you install Flintrock via pip / PyPI, you'll get the latest version.

pferrel commented 5 years ago

Ah, yes. I was surprised when I saw the version; I thought the latest was 0.11. I will try tomorrow.

pferrel commented 5 years ago

OK, I switched to 0.11.0 and get the same SSH timeouts.

Here is my config:

services:
  spark:
    version: 2.3.2
    # git-commit: latest  # if not 'latest', provide a full commit SHA; e.g. d6dc12ef0146ae409834c78737c116050961f350
    # git-repository:  # optional; defaults to https://github.com/apache/spark
    # optional; defaults to download from the official Spark S3 bucket
    #   - must contain a {v} template corresponding to the version
    #   - Spark must be pre-built
    #   - must be a tar.gz file
    # download-source: "https://www.example.com/files/spark/{v}/spark-{v}.tar.gz"
    # executor-instances: 1
  hdfs:
    version: 2.8.4
    # optional; defaults to download from a dynamically selected Apache mirror
    #   - must contain a {v} template corresponding to the version
    #   - must be a .tar.gz file
    # download-source: "https://www.example.com/files/hadoop/{v}/hadoop-{v}.tar.gz"
    # download-source: "http://www-us.apache.org/dist/hadoop/common/hadoop-{v}/hadoop-{v}.tar.gz"

provider: ec2

providers:
  ec2:
    key-name: project-poc
    identity-file: /Users/pat/project/project-poc.pem
    instance-type: r5.xlarge
    region: us-east-1
    availability-zone: us-east-1f
    ami: ami-009d6802948d06e52   # Amazon Linux, us-east-1
    user: ec2-user
    # ami: ami-61bbf104   # CentOS 7, us-east-1
    # user: centos
    # spot-price: <price>
    vpc-id: vpc-43218f39
    subnet-id: subnet-c23760cd
    # placement-group: <name>
    #security-groups:
    #   - ssh
    #   - group-name2
    # instance-profile-name:
    # tags:
    #   - key1,value1
    #   - key2, value2  # leading/trailing spaces are trimmed
    #   - key3,  # value will be empty
    # min-root-ebs-size-gb: <size-gb>
    tenancy: default  # default | dedicated
    ebs-optimized: no  # yes | no
    instance-initiated-shutdown-behavior: terminate  # terminate | stop
    # user-data: /path/to/userdata/script

launch:
  num-slaves: 1
  # install-hdfs: True
  install-spark: True

debug: false

CLI + error

Maclaurin:project pat$ flintrock --debug --config project-poc.yaml launch project-poc
2019-01-05 14:02:01,367 - flintrock.flintrock - WARNING - Warning: Downloading Spark from an Apache mirror. Apache mirrors are often slow and unreliable, and typically only serve the most recent releases. We strongly recommend you specify a custom download source. For more background on this issue, please see: https://github.com/nchammas/flintrock/issues/238
2019-01-05 14:02:11,425 - flintrock.ec2       - INFO  - Launching 2 instances...
2019-01-05 14:02:24,695 - flintrock.ec2       - DEBUG - 2 instances not in state 'running': 'i-03f08a79ec7a8e247', 'i-0013331e05de7209b', ...
2019-01-05 14:02:30,518 - flintrock.ssh       - DEBUG - [3.81.28.114] SSH exception: [Errno None] Unable to connect to port 22 on 3.81.28.114
2019-01-05 14:02:31,418 - flintrock.ssh       - DEBUG - [35.175.191.230] SSH timeout.
2019-01-05 14:02:36,294 - flintrock.ssh       - INFO  - [3.81.28.114] SSH online.
Exception: Error reading SSH protocol banner[Errno 54] Connection reset by peer
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/paramiko/transport.py", line 2138, in _check_banner
    buf = self.packetizer.readline(timeout)
  File "/usr/local/lib/python3.7/site-packages/paramiko/packet.py", line 367, in readline
    buf += self._read_timeout(timeout)
  File "/usr/local/lib/python3.7/site-packages/paramiko/packet.py", line 561, in _read_timeout
    x = self.__socket.recv(128)
ConnectionResetError: [Errno 54] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/paramiko/transport.py", line 1966, in run
    self._check_banner()
  File "/usr/local/lib/python3.7/site-packages/paramiko/transport.py", line 2143, in _check_banner
    "Error reading SSH protocol banner" + str(e)
paramiko.ssh_exception.SSHException: Error reading SSH protocol banner[Errno 54] Connection reset by peer

2019-01-05 14:02:37,672 - flintrock.core      - INFO  - [3.81.28.114] Configuring ephemeral storage...
2019-01-05 14:02:39,165 - flintrock.core      - INFO  - [3.81.28.114] Installing Java 1.8...
2019-01-05 14:02:49,926 - flintrock.services  - INFO  - [3.81.28.114] Installing Spark...
Do you want to terminate the 2 instances created by this operation? [Y/n]: Y
Terminating instances...
[35.175.191.230] SSH protocol error. Possible causes include using the wrong key file or username.

This happens almost every time I try to launch. Yes, I am using the Apache mirrors for downloads, but this failure occurs before any download, since SSH has to be working before anything can be downloaded.

In my experience doing this by hand, it takes much longer for SSH to become active on an instance than Flintrock allows for. Is there some way to change the SSH timeout, or to insert a delay before the connection is attempted?
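(For anyone who wants to measure that, a plain-stdlib probe like the following times how long port 22 takes to open on a fresh instance; the IP is a placeholder taken from the log above.)

import socket
import time

host = "3.81.28.114"  # placeholder: a freshly launched instance
start = time.time()
while True:
    try:
        # Succeeds only once sshd is accepting connections.
        socket.create_connection((host, 22), timeout=3).close()
        break
    except OSError:
        time.sleep(2)
print(f"port 22 open after {time.time() - start:.0f}s")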

nchammas commented 5 years ago

This issue is different from the one you were reporting earlier. Before, the Spark master was having trouble coming up. Now, SSH is failing to connect with a specific banner error. The latter suggests an issue with the AMI or user you have configured, as communicated in Flintrock's error message. This is almost certainly not related to the wait timeout.

Please try the AMI I suggested earlier before trying your own setup. It will help narrow down what’s going on. In fact, please try the default config Flintrock provides by deleting any existing config files you have and then calling flintrock configure.

The default config works and is tested before every release. Start there and carefully work your way from there to the setup you actually want.

We're spending a lot of time going back and forth because we don't have a common baseline to start testing from. Starting with that common baseline will cut down on further back and forth.

pferrel commented 5 years ago

I have used the standard Amazon Linux AMI and the ec2-user user. The key is obviously working. I ran flintrock configure before customizing as shown above.

I suppose you can see above that there are timeouts before what I assume is a retry eventually works. Maybe something in my AWS config causes slower initialization? I have to install in the correct subnet and VPC, so I can try the default config, but I will have to move back to this setup if I'm going to use Flintrock.

I'm conversant enough in Python (barely) to add delays or retries if you can point me in the right direction.

Thanks in any case.

nchammas commented 5 years ago

SSH timeouts are normal when an instance is coming up. Every successful cluster launch will show that in its debug output. What's not normal is the SSH banner error.

paramiko.ssh_exception.SSHException: Error reading SSH protocol banner[Errno 54] Connection reset by peer

This error suggests there is something wrong with your SSH key, your SSH user, or the AMI. Adding more retries or longer delays between tries won't help. Flintrock retries connecting only when there is an SSH timeout. Retrying when another error (like this banner error) is thrown doesn't make sense. If you want to try anyway, you can play around with the connection code in Flintrock's ssh module, but I don't think you'll get anywhere by doing that.
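To make the distinction concrete, the retry behavior works roughly like this simplified sketch of the pattern (not Flintrock's actual code):

import socket
import paramiko

def connect_with_retries(host, user, key_file, max_tries=30):
    for _ in range(max_tries):
        client = paramiko.SSHClient()
        client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        try:
            client.connect(host, username=user, key_filename=key_file, timeout=3)
            return client
        except socket.timeout:
            continue  # instance still booting; retrying makes sense
        except paramiko.SSHException:
            # e.g. "Error reading SSH protocol banner": something is
            # actually wrong (key, user, or AMI), so retrying won't help.
            raise
    raise Exception(f"Could not connect to {host}")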

Note, by the way, that this error is different from what you were originally reporting. Originally, you were reporting this problem:

Exception: Timed out waiting for Spark master to come up.

This happens much further along in the launch process, meaning Flintrock was able to connect to the instances fine and do most of its work. Somewhere between when you saw this Spark master error and when you saw the SSH banner error, you must have changed something about your config or setup. I doubt upgrading Flintrock from 0.10.0 to 0.11.0 is what changed one error to the other.

This conversation must be frustrating for you. It certainly is for me. It is difficult for me to help when you aren't using the vanilla Amazon Linux 2 AMI and default Flintrock configuration I am suggesting. Starting from a known good configuration and working from there to the desired configuration is a basic debugging tactic. Unless we apply this tactic, I don't think I can help you any further.

pferrel commented 5 years ago

I AM using flintrock 0.11.0 and the Vanilla Amazon Linux AMI. Please read the report above. This is the AMI they recommend when you launch any instance. I took the AMI ID from the launch instance dialog by selecting the Vanilla Amazon Linux option and copying the AMI ID.

In any case, this config WORKS, but intermittently, so there is nothing wrong with the AMI, user, or key. The fault must be in the timing since, as I say, sometimes it works.

Also, I'm not sure why you changed the name of the issue, since the error reported in the CLI is an SSH timeout. That is how people will search for the solution or suggestions.

nchammas commented 5 years ago

I suggested ami-0b8d0d6ac70e5750c, which is also the default AMI Flintrock uses. The configs you posted above show you are using a different AMI. ~Actually, the AMI you appear to be using is instance store-backed, not EBS-backed.~ Just FYI, Flintrock is only tested with EBS-backed Amazon Linux AMIs.
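If you want to check whether a given AMI is EBS-backed, a minimal boto3 query does it (assuming default AWS credentials; the AMI ID is the one from your config):

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
image = ec2.describe_images(ImageIds=["ami-009d6802948d06e52"])["Images"][0]
# 'ebs' for EBS-backed AMIs, 'instance-store' otherwise.
print(image["RootDeviceType"])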

I changed the issue title because, as I pointed out in my previous message, you initially reported Spark master timeouts, which is a different issue from the SSH timeouts we are now talking about.

I am also still confused how the error changed from the Spark master timeouts to the SSH timeouts.

pferrel commented 5 years ago

Notice the AMI ID in my config is the same as here

[screenshot of the EC2 launch wizard showing AMI ami-009d6802948d06e52]

pferrel commented 5 years ago

The default AMI for Flintrock is not available for my zone. Further, the link in the README.md to available AMIs is no longer active.

nchammas commented 5 years ago

The default AMI for Flintrock is not available for my zone.

ami-009d6802948d06e52, which you are using, and ami-0b8d0d6ac70e5750c, which I am recommending, are both Amazon Linux 2 AMIs provided by Amazon for us-east-1, so I am not sure how you can use one but not the other. What zone are you working in?

Further the link in the README.md to available AMIs is no longer active.

What link?
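As an aside, a region-portable way to look up the current Amazon Linux 2 AMI is AWS's public SSM parameter, which sidesteps dead links and per-region ID differences. A minimal boto3 sketch:

import boto3

ssm = boto3.client("ssm", region_name="us-east-1")
param = ssm.get_parameter(
    Name="/aws/service/ami-amazon-linux-latest/amzn2-ami-hvm-x86_64-gp2"
)
print(param["Parameter"]["Value"])  # prints the current ami-... ID for the region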

pferrel commented 5 years ago

This seems to be resolved by using the specific AMI: ami-0b8d0d6ac70e5750c

nchammas commented 5 years ago

OK, that's good. If you now have a working baseline, it should be easier to identify what specifically breaks things by introducing changes to that working baseline one by one. Knowing that Flintrock works with the configuration I've recommended at least rules out issues with your VPC, subnet, or security groups, which is good.

If keeping everything the same and simply changing the AMI is what breaks things, then we can home in on that to understand why. But it's critical to know that the AMI is the only thing that has changed from the working baseline to break things.

If you'd like to keep debugging this, please provide an update on what exactly broke things from the working baseline and I'd be happy to reopen this issue.

olivierdesclaux commented 3 years ago

Hello, I am trying to launch a cluster of m5.xlarge instances and I am running into the same issue. I changed the default AMI to the one you suggested (ami-0b8d0d6ac70e5750c) and it still doesn't work. Below are snapshots of the error messages that appear on the console.

[screenshot of console error messages]

I would be very grateful for your help. Thank you in advance.