nchammas / flintrock

A command-line tool for launching Apache Spark clusters.
Apache License 2.0

OSX Yosemite 200 node cluster throws paramiko.rsakey.RSAKey object ERROR #81

Closed dhulse closed 8 years ago

dhulse commented 8 years ago

I have been facing the same issue reported at https://github.com/nchammas/flintrock/issues/78 when I attempt to launch a 200-node spot instance cluster.

I followed http://blog.mact.me/2014/10/22/yosemite-upgrade-changes-open-file-limit to increase my file descriptor limit. This removed the occasional "Too many open files" error. However, the RSAKey object error still persists.

('54.80.115.229', <paramiko.rsakey.RSAKey object at 0x142d488d0>, <paramiko.rsakey.RSAKey object at 0x142d29e80>)

I have an Ubuntu VM that does not run into this issue; it is able to create a 200-node cluster without a problem.

nchammas commented 8 years ago

Thanks for the report. Quick questions:

As you saw, I wasn't able to get to the bottom of the issue reported in #78, so any information you can provide here that narrows down what may be causing it will be valuable.

dhulse commented 8 years ago

I was on master until the release of 0.2.0; now I am running Flintrock from the pip install of 0.2.0. I have seen this issue both on master (before 0.2.0) and on 0.2.0.

Each time I attempt to launch a 200-node cluster on OS X, I see one of the two errors below.

Either

('54.80.115.229', <paramiko.rsakey.RSAKey object at 0x142d488d0>, <paramiko.rsakey.RSAKey object at 0x142d29e80>)

or

[54.144.28.107] Installing Spark...
[54.198.68.196] Installing Spark...
[54.224.5.106] Installing Spark...
[54.81.174.2] Installing Spark...
[54.166.206.188] Installing Spark...
[54.159.109.137] Installing Spark...
[54.198.160.59] Installing Spark...
[54.161.112.198] Installing Spark...
[54.198.198.125] Installing Spark...
[107.22.39.241] Installing Spark...
[54.80.97.205] Installing Spark...
[54.144.59.175] Installing Spark...
[23.20.213.129] Installing Spark...
[54.144.3.204] Installing Spark...
[54.80.80.222] Installing Spark...
[54.92.154.121] Installing Spark...
[54.198.115.182] Installing Spark...
[54.161.83.226] Installing Spark...
[54.167.217.170] Installing Spark...
[54.167.217.125] Installing Spark...
Error: Failed to install Spark.
Installing Spark...
  version: 1.6.0
  distribution: hadoop2.6
At least one node raised an error: Installing Spark...
  version: 1.6.0
  distribution: hadoop2.6
Do you want to terminate the 201 instances created by this operation? [Y/n]:

I do not hit this same issue when I launch a smaller cluster (the only smaller cluster I have launched at this point is one with --num-slaves 5).

nchammas commented 8 years ago

OK, I am able to reproduce the Spark install error when I spin up a cluster with 200 slaves. For the record, could you update your comment with full stack traces or a little more context?

For example, I get:

Error: Failed to install Spark.
[Errno 24] Too many open files: '.../flintrock/flintrock/scripts/install-spark.sh'

I don't see this when I spin up clusters with 50 or even 100 slaves, so it's clearly a resource usage issue. Enabling ResourceWarning confirms this.
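
Here is a minimal sketch of how ResourceWarning can be surfaced (CPython hides it by default); this is illustrative only, not something Flintrock does itself, and running under python -W always::ResourceWarning works too:

import warnings

# Surface ResourceWarning, which CPython suppresses by default.
warnings.simplefilter('always', ResourceWarning)

# Letting a file object be garbage-collected without closing it now
# emits a ResourceWarning pointing at the unclosed file.
open('/etc/hosts')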

This is going to take some time to dig into. I probably need to do a better job of reusing files and closing handles so that Flintrock doesn't hit these limits when we launch really large clusters.
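
To illustrate the kind of pattern I mean (hypothetical code, not Flintrock's actual implementation; the function and path names are made up): keeping one handle open per node exhausts the descriptor limit on large clusters, while reading the script once and reusing its contents does not.

def payload_per_node(nodes, path='install-spark.sh'):
    handles = [open(path) for _ in nodes]   # 200 nodes -> 200 open handles at once
    return [h.read() for h in handles]      # never explicitly closed

def payload_shared(nodes, path='install-spark.sh'):
    with open(path) as f:                   # opened once, closed promptly
        script = f.read()
    return [script for _ in nodes]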

cc @engrean - You may be interested in following this issue.

nchammas commented 8 years ago

Hey @dhulse, what do you get when you run this on OS X vs. Linux?

import resource
resource.getrlimit(resource.RLIMIT_NOFILE)

getrlimit() returns a tuple of (soft, hard) limits for that resource.

If you get a higher value for the soft limit on Linux, that may explain why you don't see this issue there.

nchammas commented 8 years ago

For example, these are the results I get for how many open file descriptors a process can have by default:

|                   | soft limit | hard limit          |
|-------------------|-----------:|--------------------:|
| OS X              | 256        | 9223372036854775807 |
| Fedora Linux (VM) | 1024       | 4096                |

dhulse commented 8 years ago

Here are my results for getrlimit()

|                   | soft limit | hard limit |
|-------------------|-----------:|-----------:|
| OS X Yosemite     | 65536      | 65536      |
| Ubuntu Linux (VM) | 262144     | 262144     |

dhulse commented 8 years ago

The higher soft limit would explain why it is working on my Ubuntu VM. I changed my soft and hard limits to 262144 on OS X Yosemite and was able to launch a 200-node cluster.

So it would appear that a soft limit of 65536 is not sufficient to create a 200-node cluster.

nchammas commented 8 years ago

I'm surprised your default soft limits are so high.

After seeing that my soft limit was 256, I thought 2048 or even 1024 would be plenty for a 200-node cluster since I've been launching 100-node clusters on OS X without issue at the 256 limit. :confused:

nchammas commented 8 years ago

I'll have to track down why Flintrock consumes so many file handles... This smells wrong, though I'm glad we have a workaround for this issue.

nchammas commented 8 years ago

@dhulse - I've been working on a version of Flintrock that uses a different SSH library and concurrency model. Some basic testing shows that it uses fewer file handles (which include the sockets used for SSH connections) during launches.

I'm curious if this version lets you launch a 200-node cluster with your original limits. If you're interested in giving it a shot, you can install it with this:

pip install git+https://github.com/nchammas/flintrock@asyncssh

You'll need Python 3.5+ for this. This may become part of Flintrock in 0.3, depending on whether it helps solve problems like this one, so I'm interested in seeing whether it works for you.
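
For a rough idea of the concurrency model on that branch, here is a minimal sketch (not Flintrock's actual code; the hosts and command are placeholders) of running a command on many hosts concurrently with asyncio and AsyncSSH:

import asyncio
import asyncssh

async def run_on_host(host):
    # Real usage would pass a username and client key; host key checking
    # is disabled here only to keep the example short.
    async with asyncssh.connect(host, known_hosts=None) as conn:
        result = await conn.run('echo ready')
        return host, result.stdout.strip()

async def run_on_all(hosts):
    # One coroutine per host on a single thread and event loop, so each
    # connection costs a socket but not an extra thread.
    return await asyncio.gather(*[run_on_host(h) for h in hosts])

hosts = ['54.80.115.229', '54.144.28.107']  # placeholder IPs
loop = asyncio.get_event_loop()
print(loop.run_until_complete(run_on_all(hosts)))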

dhulse commented 8 years ago

I ran the pip install you sent and attempted to launch a 200-node cluster with two different file limit settings. For the first attempt, I raised my file limits by following this link: http://blog.mact.me/2014/10/22/yosemite-upgrade-changes-open-file-limit. For the second attempt, I used the default OS X Yosemite file limit settings.

Here is the version of Flintrock I used:

flintrock --version
flintrock, version 0.3.0.dev0

Attempt 1

|               | soft limit | hard limit |
|---------------|-----------:|-----------:|
| OS X Yosemite | 65536      | 65536      |

Using my custom file limit settings, I was able to launch a 200-node cluster successfully.

Attempt 2

|               | soft limit | hard limit          |
|---------------|-----------:|--------------------:|
| OS X Yosemite | 256        | 9223372036854775807 |

Using the default OS X Yosemite file limit settings, I ran into an error immediately after all the instances were granted. Here is the output:

Requesting 201 spot instances at a max price of $0.2...
0 of 201 instances granted. Waiting...
0 of 201 instances granted. Waiting...
0 of 201 instances granted. Waiting...
0 of 201 instances granted. Waiting...
0 of 201 instances granted. Waiting...
0 of 201 instances granted. Waiting...
63 of 201 instances granted. Waiting...
200 of 201 instances granted. Waiting...
200 of 201 instances granted. Waiting...
200 of 201 instances granted. Waiting...
200 of 201 instances granted. Waiting...
200 of 201 instances granted. Waiting...
200 of 201 instances granted. Waiting...
All 201 instances granted.
[Errno 24] Too many open files
Do you want to terminate the 201 instances created by this operation? [Y/n]:

I hope this is helpful!

nchammas commented 8 years ago

Great! Your results from Attempt 1 are promising, since they suggest that the AsyncSSH branch uses fewer file handles than the current master branch does during cluster launches. I wonder if things would still work fine with a limit of 2048 or even 1024.

Anyway, I'll put together some tracking code on my side to understand exactly the maximum number of file handles that Flintrock has open simultaneously over the course of a launch.

Once I have some precise measurements for that maximum, I'll pick a value for RLIMIT_NOFILE that should be higher than that for most users launching even obscenely large clusters (maybe up to 1000 nodes?), and have Flintrock set that limit directly so users don't have to do it themselves.
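
As a hedged sketch of what that could look like (the 4096 value and its placement inside Flintrock are assumptions on my part, not the final implementation):

import resource

# Raise the soft file descriptor limit if it is below the target,
# without ever exceeding the hard limit (which requires privileges).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
target = 4096  # assumed value; the real number would come from measurements
if soft < target:
    resource.setrlimit(resource.RLIMIT_NOFILE, (min(target, hard), hard))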

I'll still make an effort to find and fix leaks of open file handles, but my preliminary research suggests that the high allocation of file handles is actually caused by open sockets for SSH connections. Open sockets in Unix are also open files, so I suspect for the most part the high allocation of file handles by Flintrock is reasonable.

nchammas commented 8 years ago

OK, I pushed a few commits to master which fix both issues captured here:

For good measure, I put together a series of tests, first using code adapted from here and later using psutil, to track the maximum number of file handles Flintrock has open simultaneously during a launch.

import psutil
import time
from datetime import datetime

cmdline = []

# Poll for the Flintrock launch process and print how many file
# descriptors it has open, sampling roughly every half second.
while True:
    for ps in psutil.process_iter():
        try:
            if ps.name() == 'Python' and any('flintrock' in part for part in ps.cmdline()):
                # Print the command line once whenever a new launch is detected.
                if ps.cmdline() != cmdline:
                    print(ps.cmdline()[2:])
                    cmdline = ps.cmdline()
                print(
                    datetime.now(),
                    ps.num_fds())
                break
        except psutil.NoSuchProcess:
            pass
    time.sleep(0.5)

Here are the results.

Maximum number of simultaneously open file descriptors:

| Flintrock launch | Threads | AsyncSSH |
|------------------|--------:|---------:|
| 1 slave          | 16      | 15       |
| 20 slaves        | 36      | 25       |
| 200 slaves       | 331     | 414      |

Running these tests was really expensive (:money_with_wings: :sob:) so I didn't do any repeated runs.

I suspect the numbers are heavily dependent on how quickly SSH becomes available on launched instances (e.g. if everything comes up at once, then more file descriptors will be open simultaneously), but I think they're good enough to demonstrate that a limit of 4096 file handles should be enough for anybody launching even 1000-node clusters. We can easily raise the limit later if we need to.

@dhulse @engrean - If you want, you can test out the latest code on master by resetting your OS limits to their defaults and launching 200+ node clusters with Flintrock. It should work out of the box now on OS X. :v: