nchammas / flintrock

A command-line tool for launching Apache Spark clusters.
Apache License 2.0

Make Spark/Hadoop service installation idempotent #237

Open steve-drew-strong-bridge opened 6 years ago

steve-drew-strong-bridge commented 6 years ago

As a user of flintrock, I would like to shave a lot of time off of spinning up new clusters. To do so, I would like to copy an AMI from a previous flintrock install and reuse that.

Expected: without reinstalling HDFS and Spark, the new AMI is instantiated, the slaves file is updated, the master IP is updated in the appropriate config files, and HDFS/Spark are launched.

Bonus expectation: I would love to tell Flintrock that I've already configured the drives correctly and be able to skip the ephemeral drive allocation step as well.

Actual: Today, I have to turn on the installation of each service just to get it configured. There are no time savings from using an AMI with the software pre-installed.

steve-drew-strong-bridge commented 6 years ago

Quick thought: As a shorter path to this, it might be feasible to say that if the service directory exists (as Flintrock would name it) on the instance, then Flintrock considers that service installed. E.g., if the /home/ec2-user/hadoop folder exists, skip the install. It's up to the AMI owner at that point to be sure that Hadoop is correctly installed. All Flintrock does is push config files and start the services.

The same could be done for drive configuration. If (assuming we can't just look at df for /media/ephemeral0, /media/ephemeral1) a file exists with a specific name (e.g., FlintrockDrivesInstalledDontBotherDoingItAgain) then skip that as well.
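The two heuristics above can be sketched together. This is a minimal local-filesystem sketch, not Flintrock's actual code: in practice these checks would run over SSH on each instance, and the `SERVICE_DIRS` mapping and marker file name are hypothetical placeholders.

```python
from pathlib import Path

# Hypothetical names for illustration only; Flintrock's real layout may differ.
SERVICE_DIRS = {
    'hadoop': 'hadoop',
    'spark': 'spark',
}
DRIVE_MARKER = '.flintrock-drives-configured'  # hypothetical marker file name


def service_looks_installed(home: Path, service: str) -> bool:
    """Treat a service as installed if its directory already exists.

    Whether the install inside that directory is actually correct is the
    AMI owner's responsibility; Flintrock would only push configs and
    start the service.
    """
    return (home / SERVICE_DIRS[service]).is_dir()


def drives_look_configured(home: Path) -> bool:
    """Skip ephemeral-drive setup if the marker file is present."""
    return (home / DRIVE_MARKER).is_file()
```

On a real cluster the same tests could be a single shell command per node (e.g. `test -d hadoop` / `test -f <marker>`), with the exit status deciding whether to skip the step.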

nchammas commented 6 years ago

Service installation and configuration are already separated in Flintrock. We leverage this separation when adding new nodes to a cluster, for example, since when that happens all existing nodes need to have their services reconfigured but not reinstalled.

I believe what you're asking for is that installation be idempotent. One easy example of Flintrock implementing a declarative-style method of managing software is ensure_java8().

To accomplish what you're looking for, we'd need to do a few things, some of which you touched on:

  1. Add a new method to FlintrockService, maybe called _is_installed(), which takes the same input as install() and returns a boolean saying whether or not that particular service is installed. That's where we'd capture the logic defining what "installed" means for each service. We'd call _is_installed() somewhere to figure out if we need to do anything. (Maybe this is a good use case for a decorator? I'm not sure.) It may also be better to just add the appropriate logic directly to each install() method.

  2. We need to do something similar for the ephemeral drives. The "proper" way to do it is probably to convert that code into a FlintrockService and follow the _is_installed() pattern, but we can also get away with just updating this code to make the check we want and skip setup when appropriate.

    Detecting when a drive is already set up and doesn't need any work is tricky because EC2 behavior in this regard is, to quote myself, "haphazard". Maybe your idea of having a marker file would work, but I'm concerned about adding a lot of clutter to handle this and muddying the logic for formatting ephemeral vs. EBS volumes.
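The decorator idea from point 1 might look roughly like this. It's only a sketch under assumptions: `skip_if_installed` and `ToyService` are hypothetical names, and a real `_is_installed()` would take the same inputs as `install()` and probe the instance over SSH rather than read an in-memory flag.

```python
import functools


def skip_if_installed(install_method):
    """Decorator sketch: consult _is_installed() first and make install()
    a no-op when the service is already present (e.g. baked into an AMI)."""
    @functools.wraps(install_method)
    def wrapper(self, *args, **kwargs):
        if self._is_installed(*args, **kwargs):
            return  # already installed; configuration/startup happen later
        return install_method(self, *args, **kwargs)
    return wrapper


class ToyService:
    """Stand-in for a FlintrockService subclass, for illustration only."""

    def __init__(self, already_installed: bool):
        self.already_installed = already_installed
        self.install_calls = 0  # counts how often the real install body ran

    def _is_installed(self) -> bool:
        # A real implementation might check for the service's directory
        # on the remote host, per the heuristic discussed above.
        return self.already_installed

    @skip_if_installed
    def install(self):
        self.install_calls += 1
```

The alternative mentioned above, putting the check directly at the top of each `install()` method, trades the decorator's uniformity for per-service flexibility in what "installed" means.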

This is a good request, and discussing it reminds me again just how close Flintrock comes to reinventing other tools (like Ansible). 😄 Since Flintrock is strictly limited to Apache Spark and Hadoop, I'm fine with refining how we do things as long as it doesn't add a lot of complexity.

It'll take a bit of work here to implement this in a non-hacky way, but I think it's possible, especially for the main services like Spark and Hadoop.