nchammas / flintrock

A command-line tool for launching Apache Spark clusters.
Apache License 2.0

Add running scripts before and after cluster provision #142

Closed: exLittlePond closed this PR 8 years ago

exLittlePond commented 8 years ago

This PR makes the following changes:

I tested this PR by deploying a cluster to AWS.

A typical use case for this functionality would look like the following:

BenFradet commented 8 years ago

As described in the testing guide, it's a good idea to run

py.test tests/test_static.py

before submitting a PR to catch any formatting quirks.

exLittlePond commented 8 years ago

Sorry. I fixed it.

Please feel free to give me any comments and suggestions. Thank you.

BenFradet commented 8 years ago

No problem. I'm curious what your use case is for those {pre,post}-install scripts; I didn't see them mentioned in the original description.

nchammas commented 8 years ago

Hi @SoyeonBaek-dev and thank you for this PR.

I believe it's the first time someone has tried to implement a new FlintrockService. What did you think of that abstraction? I'm curious if others find it sensible.

Before getting into my review comments, I'd like to point you briefly to our contributing guide--specifically the section on coordinating first. It's generally better to have a bit of discussion upfront before non-trivial changes like this are proposed, for the reasons described in the guide.

Anyway, on to the review.

First off, I'll say that this is a promising feature proposal. I can see the value in being able to provide custom pre- and post-setup hooks for users to do what they want. However, I'd like to see more user feedback on how these features would be used to decide if it's really worth adding this in.

I think many of the potential use cases can be solved with a combination of run-command and modifications to Flintrock's default behavior.

For example:

Replace Java 1.7 with Java 1.8. This can only be done BEFORE Spark installation (BTW, Spark 2.0 deprecates Java 1.7 support)

I think this should be part of Flintrock's default behavior, since our goal is to provide a good out-of-the-box Spark experience. Perhaps we should add an ensure_java8() method that makes sure the host has Java 8 or higher, and use that during launch.
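For illustration only, a minimal sketch of the commands such a check might run on each node over SSH, assuming an Amazon Linux AMI with yum (the method name above and the package choices here are assumptions, not something this PR implements):

# Install OpenJDK 8 first, then drop OpenJDK 7, so a working java stays on the PATH.
sudo yum install -y java-1.8.0-openjdk
sudo yum remove -y java-1.7.0-openjdk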

By the way, why can this only be done before installing Spark? I'm pretty sure you can install Java 8 and then restart Spark (via sbin/) to get the desired effect, no?

Install custom Python package like Anaconda for use with PySpark

Isn't flintrock run-command sufficient for this today?
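For example, something along these lines should already work today (a rough sketch; the cluster name, the install-anaconda.sh script, and the ec2-user home directory are assumptions):

# Copy a hypothetical install script to every node, then run it cluster-wide.
flintrock copy-file my-cluster install-anaconda.sh /home/ec2-user/install-anaconda.sh
flintrock run-command my-cluster 'bash /home/ec2-user/install-anaconda.sh'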

Perform special filesystem operations (like RAID, mounting)

This is interesting. Can you elaborate on this?

Flintrock goes to some lengths to make sure that you get fast, automatically mounted EC2 instance-store volumes, if your selected instance type has them. What other types of storage setup would you be interested in doing?

exLittlePond commented 8 years ago

Thank you for your comment.

First of all, I should have checked the contributing guide and discussed this with you before submitting the PR. Sorry about that.

Getting to the point, I agree with you. A custom post-setup hook can be replaced with flintrock run-command; however, a pre-setup hook has some advantages:

  1. Install and use a custom Java runtime (Oracle JDK instead of OpenJDK, JDK 9, etc.).
  2. No need to restart Spark. Especially once the add-slaves function lands, using run-command would become more complex, since the script would need to be idempotent or we would need to run it only on the newly added slaves.
  3. A single, simple command line supports many use cases.

We have been using Java 8 for a long time and have been doing Spark restarts all along. Let me explain one of the reasons I propose this functionality. We have a custom branch that adds crude add-slaves functionality to Flintrock.[1] Since we'd like to add slaves while an analysis is running, restarting all Spark-related daemons is not an option. So we were left to either: 1. restart only the slaves that have just been created, or 2. build an AMI that has Java 8 by default. That is how we came up with this kind of functionality. I hope this clarifies the background of this PR.

[1]: Open-sourced in our devsisters/flintrock repo

nchammas commented 8 years ago

> First of all, I should have checked the contributing guide and discussed this with you before submitting the PR. Sorry about that.

No worries.

First, regarding the specific issue of needing Java 8, I have created #143. I think we should work on that regardless of what happens here since it's important for the out-of-the-box user experience.

Thanks for detailing your use cases. I can see the value of this feature.

You're right: without a pre-launch hook, any scripts you run with run-command need to be idempotent and must not require a Spark restart.

I'm still hesitant to add something like this because it's a feature we would be implementing ourselves and would have to maintain over the long term. Furthermore, the main driver for it at the moment (installing Java 8) is something we now have plans to address anyway.

How about, instead of adding a new Flintrock feature, we just add support for an existing EC2 feature that does the same thing? That feature is user data.
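For context, EC2 user data is typically a shell script that runs once, as root, on first boot, before Flintrock ever connects to the instances. A minimal sketch covering the Java use case above (package names assume an Amazon Linux AMI):

#!/bin/bash
# Runs once at first boot, before Flintrock installs Spark.
yum install -y java-1.8.0-openjdk
yum remove -y java-1.7.0-openjdk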

Would that satisfy your use cases?

serialx commented 8 years ago

I think adding an EC2 user data feature is also a viable option. It would solve most of our problems.

nchammas commented 8 years ago

OK, in that case (and if @SoyeonBaek-dev agrees) I suggest we close this PR and start over with a new proposal that just exposes a new --ec2-user-data option.

I also suggest waiting to develop this until after #115 gets in (which should be in the next day or two, barring any major surprises) since the new add-slaves command will affect how this feature is implemented. We will need to discuss whether add-slaves can or should automatically poll the existing master for user data to reuse with the new slaves.
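For the record, usage of the proposed option might look roughly like this (hypothetical; the final flag name and behavior would be settled in the new proposal):

flintrock launch my-cluster --ec2-user-data ./bootstrap.sh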

exLittlePond commented 8 years ago

I agree with the ec2-user-data approach, too.

exLittlePond commented 8 years ago

Is it OK if I implement the --ec2-user-data feature?

I think add-slaves should automatically poll the existing master for user data to reuse with the new slaves, so that Spark clusters can have unified settings.

nchammas commented 8 years ago

@SoyeonBaek-dev - Yes, it's all yours. 👍

And yes, I agree, we should source the data from the master automatically when adding new slaves. This matches the behavior of #115.

nchammas commented 8 years ago

Closing this PR per our discussion above about starting over with a proposal to add an --ec2-user-data option to Flintrock.