nchammas / flintrock

A command-line tool for launching Apache Spark clusters.
Apache License 2.0
638 stars 116 forks source link

Add add-slaves and remove-slaves commands #115

Closed nchammas closed 8 years ago

nchammas commented 8 years ago

This PR adds add-slaves and remove-slaves commands which enable the user to resize existing clusters. add-slaves will query the master for its configuration and automatically use that information to provision new slaves. The only configuration the user may want to specify is the spot price for new slaves.

The PR also fixes a number of bugs, including a surprising config problem I discovered ( 0c9a24b; how did it not stand out before? 🤔) that may have been causing sporadic problems. These bugs were interfering with my ability to test this PR, so I thought it would be good to fix them as part of this work.

TODO:

Open questions:

Fixes #16. Fixes #113. Fixes #140.

nchammas commented 8 years ago

Not sure why the Travis build is failing. Am investigating here: https://github.com/pyinstaller/pyinstaller/issues/2123

nchammas commented 8 years ago

I am ready to merge this PR in.

Pinging @ereed-tesla, @serialx, and @soyeonbaek-dev, since I know you are interested in this feature.

Do you have any feedback or questions about this before I merge it in? Especially regarding the some of the open questions noted above.

serialx commented 8 years ago

First, hooray for adding cluster resize! Very awaited feature indeed! 😃

For those who don't know: We have been cooking our own cluster resize patches to flintrock and have been using that for over two weeks. We mainly only use Spark without HDFS, so our experiences with dynamic cluster sizes are limited to Spark. And we don't restart master. It seems to work fine with both adding and removing slaves. Since the system itself is fault tolerant, Spark seems to work fine with this.

About removing slaves. The code simply removes N slaves in the list. I think it would be better to have some kind of prioritisation, like removing spot instances first. You can see @soyeonbaek-dev's code

nchammas commented 8 years ago

About removing slaves. ... I think it would be better to have some kind of prioritisation, like removing spot instances first.

This is a good suggestion. If you have a mix of spot and on-demand instances (which this PR now makes possible), you probably do want to remove the spot instances first.

I'll add this to the PR.

nchammas commented 8 years ago

OK, that's done. I'll leave this open for a few more days and then merge it in if there's no more feedback.

serialx commented 8 years ago

Testing this branch revealed a bug related to configuring EC2 instance profiles. When I use --ec2-instance-profile-name settings to create a cluster, add-slaves command fails with the following error:

Traceback (most recent call last):
  File "/usr/local/bin/flintrock", line 9, in <module>
    load_entry_point('Flintrock==0.6.0.dev0', 'console_scripts', 'flintrock')()
  File "/usr/local/lib/python3.5/site-packages/flintrock/flintrock.py", line 1031, in main
    cli(obj={})
  File "/usr/local/lib/python3.5/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.5/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.5/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.5/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.5/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.5/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/local/lib/python3.5/site-packages/flintrock/flintrock.py", line 653, in add_slaves
    **provider_options)
  File "/usr/local/lib/python3.5/site-packages/flintrock/ec2.py", line 46, in wrapper
    res = func(*args, **kwargs)
  File "/usr/local/lib/python3.5/site-packages/flintrock/ec2.py", line 272, in add_slaves
    instance_initiated_shutdown_behavior=instance_initiated_shutdown_behavior)
  File "/usr/local/lib/python3.5/site-packages/flintrock/ec2.py", line 677, in _create_instances
    'EbsOptimized': ebs_optimized})['SpotInstanceRequests']
  File "/usr/local/lib/python3.5/site-packages/botocore/client.py", line 278, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python3.5/site-packages/botocore/client.py", line 572, in _make_api_call
    raise ClientError(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (InvalidParameterValue) when calling the RequestSpotInstances operation: Value (AIPAJZTBVWIAVNULB3U6W) for parameter iamInstanceProfile.name is invalid. Invalid IAM Instance Profile name

I was able to fix this bug using a simple patch. But I don't really like how this works. Perhaps you have a better solution to this. :)

diff --git a/flintrock/ec2.py b/flintrock/ec2.py
index 0f3dd1a..630d8fa 100644
--- a/flintrock/ec2.py
+++ b/flintrock/ec2.py
@@ -249,7 +249,10 @@ class EC2Cluster(FlintrockCluster):
         if not self.master_instance.iam_instance_profile:
             instance_profile_name = ''
         else:
-            instance_profile_name = self.master_instance.iam_instance_profile['Id']
+            iam = boto3.resource('iam')
+            instance_profile_id = self.master_instance.iam_instance_profile['Id']
+            profiles = filter(lambda x: x.instance_profile_id == instance_profile_id, iam.instance_profiles.all())
+            instance_profile_name = list(profiles)[0].instance_profile_name
         instance_initiated_shutdown_behavior = response['InstanceInitiatedShutdownBehavior']['Value']

         self.add_slaves_check()

Edit: Oops! Wrong traceback. Changed to the one relevant!

nchammas commented 8 years ago

Hmm, so the IAM profile ID is different from the IAM profile name? I presume iam_instance_profile['Arn'] is also not what we want?

I'll take a closer look at this tonight or tomorrow.

serialx commented 8 years ago

I tried the Arn approach. Also failed:

Traceback (most recent call last):
  File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/runpy.py", line 170, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/serialx/workspace/flintrock/flintrock/__main__.py", line 8, in <module>
    sys.exit(main())
  File "/Users/serialx/workspace/flintrock/flintrock/flintrock.py", line 1031, in main
    cli(obj={})
  File "/Users/serialx/.virtualenvs/flintrock/lib/python3.5/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/Users/serialx/.virtualenvs/flintrock/lib/python3.5/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/Users/serialx/.virtualenvs/flintrock/lib/python3.5/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/serialx/.virtualenvs/flintrock/lib/python3.5/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/serialx/.virtualenvs/flintrock/lib/python3.5/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/Users/serialx/.virtualenvs/flintrock/lib/python3.5/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/Users/serialx/workspace/flintrock/flintrock/flintrock.py", line 653, in add_slaves
    **provider_options)
  File "/Users/serialx/workspace/flintrock/flintrock/ec2.py", line 46, in wrapper
    res = func(*args, **kwargs)
  File "/Users/serialx/workspace/flintrock/flintrock/ec2.py", line 275, in add_slaves
    instance_initiated_shutdown_behavior=instance_initiated_shutdown_behavior)
  File "/Users/serialx/workspace/flintrock/flintrock/ec2.py", line 680, in _create_instances
    'EbsOptimized': ebs_optimized})['SpotInstanceRequests']
  File "/Users/serialx/.virtualenvs/flintrock/lib/python3.5/site-packages/botocore/client.py", line 278, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/Users/serialx/.virtualenvs/flintrock/lib/python3.5/site-packages/botocore/client.py", line 572, in _make_api_call
    raise ClientError(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (InvalidParameterValue) when calling the RequestSpotInstances operation: Value (arn:aws:iam::527973214447:instance-profile/zeppelin-role) for parameter iamInstanceProfile.name is invalid. Invalid IAM Instance Profile name

I guess we need to feed it with IAM Profile Names. :(

nchammas commented 8 years ago

Thanks for investigating @serialx, it looks like you're right. The only way to get the profile by name is with the heavy-handed pull and filter. I've reported a couple of issues against Boto3 to see if there is a way this can be fixed at some point:

I've pushed a new commit that fixes this based off of your patch above. Take a look and let me know if you have any other concerns. Thanks again for finding and reporting this issue.