Closed: nchammas closed this pull request 8 years ago.
Not sure why the Travis build is failing. Am investigating here: https://github.com/pyinstaller/pyinstaller/issues/2123
I am ready to merge this PR in.
Pinging @ereed-tesla, @serialx, and @soyeonbaek-dev, since I know you are interested in this feature.
Do you have any feedback or questions about this before I merge it in? Especially regarding some of the open questions noted above.
First, hooray for adding cluster resize! Very awaited feature indeed! 😃
For those who don't know: we have been cooking our own cluster resize patches to Flintrock and have been using them for over two weeks. We mainly use Spark without HDFS, so our experience with dynamic cluster sizes is limited to Spark, and we don't restart the master. Both adding and removing slaves seem to work fine. Since Spark is itself fault tolerant, it copes well with this.
About removing slaves: the code simply removes the first N slaves in the list. I think it would be better to have some kind of prioritisation, like removing spot instances first. You can see @soyeonbaek-dev's code
> About removing slaves. ... I think it would be better to have some kind of prioritisation, like removing spot instances first.
This is a good suggestion. If you have a mix of spot and on-demand instances (which this PR now makes possible), you probably do want to remove the spot instances first.
I'll add this to the PR.
OK, that's done. I'll leave this open for a few more days and then merge it in if there's no more feedback.
Testing this branch revealed a bug related to configuring EC2 instance profiles. When I use the `--ec2-instance-profile-name` setting to create a cluster, the `add-slaves` command fails with the following error:
```
Traceback (most recent call last):
  File "/usr/local/bin/flintrock", line 9, in <module>
    load_entry_point('Flintrock==0.6.0.dev0', 'console_scripts', 'flintrock')()
  File "/usr/local/lib/python3.5/site-packages/flintrock/flintrock.py", line 1031, in main
    cli(obj={})
  File "/usr/local/lib/python3.5/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.5/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.5/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.5/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.5/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.5/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/local/lib/python3.5/site-packages/flintrock/flintrock.py", line 653, in add_slaves
    **provider_options)
  File "/usr/local/lib/python3.5/site-packages/flintrock/ec2.py", line 46, in wrapper
    res = func(*args, **kwargs)
  File "/usr/local/lib/python3.5/site-packages/flintrock/ec2.py", line 272, in add_slaves
    instance_initiated_shutdown_behavior=instance_initiated_shutdown_behavior)
  File "/usr/local/lib/python3.5/site-packages/flintrock/ec2.py", line 677, in _create_instances
    'EbsOptimized': ebs_optimized})['SpotInstanceRequests']
  File "/usr/local/lib/python3.5/site-packages/botocore/client.py", line 278, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python3.5/site-packages/botocore/client.py", line 572, in _make_api_call
    raise ClientError(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (InvalidParameterValue) when calling the RequestSpotInstances operation: Value (AIPAJZTBVWIAVNULB3U6W) for parameter iamInstanceProfile.name is invalid. Invalid IAM Instance Profile name
```
I was able to fix this bug using a simple patch. But I don't really like how this works. Perhaps you have a better solution to this. :)
```diff
diff --git a/flintrock/ec2.py b/flintrock/ec2.py
index 0f3dd1a..630d8fa 100644
--- a/flintrock/ec2.py
+++ b/flintrock/ec2.py
@@ -249,7 +249,10 @@ class EC2Cluster(FlintrockCluster):
         if not self.master_instance.iam_instance_profile:
             instance_profile_name = ''
         else:
-            instance_profile_name = self.master_instance.iam_instance_profile['Id']
+            iam = boto3.resource('iam')
+            instance_profile_id = self.master_instance.iam_instance_profile['Id']
+            profiles = filter(lambda x: x.instance_profile_id == instance_profile_id, iam.instance_profiles.all())
+            instance_profile_name = list(profiles)[0].instance_profile_name
         instance_initiated_shutdown_behavior = response['InstanceInitiatedShutdownBehavior']['Value']
         self.add_slaves_check()
```
Edit: Oops! Wrong traceback. Changed to the one relevant!
Hmm, so the IAM profile ID is different from the IAM profile name? I presume `iam_instance_profile['Arn']` is also not what we want?
I'll take a closer look at this tonight or tomorrow.
I tried the `Arn` approach. Also failed:
```
Traceback (most recent call last):
  File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/runpy.py", line 170, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/serialx/workspace/flintrock/flintrock/__main__.py", line 8, in <module>
    sys.exit(main())
  File "/Users/serialx/workspace/flintrock/flintrock/flintrock.py", line 1031, in main
    cli(obj={})
  File "/Users/serialx/.virtualenvs/flintrock/lib/python3.5/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/Users/serialx/.virtualenvs/flintrock/lib/python3.5/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/Users/serialx/.virtualenvs/flintrock/lib/python3.5/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/serialx/.virtualenvs/flintrock/lib/python3.5/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/serialx/.virtualenvs/flintrock/lib/python3.5/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/Users/serialx/.virtualenvs/flintrock/lib/python3.5/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/Users/serialx/workspace/flintrock/flintrock/flintrock.py", line 653, in add_slaves
    **provider_options)
  File "/Users/serialx/workspace/flintrock/flintrock/ec2.py", line 46, in wrapper
    res = func(*args, **kwargs)
  File "/Users/serialx/workspace/flintrock/flintrock/ec2.py", line 275, in add_slaves
    instance_initiated_shutdown_behavior=instance_initiated_shutdown_behavior)
  File "/Users/serialx/workspace/flintrock/flintrock/ec2.py", line 680, in _create_instances
    'EbsOptimized': ebs_optimized})['SpotInstanceRequests']
  File "/Users/serialx/.virtualenvs/flintrock/lib/python3.5/site-packages/botocore/client.py", line 278, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/Users/serialx/.virtualenvs/flintrock/lib/python3.5/site-packages/botocore/client.py", line 572, in _make_api_call
    raise ClientError(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (InvalidParameterValue) when calling the RequestSpotInstances operation: Value (arn:aws:iam::527973214447:instance-profile/zeppelin-role) for parameter iamInstanceProfile.name is invalid. Invalid IAM Instance Profile name
```
I guess we need to feed it with IAM Profile Names. :(
Thanks for investigating, @serialx. It looks like you're right: the only way to get the profile name is with the heavy-handed pull-and-filter. I've reported a couple of issues against Boto3 to see if there is a way this can be fixed at some point:
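As an aside: since the instance profile ARN in the error above ends with the profile name (`.../instance-profile/zeppelin-role`), one could in principle avoid the list-and-filter scan by parsing the name out of `iam_instance_profile['Arn']`. A hedged sketch, assuming the final `/`-separated segment of the ARN is always the profile name (this holds even for profiles created under an IAM path); it is not what the PR actually does:

```python
def profile_name_from_arn(arn):
    """Extract the instance profile name from its ARN.

    An instance profile ARN looks like:
        arn:aws:iam::123456789012:instance-profile/my-profile
    (possibly with an IAM path before the name), so the name is the
    final '/'-separated segment.
    """
    return arn.rsplit('/', 1)[-1]
```

This trades a potentially slow API scan for a string-format assumption, so the scan-based lookup is the safer choice if the ARN layout is not guaranteed.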
I've pushed a new commit that fixes this based off of your patch above. Take a look and let me know if you have any other concerns. Thanks again for finding and reporting this issue.
This PR adds `add-slaves` and `remove-slaves` commands, which enable the user to resize existing clusters. `add-slaves` will query the master for its configuration and automatically use that information to provision new slaves. The only configuration the user may want to specify is the spot price for new slaves.

The PR also fixes a number of bugs, including a surprising config problem I discovered (0c9a24b; how did it not stand out before? 🤔) that may have been causing sporadic problems. These bugs were interfering with my ability to test this PR, so I thought it would be good to fix them as part of this work.
TODO:

- [x] Implement `add-slaves`.
- [x] Implement `remove-slaves`.
- [x] Guard `instances.filter()` from empty input. (See https://github.com/boto/boto3/issues/479)
- [x] Add acceptance tests. Will defer this till later, since the full test suite already takes too long, and since testing this properly is a bit difficult.
Open questions:

- Does the HDFS master need to be restarted? HDFS seems to work properly, though warnings are thrown when you add new files, due to the dead slaves.
- `hdfs dfsadmin -report` still shows the removed slaves, and `hdfs dfsadmin -refreshNodes` doesn't seem to remove them from the report that `dfsadmin` shows. I'm not sure if this is OK.

Fixes #16. Fixes #113. Fixes #140.