nchammas / flintrock

A command-line tool for launching Apache Spark clusters.
Apache License 2.0

Possible regression? Cluster Launch With More than 20 instances hits AWS Rate Limits #340

Closed rlaabs closed 3 years ago

rlaabs commented 3 years ago

When launching a cluster (tried m5 and r4 instances) with more than about 20 instances, the following error is raised:

botocore.exceptions.ClientError: An error occurred (RequestLimitExceeded) when calling the DescribeSubnets operation (reached max retries: 4): Request limit exceeded.

From the traceback, it looks like this may be caused by the changes in #296.

Traceback:

Traceback (most recent call last):
  File "/Users/robertlaabs/opt/miniconda3/envs/flintrock_test/bin/flintrock", line 8, in <module>
    sys.exit(main())
  File "/Users/robertlaabs/opt/miniconda3/envs/flintrock_test/lib/python3.9/site-packages/flintrock/flintrock.py", line 1247, in main
    cli(obj={})
  File "/Users/robertlaabs/opt/miniconda3/envs/flintrock_test/lib/python3.9/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/Users/robertlaabs/opt/miniconda3/envs/flintrock_test/lib/python3.9/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/Users/robertlaabs/opt/miniconda3/envs/flintrock_test/lib/python3.9/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/robertlaabs/opt/miniconda3/envs/flintrock_test/lib/python3.9/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/robertlaabs/opt/miniconda3/envs/flintrock_test/lib/python3.9/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/Users/robertlaabs/opt/miniconda3/envs/flintrock_test/lib/python3.9/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/Users/robertlaabs/opt/miniconda3/envs/flintrock_test/lib/python3.9/site-packages/flintrock/flintrock.py", line 486, in launch
    cluster = ec2.launch(
  File "/Users/robertlaabs/opt/miniconda3/envs/flintrock_test/lib/python3.9/site-packages/flintrock/ec2.py", line 54, in wrapper
    res = func(*args, **kwargs)
  File "/Users/robertlaabs/opt/miniconda3/envs/flintrock_test/lib/python3.9/site-packages/flintrock/ec2.py", line 983, in launch
    provision_cluster(
  File "/Users/robertlaabs/opt/miniconda3/envs/flintrock_test/lib/python3.9/site-packages/flintrock/core.py", line 714, in provision_cluster
    run_against_hosts(partial_func=partial_func, hosts=hosts)
  File "/Users/robertlaabs/opt/miniconda3/envs/flintrock_test/lib/python3.9/site-packages/flintrock/core.py", line 510, in run_against_hosts
    future.result()
  File "/Users/robertlaabs/opt/miniconda3/envs/flintrock_test/lib/python3.9/concurrent/futures/_base.py", line 438, in result
    return self.__get_result()
  File "/Users/robertlaabs/opt/miniconda3/envs/flintrock_test/lib/python3.9/concurrent/futures/_base.py", line 390, in __get_result
    raise self._exception
  File "/Users/robertlaabs/opt/miniconda3/envs/flintrock_test/lib/python3.9/concurrent/futures/thread.py", line 52, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/Users/robertlaabs/opt/miniconda3/envs/flintrock_test/lib/python3.9/site-packages/flintrock/core.py", line 775, in provision_node
    service.configure(
  File "/Users/robertlaabs/opt/miniconda3/envs/flintrock_test/lib/python3.9/site-packages/flintrock/services.py", line 222, in configure
    mapping=generate_template_mapping(
  File "/Users/robertlaabs/opt/miniconda3/envs/flintrock_test/lib/python3.9/site-packages/flintrock/core.py", line 458, in generate_template_mapping
    'master_ip': cluster.master_ip,
  File "/Users/robertlaabs/opt/miniconda3/envs/flintrock_test/lib/python3.9/site-packages/flintrock/ec2.py", line 90, in master_ip
    if self.private_network:
  File "/Users/robertlaabs/opt/miniconda3/envs/flintrock_test/lib/python3.9/site-packages/flintrock/ec2.py", line 86, in private_network
    return not ec2.Subnet(self.master_instance.subnet_id).map_public_ip_on_launch
  File "/Users/robertlaabs/opt/miniconda3/envs/flintrock_test/lib/python3.9/site-packages/boto3/resources/factory.py", line 339, in property_loader
    self.load()
  File "/Users/robertlaabs/opt/miniconda3/envs/flintrock_test/lib/python3.9/site-packages/boto3/resources/factory.py", line 505, in do_action
    response = action(self, *args, **kwargs)
  File "/Users/robertlaabs/opt/miniconda3/envs/flintrock_test/lib/python3.9/site-packages/boto3/resources/action.py", line 83, in __call__
    response = getattr(parent.meta.client, operation_name)(**params)
  File "/Users/robertlaabs/opt/miniconda3/envs/flintrock_test/lib/python3.9/site-packages/botocore/client.py", line 276, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/Users/robertlaabs/opt/miniconda3/envs/flintrock_test/lib/python3.9/site-packages/botocore/client.py", line 586, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (RequestLimitExceeded) when calling the DescribeSubnets operation (reached max retries: 4): Request limit exceeded.

nchammas commented 3 years ago

Thanks for the report. I guess the problem is in the repeated calls to ec2.Subnet():

https://github.com/nchammas/flintrock/blob/4763e4c001b180a14e45dc86ab3b875766b01a63/flintrock/ec2.py#L83-L86
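
To illustrate why this pattern can exhaust a request limit, here is a minimal sketch that imitates it without touching AWS. `FakeSubnetAPI`, `FakeCluster`, `provision_node`, the subnet ID, and the IP addresses are hypothetical stand-ins, not Flintrock or boto3 code; the only point is that an uncached property issues one lookup per access, so provisioning N nodes in parallel produces at least N lookups.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeSubnetAPI:
    """Hypothetical stand-in for a DescribeSubnets call; it only counts requests."""

    def __init__(self):
        self.calls = 0
        self._lock = threading.Lock()

    def describe_subnet(self, subnet_id):
        with self._lock:
            self.calls += 1
        return {"MapPublicIpOnLaunch": False}


class FakeCluster:
    """Hypothetical cluster whose properties mirror the shape seen in the traceback."""

    def __init__(self, api, subnet_id):
        self.api = api
        self.subnet_id = subnet_id

    @property
    def private_network(self):
        # Uncached: every access issues a fresh lookup.
        response = self.api.describe_subnet(self.subnet_id)
        return not response["MapPublicIpOnLaunch"]

    @property
    def master_ip(self):
        # Placeholder addresses, chosen only for the illustration.
        return "10.0.0.1" if self.private_network else "203.0.113.1"


api = FakeSubnetAPI()
cluster = FakeCluster(api, "subnet-hypothetical")


def provision_node(_):
    # Each node reads master_ip once, which reads private_network once.
    return cluster.master_ip


with ThreadPoolExecutor(max_workers=25) as pool:
    ips = list(pool.map(provision_node, range(25)))

print(api.calls)  # one lookup per node
```

With 25 simulated nodes the counter ends at 25, and all of those requests arrive at roughly the same time.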

I believe @luhhujbb warned that this would happen. I think a fix worth exploring would be to decorate private_network() with @lru_cache. (There are newer alternatives available like @cached_property, but that requires Python 3.8+. Flintrock currently supports 3.6+.)
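
For reference, this is roughly what the standard-library `@cached_property` alternative looks like on Python 3.8+. It is only a sketch: `FakeCluster` and its `lookups` counter are hypothetical, and the method body stands in for the real subnet lookup.

```python
from functools import cached_property  # standard library, Python 3.8+


class FakeCluster:
    """Hypothetical class; `lookups` counts how often the body actually runs."""

    def __init__(self):
        self.lookups = 0

    @cached_property
    def private_network(self):
        # Stand-in for the expensive subnet lookup.
        self.lookups += 1
        return True


cluster = FakeCluster()
results = [cluster.private_network for _ in range(25)]
print(cluster.lookups)
```

The first access stores the result on the instance, so later accesses never run the method again. Because the import itself fails on Python 3.6 and 3.7, this form would not work under Flintrock's current support range.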

luhhujbb commented 3 years ago

Hi, it seems that @cached_property is available for Python 3.6 through the third-party cached-property package.

nchammas commented 3 years ago

That's neat. I would prefer to stick to the standard library, but if there is a problem combining @property with @lru_cache then cached-property would be a good way to go.

luhhujbb commented 3 years ago

@nchammas Combining @property with @functools.lru_cache seems to work fine; see https://stackoverflow.com/questions/4037481/caching-class-attributes-in-python
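
A minimal sketch of that combination, using only the standard library and running on Python 3.6+. `FakeCluster`, `subnet_id`, and the `lookups` counter are hypothetical names for illustration; the method body stands in for the real `ec2.Subnet(...)` lookup.

```python
from functools import lru_cache


class FakeCluster:
    """Hypothetical class; `lookups` counts how often the body actually runs."""

    def __init__(self, subnet_id):
        self.subnet_id = subnet_id
        self.lookups = 0

    @property
    @lru_cache(maxsize=None)
    def private_network(self):
        # Stand-in for the expensive subnet lookup.
        # lru_cache keys the result on `self`, so each instance is computed once.
        self.lookups += 1
        return True


first = FakeCluster("subnet-a")
second = FakeCluster("subnet-b")

for _ in range(25):
    first.private_network
    second.private_network

print(first.lookups, second.lookups)
```

Two caveats apply to this approach. The cache holds a strong reference to every instance it has seen, which is usually harmless in a short-lived command-line process but matters in long-running ones. Also, `lru_cache` does not guarantee a single call when several threads hit an empty cache at the same moment, so a few duplicate lookups may still occur at startup.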