nchammas / flintrock

A command-line tool for launching Apache Spark clusters.
Apache License 2.0

Possible issue in launching large number of instances and then destroying the cluster #341

Closed mkhan037 closed 3 years ago

mkhan037 commented 3 years ago

While launching a 24-slave t2.micro spark-cluster (25 instances in total) to check whether issue #340 also happens in our case, I ran into a different problem. Even though I asked for 24 slaves, flintrock only brought up 23 nodes at first. When I tried to destroy the cluster immediately afterwards, it hit an error, and subsequent calls to destroy reported that the cluster had 2 running nodes and failed with a different error while destroying it. After I terminated those instances from the AWS console, flintrock destroy no longer threw an error.

Note that my current running-instance vCPU limit is 32; however, that should not be an issue, since the launched cluster used only 25 vCPUs (25 t2.micro instances at 1 vCPU each).

log

(aws)[mkhan@inv36 ~]$ flintrock launch --num-slaves 24 spark-cluster
Launching 25 instances...
[3.141.106.37] SSH online.
[18.219.138.172] SSH online.
[18.222.201.80] SSH online.
[3.141.105.193] SSH online.
[3.18.112.6] SSH online.
[18.188.57.208] SSH online.
[18.221.206.110] SSH online.
[18.219.151.70] SSH online.
[18.223.162.153] SSH online.
[18.216.105.247] SSH online.
[18.216.84.221] SSH online.
[18.216.174.79] SSH online.
[18.188.59.110] SSH online.
[13.58.202.233] SSH online.
[18.116.204.133] SSH online.
[3.15.177.192] SSH online.
[13.58.48.119] SSH online.
[52.15.106.68] SSH online.
[18.221.197.55] SSH online.
[52.14.171.25] SSH online.
[3.137.151.208] SSH online.
[13.58.189.39] SSH online.
[3.131.94.95] SSH online.
[3.141.106.37] Configuring ephemeral storage...
[18.219.138.172] Configuring ephemeral storage...
[3.18.112.6] Configuring ephemeral storage...
[18.221.206.110] Configuring ephemeral storage...
[18.216.105.247] Configuring ephemeral storage...
[18.222.201.80] Configuring ephemeral storage...
[3.141.105.193] Configuring ephemeral storage...
[18.188.57.208] Configuring ephemeral storage...
[18.216.174.79] Configuring ephemeral storage...
[18.219.151.70] Configuring ephemeral storage...
[3.15.177.192] Configuring ephemeral storage...
[13.58.202.233] Configuring ephemeral storage...
[18.116.204.133] Configuring ephemeral storage...
[18.188.59.110] Configuring ephemeral storage...
[18.216.84.221] Configuring ephemeral storage...
[3.137.151.208] Configuring ephemeral storage...
[18.223.162.153] Configuring ephemeral storage...
[52.15.106.68] Configuring ephemeral storage...
[13.58.48.119] Configuring ephemeral storage...
[52.14.171.25] Configuring ephemeral storage...
[18.221.197.55] Configuring ephemeral storage...
[13.58.189.39] Configuring ephemeral storage...
[3.131.94.95] Configuring ephemeral storage...
Java 8 is already installed, skipping Java install
[3.18.112.6] Installing HDFS...
Java 8 is already installed, skipping Java install
[18.216.105.247] Installing HDFS...
Java 8 is already installed, skipping Java install
[18.221.206.110] Installing HDFS...
Java 8 is already installed, skipping Java install
[3.137.151.208] Installing HDFS...
Java 8 is already installed, skipping Java install
[18.219.151.70] Installing HDFS...
Java 8 is already installed, skipping Java install
[52.14.171.25] Installing HDFS...
Java 8 is already installed, skipping Java install
[3.15.177.192] Installing HDFS...
Java 8 is already installed, skipping Java install
[18.116.204.133] Installing HDFS...
Java 8 is already installed, skipping Java install
[18.216.84.221] Installing HDFS...
Java 8 is already installed, skipping Java install
[18.219.138.172] Installing HDFS...
Java 8 is already installed, skipping Java install
[18.221.197.55] Installing HDFS...
Java 8 is already installed, skipping Java install
[13.58.189.39] Installing HDFS...
Java 8 is already installed, skipping Java install
[18.223.162.153] Installing HDFS...
Java 8 is already installed, skipping Java install
[3.141.106.37] Installing HDFS...
Java 8 is already installed, skipping Java install
[18.222.201.80] Installing HDFS...
Java 8 is already installed, skipping Java install
[18.188.57.208] Installing HDFS...
Java 8 is already installed, skipping Java install
[3.141.105.193] Installing HDFS...
Java 8 is already installed, skipping Java install
[13.58.202.233] Installing HDFS...
Java 8 is already installed, skipping Java install
[13.58.48.119] Installing HDFS...
Java 8 is already installed, skipping Java install
[52.15.106.68] Installing HDFS...
Java 8 is already installed, skipping Java install
[18.188.59.110] Installing HDFS...
Java 8 is already installed, skipping Java install
[18.216.174.79] Installing HDFS...
Java 8 is already installed, skipping Java install
[3.131.94.95] Installing HDFS...
[18.116.204.133] Installing Spark...
[18.216.174.79] Installing Spark...
[3.137.151.208] Installing Spark...
[3.18.112.6] Installing Spark...
[18.221.197.55] Installing Spark...
[13.58.189.39] Installing Spark...
[18.216.84.221] Installing Spark...
[52.14.171.25] Installing Spark...
[18.216.105.247] Installing Spark...
[52.15.106.68] Installing Spark...
[3.141.105.193] Installing Spark...
[18.223.162.153] Installing Spark...
[18.188.57.208] Installing Spark...
[18.219.138.172] Installing Spark...
[18.219.151.70] Installing Spark...
[18.221.206.110] Installing Spark...
[18.222.201.80] Installing Spark...
[13.58.48.119] Installing Spark...
[3.141.106.37] Installing Spark...
[3.15.177.192] Installing Spark...
[18.188.59.110] Installing Spark...
[13.58.202.233] Installing Spark...
[3.131.94.95] Installing Spark...
[18.219.151.70] Configuring HDFS master...
[18.219.151.70] Configuring Spark master...
HDFS online.
Spark online.
launch finished in 0:02:58.
Cluster master: ec2-18-219-151-70.us-east-2.compute.amazonaws.com
Login with: flintrock login spark-cluster
(aws)[mkhan@inv36 ~]$ flintrock destroy spark-cluster
spark-cluster:
  state: running
  node-count: 23
  master: ec2-18-219-151-70.us-east-2.compute.amazonaws.com
  slaves:
    - ec2-13-58-202-233.us-east-2.compute.amazonaws.com
    - ec2-18-216-84-221.us-east-2.compute.amazonaws.com
    - ec2-18-216-174-79.us-east-2.compute.amazonaws.com
    - ec2-3-141-106-37.us-east-2.compute.amazonaws.com
    - ec2-18-223-162-153.us-east-2.compute.amazonaws.com
    - ec2-13-58-189-39.us-east-2.compute.amazonaws.com
    - ec2-3-18-112-6.us-east-2.compute.amazonaws.com
    - ec2-18-188-57-208.us-east-2.compute.amazonaws.com
    - ec2-52-14-171-25.us-east-2.compute.amazonaws.com
    - ec2-18-219-138-172.us-east-2.compute.amazonaws.com
    - ec2-3-15-177-192.us-east-2.compute.amazonaws.com
    - ec2-18-221-197-55.us-east-2.compute.amazonaws.com
    - ec2-18-188-59-110.us-east-2.compute.amazonaws.com
    - ec2-18-116-204-133.us-east-2.compute.amazonaws.com
    - ec2-18-216-105-247.us-east-2.compute.amazonaws.com
    - ec2-3-137-151-208.us-east-2.compute.amazonaws.com
    - ec2-18-221-206-110.us-east-2.compute.amazonaws.com
    - ec2-18-222-201-80.us-east-2.compute.amazonaws.com
    - ec2-3-131-94-95.us-east-2.compute.amazonaws.com
    - ec2-3-141-105-193.us-east-2.compute.amazonaws.com
    - ec2-13-58-48-119.us-east-2.compute.amazonaws.com
    - ec2-52-15-106-68.us-east-2.compute.amazonaws.com
Are you sure you want to destroy this cluster? [y/N]: y
Destroying spark-cluster...
Traceback (most recent call last):
  File "/home/mkhan/anaconda/anaconda3/envs/aws/bin/flintrock", line 8, in <module>
    sys.exit(main())
  File "/home/mkhan/anaconda/anaconda3/envs/aws/lib/python3.8/site-packages/flintrock/flintrock.py", line 1247, in main
    cli(obj={})
  File "/home/mkhan/anaconda/anaconda3/envs/aws/lib/python3.8/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/mkhan/anaconda/anaconda3/envs/aws/lib/python3.8/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/mkhan/anaconda/anaconda3/envs/aws/lib/python3.8/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/mkhan/anaconda/anaconda3/envs/aws/lib/python3.8/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/mkhan/anaconda/anaconda3/envs/aws/lib/python3.8/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/mkhan/anaconda/anaconda3/envs/aws/lib/python3.8/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/mkhan/anaconda/anaconda3/envs/aws/lib/python3.8/site-packages/flintrock/flintrock.py", line 577, in destroy
    cluster.destroy()
  File "/home/mkhan/anaconda/anaconda3/envs/aws/lib/python3.8/site-packages/flintrock/ec2.py", line 200, in destroy
    cluster_group.delete()
  File "/home/mkhan/anaconda/anaconda3/envs/aws/lib/python3.8/site-packages/boto3/resources/factory.py", line 520, in do_action
    response = action(self, *args, **kwargs)
  File "/home/mkhan/anaconda/anaconda3/envs/aws/lib/python3.8/site-packages/boto3/resources/action.py", line 83, in __call__
    response = getattr(parent.meta.client, operation_name)(**params)
  File "/home/mkhan/anaconda/anaconda3/envs/aws/lib/python3.8/site-packages/botocore/client.py", line 276, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/home/mkhan/anaconda/anaconda3/envs/aws/lib/python3.8/site-packages/botocore/client.py", line 586, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (DependencyViolation) when calling the DeleteSecurityGroup operation: resource sg-0abb9850b14714211 has a dependent object
(aws)[mkhan@inv36 ~]$ flintrock destroy spark-cluster
spark-cluster:
  state: running
  node-count: 2
  master:
Traceback (most recent call last):
  File "/home/mkhan/anaconda/anaconda3/envs/aws/bin/flintrock", line 8, in <module>
    sys.exit(main())
  File "/home/mkhan/anaconda/anaconda3/envs/aws/lib/python3.8/site-packages/flintrock/flintrock.py", line 1247, in main
    cli(obj={})
  File "/home/mkhan/anaconda/anaconda3/envs/aws/lib/python3.8/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/mkhan/anaconda/anaconda3/envs/aws/lib/python3.8/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/mkhan/anaconda/anaconda3/envs/aws/lib/python3.8/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/mkhan/anaconda/anaconda3/envs/aws/lib/python3.8/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/mkhan/anaconda/anaconda3/envs/aws/lib/python3.8/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/mkhan/anaconda/anaconda3/envs/aws/lib/python3.8/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/mkhan/anaconda/anaconda3/envs/aws/lib/python3.8/site-packages/flintrock/flintrock.py", line 571, in destroy
    cluster.print()
  File "/home/mkhan/anaconda/anaconda3/envs/aws/lib/python3.8/site-packages/flintrock/ec2.py", line 455, in print
    ['  slaves:'] + (self.slave_hosts if self.num_slaves > 0 else [])))
  File "/home/mkhan/anaconda/anaconda3/envs/aws/lib/python3.8/site-packages/flintrock/ec2.py", line 115, in slave_hosts
    if self.private_network:
  File "/home/mkhan/anaconda/anaconda3/envs/aws/lib/python3.8/site-packages/flintrock/ec2.py", line 86, in private_network
    return not ec2.Subnet(self.master_instance.subnet_id).map_public_ip_on_launch
AttributeError: 'NoneType' object has no attribute 'subnet_id'
nchammas commented 3 years ago

Are you able to reproduce this issue consistently, or is it intermittent?

Is there something special about requesting 25 instances, or do you also see the issue with smaller clusters?

mkhan037 commented 3 years ago

I tried to reproduce this issue by running the same command 4 more times, but it did not happen again. I did encounter issue #340 one of those times.

As for the number 25, there is nothing special about it other than that I was trying to reproduce issue #340 with more than 20 nodes, since it was mentioned there that launching more than 20 nodes can trigger that issue. I did not encounter any issues with smaller clusters.

nchammas commented 3 years ago

Strange. If this happens to you again and you get some more detail on what the cause is, please post it here.

From your original report, I can see that the initial problem is this:

botocore.exceptions.ClientError: An error occurred (DependencyViolation) when calling the DeleteSecurityGroup operation: resource sg-0abb9850b14714211 has a dependent object

I'm guessing that happened only once because the cause is related to AWS resource consistency. Maybe Flintrock deleted something that sg-0abb9850b14714211 depends on, but there was a brief moment when that change didn't propagate sufficiently throughout AWS. So when Flintrock went to delete sg-0abb9850b14714211 itself, we got this dependency violation.

That's just a guess. But if that was the issue, then I suppose a general solution would be to adjust some boto settings to retry operations more times before failing (though I don't think those settings apply to DependencyViolation errors), or perhaps to add waiters on specific operations that are known to take time to propagate out, like deleting a security group dependency (e.g. another security group, or a security group rule).
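
To make that concrete, here is a rough sketch of both ideas (not Flintrock's actual code), assuming a plain boto3 EC2 client and a known security group ID; the retry count and delay are arbitrary:

```python
# Rough sketch only -- not Flintrock's actual code. Assumes a boto3 EC2
# client and a known security group ID; retry counts/delays are arbitrary.
import time

import boto3
import botocore.config
import botocore.exceptions

# Idea 1: bump botocore's built-in retries. This helps with throttling and
# transient API errors, but (as noted above) it does not cover
# DependencyViolation, which botocore treats as a non-retryable error.
ec2 = boto3.client(
    'ec2',
    config=botocore.config.Config(retries={'max_attempts': 10, 'mode': 'standard'}),
)


def delete_security_group_when_free(group_id, attempts=6, delay_seconds=5):
    """Idea 2: explicitly retry DeleteSecurityGroup while its dependent
    objects (e.g. instances still shutting down, or rules in a peer
    security group) finish being cleaned up on the AWS side."""
    for attempt in range(attempts):
        try:
            ec2.delete_security_group(GroupId=group_id)
            return
        except botocore.exceptions.ClientError as e:
            if e.response['Error']['Code'] != 'DependencyViolation':
                raise
            if attempt == attempts - 1:
                raise  # still blocked after all attempts; surface the error
            time.sleep(delay_seconds)
```

The built-in retry settings only cover throttling and transient errors, so the explicit retry loop (or a custom waiter) is what would actually paper over a slow-to-clear security group dependency.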

mkhan037 commented 3 years ago

> Strange. If this happens to you again and you get some more detail on what the cause is, please post it here.

If I encounter this issue again, I will surely let you know. Thanks again for this awesome tool.

nchammas commented 3 years ago

Thanks for the report!