nchammas / flintrock

A command-line tool for launching Apache Spark clusters.
Apache License 2.0

Flintrock crashes if it can't delete security group #219

Open thagorx opened 6 years ago

thagorx commented 6 years ago

Flintrock crashes when I try to destroy a cluster. The problem occurs because I use the cluster security group to give the cluster access to an EFS drive. When I try to destroy the cluster, Flintrock crashes with an error message saying that it could not delete the security group because it is still used by another object. In my case, the cluster security group is still referenced by the security group used for the EFS.

nchammas commented 6 years ago

So you reference the Flintrock security group in another security group, right? I'm not sure how Flintrock can handle this case. Flintrock will never touch any resources it did not create itself, so it is strictly out of bounds to expect Flintrock to go and modify the rules in a non-Flintrock security group.

This doesn't seem like a problem to me. If you create a security group dependency that Flintrock does not manage, then it's on you to remove that dependency before trying to destroy the cluster. And it sounds like the error message you get when you don't do that is clear enough to suggest what the issue is.

Is there some other approach you think we should take here?

thagorx commented 6 years ago

Yes, I reference the Flintrock security group in another group. The problem is not that Flintrock fails to delete its own security group, but that it fails with an exception (see attachment). The result is that the EC2 instances do not get shut down. I can do it manually, but would it not be more graceful to simply display an error message and continue with the cluster shutdown?

Attachment: cluster destroy exception.txt

nchammas commented 6 years ago

There are some nitpicky details we need to keep in mind, but I think yes, it is possible to terminate the instances before trying to delete the security group.

Basically, the destroy procedure needs to do things in this order:

  1. Detach the cluster security group from instances.
  2. Terminate the instances.
  3. Delete the security group.
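
For anyone picking this up, the ordering above could be sketched roughly as follows. This is a minimal sketch and not Flintrock's actual code: `destroy_cluster` and its parameters are hypothetical names, `ec2` is assumed to behave like a boto3 EC2 client, and `fallback_group_id` is an assumed group (for example the VPC default) to swap in, since an instance in a VPC must keep at least one security group.

```python
def destroy_cluster(ec2, instance_ids, cluster_group_id, fallback_group_id):
    """Hypothetical helper illustrating the destroy order; not Flintrock code.

    `ec2` is assumed to behave like a boto3 EC2 client.
    Returns True if the security group was deleted, False if it was left behind.
    """
    # 1. Detach the cluster security group from the instances.
    #    Note: `Groups` replaces the instance's whole group list.
    for instance_id in instance_ids:
        ec2.modify_instance_attribute(
            InstanceId=instance_id,
            Groups=[fallback_group_id],
        )

    # 2. Terminate the instances and wait until they are gone.
    ec2.terminate_instances(InstanceIds=instance_ids)
    ec2.get_waiter('instance_terminated').wait(InstanceIds=instance_ids)

    # 3. Delete the security group last. If something outside the cluster
    #    still references it, report that instead of crashing.
    try:
        ec2.delete_security_group(GroupId=cluster_group_id)
    except Exception as e:
        # botocore's ClientError exposes the AWS error code via `e.response`.
        response = getattr(e, 'response', None) or {}
        code = response.get('Error', {}).get('Code')
        if code != 'DependencyViolation':
            raise
        print(
            "Instances terminated, but security group {} is still referenced "
            "by another resource and was not deleted.".format(cluster_group_id)
        )
        return False
    return True
```

With this ordering, a dependency on the security group from outside the cluster only affects the last step, so the instances are already terminated by the time the deletion fails.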

If you want to submit a PR to fix this, go ahead! I'd be happy to review it and guide you through the process. Otherwise, I'll put this on my list to tackle later.