mlevit / aws-auto-cleanup

Programmatically delete AWS resources based on an allowlist and time to live (TTL) settings
MIT License
496 stars, 55 forks

A SG in the cloudformation stack is unexpectedly deleted #110

Closed membra closed 1 year ago

membra commented 2 years ago

Describe the bug: We noticed that at some point a security group (SG) that was part of a CloudFormation stack was deleted while the rest of the stack remained.

This SG was marked as SKIP - IN USE during the destroy run. [screenshot]

At the same time, the logs reported it as being deleted because it had no associations. [screenshot]

We also see in CloudTrail that it was actually deleted. [screenshot]

At the same time the stack couldn't be deleted because of other resources in it that had dependencies.

As a result, developers who needed to redeploy the stack could not do so, because one of its resources had been deleted independently. They also could not delete and recreate the stack because of those dependencies.

There are several questions we have; could you please clarify them:

  1. Is it correct that resources are deleted through their own endpoints? In other words, if a resource is part of a CloudFormation stack, it is not deleted through the CloudFormation endpoint but through its own, independently of CloudFormation?
  2. Perhaps the cleanup of non-associated SGs should only act on resources not associated with CloudFormation? Actually, any similar cleanup should make that check. The tag is called aws:cloudformation:logical-id. That might increase processing time, of course, but the current approach leads to disruptive issues, and with a lot of accounts, as in our case, it is very hard to clean up manually after the cleanup. What can be done to improve the logic in that area?
  3. In the execution log the resource was marked as SKIP - IN USE, but it was actually deleted. Could there be a valid reason for that?
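The tag check suggested in question 2 could look roughly like the sketch below. It is not code from aws-auto-cleanup: the function names are hypothetical, and it assumes the input is shaped like the `SecurityGroups` list returned by boto3's `describe_security_groups`. It treats any tag key starting with `aws:cloudformation:` as a sign of stack membership, which only works for resource types that receive those tags.

```python
from typing import Iterable, List, Mapping, Optional, Tuple

# Reserved prefix CloudFormation uses for the tags it applies to stack
# resources (e.g. aws:cloudformation:logical-id).
CFN_TAG_PREFIX = "aws:cloudformation:"


def is_cloudformation_managed(tags: Optional[Iterable[Mapping[str, str]]]) -> bool:
    """Return True if any tag key carries the CloudFormation prefix."""
    return any(tag.get("Key", "").startswith(CFN_TAG_PREFIX) for tag in (tags or []))


def split_by_stack_membership(
    security_groups: Iterable[Mapping],
) -> Tuple[List[str], List[str]]:
    """Split group IDs into (stack-managed, standalone).

    Hypothetical helper: only the standalone list would be handed to the
    per-resource cleanup; stack-managed groups are left to CloudFormation.
    """
    managed: List[str] = []
    standalone: List[str] = []
    for sg in security_groups:
        target = managed if is_cloudformation_managed(sg.get("Tags")) else standalone
        target.append(sg["GroupId"])
    return managed, standalone
```

Because the tags are already part of the describe response, this check would not need an extra API call per group.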

Thank you

mlevit commented 2 years ago

Hey @membra,

  1. You're correct. Resources are independently deleted via their own API endpoints. The app will first attempt to delete the CloudFormation Stack before proceeding to delete all other resources. If the Stack was not deleted (for whatever reason) the resources will be deleted individually outside of the Stack.
  2. A potential solution is to check against CloudFormation Resources and only delete those not associated with a CloudFormation Stack. I could look into this over the next few days.
  3. This scenario does not seem possible. The log and the action taken (i.e., SKIP - IN USE) both originate from the same code section. In other words, if SKIP - IN USE was issued as the action, the log should be EC2 Security Group '{resource_id}' has a network association and cannot be deleted without deleting the association first. This is of course assuming this isn't a defect. Could you validate that the log and the execution log are from the exact same execution?
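Point 3 rests on the action and the log line being produced by the same branch. A minimal sketch of that shape follows, assuming the in-use check is done with EC2's `describe_network_interfaces` filtered by `group-id`. The function name, the `DELETE` action label, and the wording of the second message are hypothetical; only `SKIP - IN USE` and the first message come from this thread.

```python
from typing import Any, Tuple


def decide_security_group_action(ec2_client: Any, group_id: str) -> Tuple[str, str]:
    """Return (action, log_message) from a single decision point.

    ec2_client is expected to behave like boto3.client("ec2").
    Pagination is ignored here to keep the sketch short.
    """
    response = ec2_client.describe_network_interfaces(
        Filters=[{"Name": "group-id", "Values": [group_id]}]
    )
    if response.get("NetworkInterfaces"):
        # Action and message are returned together, so they cannot disagree
        # within one execution.
        return (
            "SKIP - IN USE",
            f"EC2 Security Group '{group_id}' has a network association and "
            "cannot be deleted without deleting the association first.",
        )
    # Hypothetical label and message for the delete branch.
    return (
        "DELETE",
        f"EC2 Security Group '{group_id}' has no network association and can be deleted.",
    )
```

With this shape, a SKIP - IN USE entry paired with a "no association" log line would suggest the two records come from different executions, or that the real code diverges from this sketch.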
membra commented 2 years ago

hey @mlevit thank you

  1. Thank you, please keep me updated. I am keen on this one.
  2. In the CloudWatch Logs screenshot I attached, ALL lines mentioning the SG say that it has no association and is to be deleted. There are many such lines because I ran dry_run mode multiple times before running destroy mode. The destroy execution log then says the SG was skipped because it is in use, but it was actually deleted.
mwgamble commented 1 year ago

@mlevit Would you be able to provide some information on why this issue won't be fixed? :pray:

mlevit commented 1 year ago

Hey @mwgamble sure thing.

  1. Cleanup of SGs is a lot more complex than I ever thought it would be. A lot of AWS resources are linked to SGs and each one needs to be cleaned up prior to removing the SG. I'd have to understand all the resources and incorporate their cleanups for this to properly work.
  2. I probably should have never added resource cleanup for resources like SGs, VPCs etc as the benefit isn't there. I should have only stuck with cleanup of expensive-to-run resources. My original intention was to clean up resources like SGs, VPCs etc just to keep the accounts neat and tidy, but the complexity of doing so makes fixing this problem not worth it.
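One concrete example of the linkage described in point 1 is a security group that is referenced by the rules of other security groups. The sketch below only illustrates that single dependency and is not code from the app. The function name is hypothetical, and the input is assumed to be shaped like the `SecurityGroups` list from boto3's `describe_security_groups`.

```python
from typing import Iterable, List, Mapping


def find_referencing_groups(
    security_groups: Iterable[Mapping], target_group_id: str
) -> List[str]:
    """Return IDs of other groups whose ingress or egress rules reference the target."""
    referencing: List[str] = []
    for sg in security_groups:
        if sg["GroupId"] == target_group_id:
            continue  # only references held by *other* groups are of interest
        rules = list(sg.get("IpPermissions", [])) + list(sg.get("IpPermissionsEgress", []))
        if any(
            pair.get("GroupId") == target_group_id
            for rule in rules
            for pair in rule.get("UserIdGroupPairs", [])
        ):
            referencing.append(sg["GroupId"])
    return referencing
```

Each referencing rule would need to be revoked before the target group could be removed, and similar checks would be needed for every other resource type that can hold a reference to a security group.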
mwgamble commented 1 year ago

Thanks for the explanation, I appreciate it :heart: