jckuester / awsweeper

A tool for cleaning your AWS account
Mozilla Public License 2.0

Sweep all resources not working #76

Closed aghassemlouei closed 4 years ago

aghassemlouei commented 4 years ago

When running awsweeper 0.4.1 on macOS 10.15.3 with the full all.yml service list against the AWS GovCloud regions us-gov-west-1 and us-gov-east-1, where resource counts exceed 800 per service, awsweeper hangs. The workaround is usually to trim the config file to a subset of services, or to run each service individually.

I had to break EBS, EIP, and security groups out into individual executions. It also appears that VPC peering connections and public IP associations make it difficult to delete VPCs.

jckuester commented 4 years ago

Hi @aghassemlouei again. Thanks for reporting the issue. I committed some bigger changes and fixes to master in the last week, but haven't released them yet. Can you check whether your problems still occur on master (or with v0.5.0, which I will release by tomorrow)?

aghassemlouei commented 4 years ago

Evening @jckuester,

Just ran the following steps and ran into similar issues but with incremental improvements:

curl -LO https://github.com/cloudetc/awsweeper/releases/download/v0.5.0/terradozer-0.5.0-darwin-amd64.tar.gz
tar -xzf terradozer-0.5.0-darwin-amd64.tar.gz
chmod +x terradozer-0.5.0-darwin-amd64/terradozer
cat > custom.yml << EOF
aws_ami:
aws_autoscaling_group:
aws_cloudformation_stack:
aws_ebs_snapshot:
aws_ebs_volume:
aws_efs_file_system:
aws_eip:
aws_elb:
aws_instance:
aws_internet_gateway:
aws_key_pair:
aws_kms_alias:
aws_kms_key:
aws_launch_configuration:
aws_nat_gateway:
aws_network_acl:
aws_network_interface:
aws_route53_zone:
aws_db_instance:
aws_route_table:
aws_s3_bucket:
aws_security_group:
aws_subnet:
aws_vpc:
aws_vpc_endpoint:
EOF
./terradozer-0.5.0-darwin-amd64/terradozer --region us-gov-west-1 --profile canary --dry-run custom.yml

When everything was executed at once, services wouldn't fully enumerate their resources; however, when broken out into smaller chunks, e.g., S3 buckets and RDS, things did work.
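
The chunking workaround described above can be scripted rather than done by hand. The following is a minimal sketch, assuming the flat config style used in this thread (one top-level `aws_...:` key per line). The helper names `resourceTypes`, `chunk`, and `render` are hypothetical and not part of awsweeper. Indented lines are ignored, so nested filters would not be carried over.

```go
package main

import (
	"fmt"
	"strings"
)

// resourceTypes extracts top-level keys such as "aws_vpc:" from a flat config.
// Indented lines (nested filters), blank lines, and comments are skipped.
func resourceTypes(config string) []string {
	var types []string
	for _, line := range strings.Split(config, "\n") {
		if strings.HasPrefix(line, " ") || strings.HasPrefix(line, "\t") {
			continue
		}
		line = strings.TrimSpace(line)
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		if strings.HasSuffix(line, ":") {
			types = append(types, strings.TrimSuffix(line, ":"))
		}
	}
	return types
}

// chunk splits types into groups of at most size entries, preserving order.
func chunk(types []string, size int) [][]string {
	if size < 1 {
		size = 1
	}
	var out [][]string
	for len(types) > 0 {
		n := size
		if len(types) < n {
			n = len(types)
		}
		out = append(out, types[:n])
		types = types[n:]
	}
	return out
}

// render turns one group back into config text, one key per line.
func render(group []string) string {
	var b strings.Builder
	for _, t := range group {
		b.WriteString(t + ":\n")
	}
	return b.String()
}

func main() {
	config := "aws_ebs_volume:\naws_eip:\naws_security_group:\naws_vpc:\n"
	for i, group := range chunk(resourceTypes(config), 2) {
		fmt.Printf("--- chunk-%d.yml ---\n%s", i, render(group))
	}
}
```

Each rendered chunk could then be written to its own file and passed to a separate run. A chunk size of 1 corresponds to running each service individually.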

At least the S3 executions seem to be effective now, so I closed out #71. When I let the execution run over the weekend, the VPC peering connections were apparently throwing awsweeper/Terraform for a loop with dependencies that couldn't be broken, so that may also be something to take into consideration if folks just import the all.yml and execute it.

Thanks again for the quick release; hopefully this data is useful and not bothersome!

jckuester commented 4 years ago

Thanks for your feedback. I haven't tested awsweeper at scale yet and your insights are very interesting and helpful - I'll do my best to improve your experience with the tool. Let's go into more detail about what you experienced:

jckuester commented 4 years ago

Hmm, I just looked into the code for how Terraform deletes a VPC (see below). In the case you described, the error is a DependencyViolation (because a VPC peering connection is still attached), so Terraform will retry the delete for 5 minutes. This is not really what we want, and unfortunately the max_retries parameter mentioned above will not help here.

    err := resource.Retry(5*time.Minute, func() *resource.RetryError {
        _, err := conn.DeleteVpc(deleteVpcOpts)
        if err == nil {
            return nil
        }

        if isAWSErr(err, "InvalidVpcID.NotFound", "") {
            return nil
        }
        if isAWSErr(err, "DependencyViolation", "") {
            return resource.RetryableError(err)
        }
        return resource.NonRetryableError(fmt.Errorf("Error deleting VPC: %s", err))
    })
    if isResourceTimeoutError(err) {
        _, err = conn.DeleteVpc(deleteVpcOpts)
        if isAWSErr(err, "InvalidVpcID.NotFound", "") {
            return nil
        }
    }

jckuester commented 4 years ago

Hi @aghassemlouei again. I thought about the problem again and came up with a solution. Let me know what you think.

awsweeper can now be run with a timeout for the delete operation, e.g., awsweeper --timeout 1s config.yml.

This way, if a VPC or any other resource still has a dependency, the delete times out after, for example, 1s (the default is 20s). Here is what the output looks like:

   • SHOWING RESOURCES THAT WOULD BE DELETED (DRY RUN)

    ---
    Type: aws_vpc
    Found: 1

        Id:     vpc-1234
        Tags:       [Name: foo] 

    ---

   • TOTAL NUMBER OF RESOURCES THAT WOULD BE DELETED: 1
      • Are you sure you want to delete these resources (cannot be undone)? Only YES will be accepted.
        Enter a value: YES
   • STARTING TO DELETE RESOURCES
      • will retry to delete resource                      id=vpc-1234 type=aws_vpc
   • FAILED TO DELETE THE FOLLOWING RESOURCES (RETRIES EXCEEDED): 1
      • aws_vpc                                            error=destroy timed out (1s) id=vpc-1234
   • TOTAL NUMBER OF DELETED RESOURCES: 0

aghassemlouei commented 4 years ago

This worked significantly better, if for nothing else than the feedback presented to the end user. Syntax provided for posterity:

curl -LO https://github.com/cloudetc/awsweeper/releases/download/v0.7.0/awsweeper_0.7.0_darwin_amd64.tar.gz
tar -xzf awsweeper_0.7.0_darwin_amd64.tar.gz 
chmod +x awsweeper_0.7.0_darwin_amd64/awsweeper 
cat > custom.yml << EOF
aws_ami:
aws_autoscaling_group:
aws_cloudformation_stack:
aws_ecs_cluster:
aws_ebs_snapshot:
aws_ebs_volume:
aws_efs_file_system:
aws_eip:
aws_elb:
aws_iam_instance_profile:
aws_iam_role:
aws_instance:
aws_internet_gateway:
aws_key_pair:
aws_kms_alias:
aws_kms_key:
aws_lambda_function:
aws_launch_configuration:
aws_nat_gateway:
aws_network_acl:
aws_network_interface:
aws_db_instance:
aws_route53_zone:
aws_route_table:
aws_s3_bucket:
aws_security_group:
aws_subnet:
aws_vpc:
aws_vpc_endpoint:
EOF
./awsweeper_0.7.0_darwin_amd64/awsweeper --region us-gov-west-1 --profile core --timeout 1s custom.yml

The failure conditions were much clearer, with a faster turnaround. The only cosmetic bit of feedback concerns the AWS-managed IAM roles and the KMS keys. Terraform complains about them, but it's definitely a non-issue:

error deleting IAM Role (AWSServiceRoleForSupport) policy attachments: Error deleting IAM Role AWSServiceRoleForSupport: UnmodifiableEntity: Cannot perform the operation on the protected role 'AWSServiceRoleForSupport' - this role is only modifiable by AWS

AccessDeniedException: User: arn:aws-us-gov:iam::123456789:user/aghassemlouei is not authorized to perform: kms:ScheduleKeyDeletion on resource: arn:aws-us-gov:kms:us-gov-west-1:123456789:key/1234567-1234-1234-1234-1234567
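
The first of these errors could be avoided by skipping AWS service-linked roles before attempting deletion. Such roles use the name prefix AWSServiceRoleFor and the path prefix /aws-service-role/. The following is a minimal sketch of such a filter, not existing awsweeper behaviour; the `role` type and both function names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// role holds the two fields this sketch needs (hypothetical type).
type role struct {
	Name string
	Path string
}

// isServiceLinkedRole reports whether a role looks like an AWS service-linked
// role, judged by its name prefix or its path prefix.
func isServiceLinkedRole(r role) bool {
	return strings.HasPrefix(r.Name, "AWSServiceRoleFor") ||
		strings.HasPrefix(r.Path, "/aws-service-role/")
}

// deletable returns only the roles that are not service-linked.
func deletable(roles []role) []role {
	var out []role
	for _, r := range roles {
		if !isServiceLinkedRole(r) {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	roles := []role{
		{Name: "AWSServiceRoleForSupport", Path: "/aws-service-role/support.amazonaws.com/"},
		{Name: "my-app-role", Path: "/"},
	}
	for _, r := range deletable(roles) {
		fmt.Println(r.Name)
	}
}
```

The KMS error is different in kind: it is a permissions failure for the calling user, so filtering by name would not help there.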

Closing this out as the major issues have been addressed; thanks for all your hard work!