mlevit / aws-auto-cleanup

Programmatically delete AWS resources based on an allowlist and time to live (TTL) settings
MIT License
495 stars 54 forks source link

RDS snapshots are not being cleaned up #111

Closed atqhg23 closed 2 years ago

atqhg23 commented 2 years ago

Describe the bug RDS snapshots are not being deleted. The lambda logs show that the cleanup of RDS snapshots started and finished, but the RDS snapshots are not deleted. Tried adding a print statement to the for loop located here (https://github.com/servian/aws-auto-cleanup/blob/main/app/src/rds_cleanup.py#L156) and it is not being printed, so it is not getting here.

To Reproduce Steps to reproduce the behavior:

  1. Enable RDS snapshot cleanup
  2. Run the cleanup and check if RDS snapshots are deleted

Expected behavior RDS snapshots older then the date specified should be deleted

AWS (please complete the following information):

mlevit commented 2 years ago

Hey @atqhg23 thanks for raising this. This cleanup only covers manual snapshots taken of RDS instances. As you can see on line 147 we specifically filter our snapshot types.

This is generally done because automatic RDS snapshots should already be cleaned up automatically by AWS based on your snapshot retention period.

Can you please validate this and come back to me with your findings.

atqhg23 commented 2 years ago

Thanks for the response. I confirmed that all of the snapshots in the account are manual under Snapshot type, but they're still not showing up.

When checking the lambda logs, it just mentions it started and finished the cleanup of RDS snapshots.

The execution log is also not showing any RDS snapshots.

mlevit commented 2 years ago

@atqhg23 could you please run the following CLI command for me and show me the results. Please remove any PII data first:

aws rds describe-db-snapshots --region us-east-1 --output json
atqhg23 commented 2 years ago

Output below. Interesting to point out is that this command is only showing automated snapshots even though I have a few manual snapshots in my account as well.

{
    "DBSnapshots": [
        {
            "DBSnapshotIdentifier": "rds:rds-ex-123",
            "DBInstanceIdentifier": "rds-ex-456",
            "SnapshotCreateTime": "2022-05-02T17:03:20.585000+00:00",
            "Engine": "mysql",
            "AllocatedStorage": 5,
            "Status": "available",
            "Port": 3306,
            "AvailabilityZone": "us-east-1b",
            "VpcId": "vpc-123",
            "InstanceCreateTime": "2022-05-02T16:51:24.366000+00:00",
            "MasterUsername": "masteruserhere",
            "EngineVersion": "5.7.37",
            "LicenseModel": "general-public-license",
            "SnapshotType": "automated",
            "OptionGroupName": "default:mysql-5-7",
            "PercentProgress": 100,
            "StorageType": "gp2",
            "Encrypted": true,
            "KmsKeyId": "arn:aws:kms:us-east-1:123456123456:key/abcdef-abcdef-abcdef",
            "DBSnapshotArn": "arn:aws:rds:us-east-1:123456123456:snapshot:rds:rds-ex-123",
            "IAMDatabaseAuthenticationEnabled": false,
            "ProcessorFeatures": [],
            "DbiResourceId": "db-abcdefabcdefabcdef",
            "TagList": [
                {
                    "Key": "key",
                    "Value": "val"
                }
            ],
            "OriginalSnapshotCreateTime": "2022-05-02T17:03:20.585000+00:00",
            "SnapshotTarget": "region"
        },
        {
            "DBSnapshotIdentifier": "rds:rds-ex-456-2022-05-03-04-08",
            "DBInstanceIdentifier": "rds-ex-456",
            "SnapshotCreateTime": "2022-05-03T04:08:15.512000+00:00",
            "Engine": "mysql",
            "AllocatedStorage": 5,
            "Status": "available",
            "Port": 3306,
            "AvailabilityZone": "us-east-1b",
            "VpcId": "vpc-123",
            "InstanceCreateTime": "2022-05-02T16:51:24.366000+00:00",
            "MasterUsername": "masteruserhere",
            "EngineVersion": "5.7.37",
            "LicenseModel": "general-public-license",
            "SnapshotType": "automated",
            "OptionGroupName": "default:mysql-5-7",
            "PercentProgress": 100,
            "StorageType": "gp2",
            "Encrypted": true,
            "KmsKeyId": "arn:aws:kms:us-east-1:123456123456:key/abcdef-abcdef-abcdef",
            "DBSnapshotArn": "arn:aws:rds:us-east-1:123456123456:snapshot:rds:rds-ex-456-2022-05-03-04-08",
            "IAMDatabaseAuthenticationEnabled": false,
            "ProcessorFeatures": [],
            "DbiResourceId": "db-abcdefabcdefabcdef",
            "TagList": [
                {
                    "Key": "key",
                    "Value": "val"
                }
            ],
            "OriginalSnapshotCreateTime": "2022-05-03T04:08:15.512000+00:00",
            "SnapshotTarget": "region"
        },
        {
            "DBSnapshotIdentifier": "rds:rds-bcd-456-2022-05-02-17-02",
            "DBInstanceIdentifier": "rds-bcd-456",
            "SnapshotCreateTime": "2022-05-02T17:02:08.365000+00:00",
            "Engine": "postgres",
            "AllocatedStorage": 5,
            "Status": "available",
            "Port": 5432,
            "AvailabilityZone": "us-east-1b",
            "VpcId": "vpc-123",
            "InstanceCreateTime": "2022-05-02T16:50:52.263000+00:00",
            "MasterUsername": "masteruserhere",
            "EngineVersion": "11.13",
            "LicenseModel": "postgresql-license",
            "SnapshotType": "automated",
            "OptionGroupName": "default:postgres-11",
            "PercentProgress": 100,
            "StorageType": "gp2",
            "Encrypted": true,
            "KmsKeyId": "arn:aws:kms:us-east-1:123456123456:key/abcdef-abcdef-abcdef",
            "DBSnapshotArn": "arn:aws:rds:us-east-1:123456123456:snapshot:rds:rds-bcd-456-2022-05-02-17-02",
            "IAMDatabaseAuthenticationEnabled": false,
            "ProcessorFeatures": [],
            "DbiResourceId": "db-AAABBBCCCDDDD",
            "TagList": [
                {
                    "Key": "key",
                    "Value": "val"
                }
            ],
            "OriginalSnapshotCreateTime": "2022-05-02T17:02:08.365000+00:00",
            "SnapshotTarget": "region"
        },
        {
            "DBSnapshotIdentifier": "rds:rds-bcd-456-2022-05-03-04-14",
            "DBInstanceIdentifier": "rds-bcd-456",
            "Engine": "postgres",
            "AllocatedStorage": 5,
            "Status": "creating",
            "Port": 5432,
            "AvailabilityZone": "us-east-1b",
            "VpcId": "vpc-123",
            "InstanceCreateTime": "2022-05-02T16:50:52.263000+00:00",
            "MasterUsername": "masteruserhere",
            "EngineVersion": "11.13",
            "LicenseModel": "postgresql-license",
            "SnapshotType": "automated",
            "OptionGroupName": "default:postgres-11",
            "PercentProgress": 1,
            "StorageType": "gp2",
            "Encrypted": true,
            "KmsKeyId": "arn:aws:kms:us-east-1:123456123456:key/abcdef-abcdef-abcdef",
            "DBSnapshotArn": "arn:aws:rds:us-east-1:123456123456:snapshot:rds:rds-bcd-456-2022-05-03-04-14",
            "IAMDatabaseAuthenticationEnabled": false,
            "ProcessorFeatures": [],
            "DbiResourceId": "db-AAABBBCCCDDDD",
            "TagList": [
                {
                    "Key": "key",
                    "Value": "val"
                }
            ],
            "SnapshotTarget": "region"
        }
    ]
}
atqhg23 commented 2 years ago

Ah I see the issue my bad, I just confirmed that this is actually working as expected.

The only manual snapshots I had were of Aurora instances which are Cluster snapshots. I created a manual snapshot of a MySQL instance and this snapshot was shown in the output of the command you mentioned above, and I confirmed the cleanup deleted this snapshot.

Any thoughts on expanding the cleanup to include RDS clusters and cluster snapshots?

Thanks again for the help with this.

mlevit commented 2 years ago

Shouldn't be too hard. Should be able to add support in the next few days 🤞

atqhg23 commented 2 years ago

Thank you! I can test this as soon as it's available

mlevit commented 2 years ago

Give it a go, https://github.com/servian/aws-auto-cleanup/tree/rds-clusters

atqhg23 commented 2 years ago

Thanks! I just tested this and validated that both clusters and cluster snapshots are getting deleted.

I had a question about one thing though. A cluster can't be deleted if it contains DB instances which is fine since the next cleanup run can delete the clusters.

This is the error from CloudTrail that came up when trying to delete a cluster with an instance in it.

"errorCode": "InvalidDBClusterStateFault",

"errorMessage": "Cluster cannot be deleted, it still contains DB instances in non-deleting state.",

But if a cluster has scaling enabled, I think these clusters would never get deleted since they would scale back up allowing these clusters to remain. Would it be possible to remove scaling from the clusters that are targeted as part of the cleanup allowing the next cleanup run to delete the clusters?

mlevit commented 2 years ago

Oh right, wasn't aware that a Cluster actually contains DB Instances. Makes sense. What we can do then is run the cleanup of DB Instances first, then Clusters. This will create a bit of complexity whereby you can whitelist the Instance and therefore prevent the Cluster from ever being deleted. You of course might not understand this behaviour.

The other way to approach this is, if the Cluster is to be deleted, we delete all the DB Instances within the Cluster first then the Cluster.

Thoughts?

atqhg23 commented 2 years ago

I think the second approach would be better, but I imagine the same thing would happen where if an instance from a cluster is whitelisted, then the cluster won't be deleted, but I think that makes sense since I would think if a user is whitelisting an instance inside a cluster, they would also want to keep the cluster.

mlevit commented 2 years ago

However, the Cluster can have multiple Instances, so do we keep only the ones allowlisted or check if any are before deleting or skipping?

atqhg23 commented 2 years ago

Good point, to be honest I'm not sure about this.

Did some brief research and found some info that may be useful:

Link: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Aurora.CreateInstance.html

By default, an Aurora DB cluster contains a primary DB instance that performs 
reads and writes, and, optionally, up to 15 Aurora Replicas (reader DB instances)

More info:

I'm thinking if a cluster is allowlisted, all the instances within should be allowlisted as well, and vice versa where if an instance within a cluster is allowlisted, the cluster itself should be allowlisted which would allowlist all instances within it.

With the rest of the instances being reader instances though, this makes me lean towards just keeping the instances that are allowlisted and deleting the rest within the cluster.

atqhg23 commented 2 years ago

Thinking about this some more, if I'm thinking about this right, there would be 2 scenarios:

Cluster with scaling enabled

Cluster with scaling disabled

All this to say that I think only keeping the allowlisted instances makes the most sense since it'll have a cost saving impact for clusters without scaling, and won't really impact clusters with scaling enabled since the scaling policy takes over.

atqhg23 commented 2 years ago

Got some insight from someone more knowledgeable with RDS and they mentioned that whitelisting should happen at the cluster level, either keep or remove everything since it could get tricky to try to remove specific instances from a cluster

mlevit commented 2 years ago

Thanks for that insight @atqhg23. I've made a change to the code to remove DB Instances present within the Cluster before removing the Cluster.

Can you pull the latest code and validate it?

atqhg23 commented 2 years ago

Thanks! I just tested this and here is what I found:

mlevit commented 2 years ago

Thanks very much for that. Looks like it's working as expected. I'll add some logging for the deletion of the DB Instances within the Cluster.

To your third point regarding allowlisting DB Instances, I went off your last comment which was, if the Cluster is not allowlisted, then all DB Instances within the Cluster are deleted no matter if they are allowlisted or not.

I think this is much easier to understand for everyone. Allowlist a DB Instance if it's not Cluster-based, or allowlist the Cluster.

atqhg23 commented 2 years ago

Awesome, thanks, I can test the logging

Yeah that makes sense

mlevit commented 2 years ago

The logging is there, just as DEBUG not INFO. I think DEBUG is enough as we're cleaning up the Cluster not the Instances here.

atqhg23 commented 2 years ago

Ah I thought you were referring to adding the action in the execution log for RDS instances inside a cluster. In the website, the action field in the execution log is empty for instances that are part of a cluster.

One other thing to mention which may not be that big of a deal is that the website is currently allowing users to add RDS instances that are part of a cluster to the allow list from the execution log which may confuse users thinking that their RDS instance that's part of a cluster will be whitelisted when it actually depends on whether the cluster is allowlisted or not.

mlevit commented 2 years ago

I changed the behaviour of the DB Instance cleanup to ignore Instances that are part of a Cluster.

See https://github.com/servian/aws-auto-cleanup/blob/rds-clusters/app/src/rds_cleanup.py#L269

Since we now added Cluster cleanup, DB Instances within Clusters will only be cleaned up as part of the Clusters cleanup.

It should resolve the issues you raised.

atqhg23 commented 2 years ago

I see, so this is working as intended, but it leaves the instances that are part of a cluster with an empty action since there is no resource_action for instances that are part of clusters. I think it would be solved by adding a resource_action for the instances in the clusters function. Testing this on my end at the moment

mlevit commented 2 years ago

It shouldn't. The execution log code only runs when a DB Instance is not part of a cluster.

https://github.com/servian/aws-auto-cleanup/blob/rds-clusters/app/src/rds_cleanup.py#L322

Maybe you have an older version before I fixed that :)

atqhg23 commented 2 years ago

Hmmm interesting, my code matches what's in the link above, but getting the same (picture below). An action is mentioned for individual instances and RDS clusters, but it's blank for all instances that are part of a cluster image

will look more into this, thanks again for all the help

atqhg23 commented 2 years ago

Ok I think I got it, I hope this helps better explain the issue. I added the two lines below above this line https://github.com/servian/aws-auto-cleanup/blob/rds-clusters/app/src/rds_cleanup.py#L269, and now an action is being shown for the instances that are part of a cluster.

if resource.get("DBClusterIdentifier"):
    resource_action = "PART OF CLUSTER"

This https://github.com/servian/aws-auto-cleanup/blob/rds-clusters/app/src/rds_cleanup.py#L269 was setting the resources action for instances outside of a cluster only, but now the resource_action gets set for instances within a cluster as well

mlevit commented 2 years ago

Thanks for that, but the issue here is that DB Instances within a cluster should be completely ignored during the cleanup of DB Instances. They should never appear in the execution log in the first place.

Let me run my own tests on this to validate just in case.

atqhg23 commented 2 years ago

Oh I see what you meant now, missed the mark on that, yeah that would be much better, thanks

atqhg23 commented 2 years ago

Hi, just following up, did you get a chance to test this on your end?

mlevit commented 2 years ago

I did, looks like it works as expected. Only the Cluster is showing up in the execution logs.

atqhg23 commented 2 years ago

Thanks, I see the issue, my code did not have the execution log shifted over.

Thanks again for all the help with this!