mitodl / ol-infrastructure

Infrastructure automation code for use by MIT Open Learning
BSD 3-Clause "New" or "Revised" License

Periodic Task to clean up AMI #669

Ardiea closed 1 year ago

Ardiea commented 2 years ago

Continues #496

Ardiea commented 2 years ago

2022-03-31 13:23:46,055: custodian.policy:INFO policy:find-mitodl-ltv resource:aws.launch-template-version region:us-east-1 count:847 time:0.06
2022-03-31 13:23:50,161: custodian.policy:INFO policy:find-and-mark-mitodl-ami resource:aws.ami region:us-east-1 count:331 time:4.06
2022-03-31 13:23:50,177: custodian.actions:INFO Tagging 331 resources for deregister on 2022/04/30
2022-03-31 13:23:53,757: custodian.policy:INFO policy:find-and-mark-mitodl-ami action:tagdelayedaction resources:331 execution_time:3.58
2022-03-31 13:23:53,866: custodian.policy:INFO policy:find-and-mark-mitodl-ami-snapshots resource:aws.ebs-snapshot region:us-east-1 count:331 time:0.11
2022-03-31 13:23:53,873: custodian.actions:INFO Tagging 331 resources for delete on 2022/04/30
2022-03-31 13:23:57,350: custodian.policy:INFO policy:find-and-mark-mitodl-ami-snapshots action:tagdelayedaction resources:331 execution_time:3.48
2022-03-31 13:23:57,448: custodian.policy:INFO policy:find-and-mark-mitodl-ami-volumes resource:aws.ebs region:us-east-1 count:240 time:0.10
2022-03-31 13:23:57,452: custodian.actions:INFO Tagging 240 resources for delete on 2022/04/30
2022-03-31 13:24:00,485: custodian.policy:INFO policy:find-and-mark-mitodl-ami-volumes action:tagdelayedaction resources:240 execution_time:3.03

Ardiea commented 1 year ago

@blarghmatey

The deed is done for this as originally written. It takes a few runs because Custodian automatically filters out snapshots that are still associated with an AMI, and we deregister AMIs last. Kinda confusing.

Run 1: Wants to delete a ton of snapshots but doesn't. Does deregister a ton of AMIs.

2023-01-05 18:41:41,417: custodian.policy:INFO policy:delete-marked-ebs-volumes resource:aws.ebs region:us-east-1 count:0 time:0.96
2023-01-05 18:42:04,779: custodian.policy:INFO policy:delete-marked-ebs-snapshots resource:aws.ebs-snapshot region:us-east-1 count:2025 time:23.36
2023-01-05 18:42:15,280: custodian.ebs:INFO Deleting 0 snapshots, auto-filtered 2025 ami-snapshots
2023-01-05 18:42:15,294: custodian.policy:INFO policy:delete-marked-ebs-snapshots action:snapshotdelete resources:2025 execution_time:10.37
2023-01-05 18:42:16,018: custodian.policy:INFO policy:delete-marked-ami resource:aws.ami region:us-east-1 count:1797 time:0.72
2023-01-05 18:48:50,463: custodian.policy:INFO policy:delete-marked-ami action:deregister resources:1797 execution_time:394.14

Run 2: Wants to delete the same batch of snapshots and is successful this time. No AMIs to deregister; got them the first time around.

2023-01-05 18:52:15,029: custodian.policy:INFO policy:delete-marked-ebs-volumes resource:aws.ebs region:us-east-1 count:0 time:0.84
2023-01-05 18:52:33,783: custodian.policy:INFO policy:delete-marked-ebs-snapshots resource:aws.ebs-snapshot region:us-east-1 count:2025 time:18.75
2023-01-05 18:52:55,313: custodian.ebs:INFO Deleting 2025 snapshots, auto-filtered 0 ami-snapshots
2023-01-05 18:56:54,288: custodian.policy:INFO policy:delete-marked-ebs-snapshots action:snapshotdelete resources:2025 execution_time:260.36
2023-01-05 18:56:54,643: custodian.policy:INFO policy:delete-marked-ami resource:aws.ami region:us-east-1 count:0 time:0.22

Run 3: Nothing happens.

2023-01-05 19:23:46,787: custodian.policy:INFO policy:delete-marked-ebs-volumes resource:aws.ebs region:us-east-1 count:0 time:0.83
2023-01-05 19:24:03,913: custodian.policy:INFO policy:delete-marked-ebs-snapshots resource:aws.ebs-snapshot region:us-east-1 count:0 time:17.12
2023-01-05 19:24:13,378: custodian.policy:INFO policy:delete-marked-ami resource:aws.ami region:us-east-1 count:0 time:9.46
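
The three runs above can be reproduced with a toy model. This is only a sketch of the ordering effect, not Custodian code: the `Account` class, `run_once`, and the sample IDs are hypothetical, and the only assumption taken from the logs is that the snapshot policy runs before the AMI policy and skips snapshots that still back a registered AMI.

```python
from dataclasses import dataclass, field


@dataclass
class Account:
    """Toy model of marked resources; not real AWS or Custodian state."""

    amis: dict = field(default_factory=dict)      # AMI id -> backing snapshot id
    snapshots: set = field(default_factory=set)   # marked snapshot ids


def run_once(account):
    """One pass in the logged order: snapshots first, AMIs last.

    Returns (snapshots_deleted, snapshots_filtered, amis_deregistered).
    """
    in_use = set(account.amis.values())
    # Snapshots still backing a registered AMI are skipped, mirroring the
    # "auto-filtered N ami-snapshots" log line.
    deletable = account.snapshots - in_use
    filtered = len(account.snapshots & in_use)
    account.snapshots -= deletable
    # AMIs are deregistered only after the snapshot policy has already run.
    deregistered = len(account.amis)
    account.amis.clear()
    return len(deletable), filtered, deregistered


account = Account(
    amis={"ami-1": "snap-1", "ami-2": "snap-2"},
    snapshots={"snap-1", "snap-2"},
)
runs = [run_once(account) for _ in range(3)]
print(runs)  # [(0, 2, 2), (2, 0, 0), (0, 0, 0)]
```

The first pass filters every snapshot and deregisters the AMIs, the second pass deletes the snapshots, and the third finds nothing, which matches the pattern in the logs.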

So, based on the original custodian policies I created about a year ago, this is working as expected, BUT there is a disconnect we didn't consider: the original policies only consider resources that were, at one time, associated with one of our (currently existing) launch templates.

There are at least two circumstances I can think of right away where things will slip through and become orphaned:

  1. Packer builds that were never associated with a running ASG/Launch Template. I can see this happening A LOT.
  2. Env destruction (the currently existing launch template doesn't retain the history of its linked AMIs). This seems like something that happens less frequently but is still possible.

I'm sure there are other circumstances as well. So there is still work to do here, and it will possibly need to be more nuanced in how it identifies resources for cleanup.
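
One possible starting point for catching the two cases above is to invert the lookup: list every AMI we own and flag the ones that no existing launch template version references. The sketch below is an assumption about how that could look, not something from the current policies. `find_orphaned_amis` and the sample IDs are hypothetical, and the inputs are plain dicts shaped like entries from the EC2 DescribeImages and DescribeLaunchTemplateVersions responses.

```python
def find_orphaned_amis(images, template_versions):
    """Return IDs of images that no launch template version references."""
    referenced = {
        version.get("LaunchTemplateData", {}).get("ImageId")
        for version in template_versions
    }
    # Template versions that do not set an image contribute None; drop it.
    referenced.discard(None)
    return sorted(
        image["ImageId"] for image in images if image["ImageId"] not in referenced
    )


# Hypothetical sample data standing in for the two API responses.
images = [
    {"ImageId": "ami-in-use"},
    {"ImageId": "ami-packer-never-launched"},
    {"ImageId": "ami-from-destroyed-env"},
]
template_versions = [
    {"LaunchTemplateData": {"ImageId": "ami-in-use"}},
    {"LaunchTemplateData": {}},  # a version that does not set an image
]

orphans = find_orphaned_amis(images, template_versions)
print(orphans)  # ['ami-from-destroyed-env', 'ami-packer-never-launched']
```

A real version would also need to exclude AMIs that running instances use outside of any launch template, and probably apply an age threshold so fresh Packer builds are not flagged right away.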