Closed ingalls closed 1 year ago
Would the clean-up apply to stacks that may be executing as intended but are just taking a long time because they include a lot of images? Or is this strictly for failed/finished stacks that somehow didn't get deleted when their queue count dropped to zero?
@nathanielrindlaub I added alarm checking, so it will only delete a stack that has existed for more than 24 hours and whose alarm state has remained in INSUFFICIENT_DATA, which should not be true for stacks that are actively processing large batches.
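For reference, here is a rough sketch of the check described above, not the code in this PR. The `StackInfo` shape, the `shouldDeleteStack` and `selectStacksToDelete` helpers, and the `MAX_AGE_MS` constant are hypothetical names used for illustration; in practice the creation time and alarm state would presumably come from the CloudFormation and CloudWatch responses.

```typescript
// Hypothetical shapes for illustration only; not taken from the PR.
type AlarmState = "OK" | "ALARM" | "INSUFFICIENT_DATA";

interface StackInfo {
  name: string;
  createdAt: Date;        // when the stack was created
  alarmState: AlarmState; // current state of the stack's alarm
}

// 24 hours, per the threshold mentioned in the comment above.
const MAX_AGE_MS = 24 * 60 * 60 * 1000;

// A stack is a clean-up candidate only if it is older than 24 hours
// AND its alarm is in INSUFFICIENT_DATA.
function shouldDeleteStack(stack: StackInfo, now: Date): boolean {
  const ageMs = now.getTime() - stack.createdAt.getTime();
  return ageMs > MAX_AGE_MS && stack.alarmState === "INSUFFICIENT_DATA";
}

// Returns the names of the stacks that pass the check.
function selectStacksToDelete(stacks: StackInfo[], now: Date): string[] {
  return stacks.filter((s) => shouldDeleteStack(s, now)).map((s) => s.name);
}
```

Under this check, a stack that is still working through a large batch would be skipped, as long as its alarm has left INSUFFICIENT_DATA, no matter how long it has been running.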
I have also updated the batch size to 10. Inference will still work as-is (sequential inferencing is supported), but it will be much faster once https://github.com/tnc-ca-geo/animl-api/pull/100 is merged and deployed.
Context
Every hour, an EventBridge cron fires to remove stacks that got stuck in strange error states and failed to clean up. One example would be a stack that was created but never populated with images to process.
This PR adds the actual "action" to this EventBridge cron: it lists all CloudFormation stacks and then filters them by the following:
STAGE
cc/ @nathanielrindlaub The open question here is how long a stack should be allowed to run/exist before it is cleaned up.