Stack Stragglers

ingalls commented 1 year ago

Context

Every hour an EventBridge cron fires to remove stacks that got stuck in strange error states and failed to clean up. One example is a stack that was created but never populated with images to process.

This PR adds the actual "action" to this EventBridge cron by listing all CloudFormation stacks and then keeping only those that meet all of the following criteria:

- Is a Batch Stack belonging to the given STAGE
- Has not already been deleted
- Is older than X (currently 24 hours)
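
For reference, the three filters above could be sketched roughly as follows. This is only an illustration, not the code in this PR: it assumes stack summaries shaped like the `StackSummaries` entries returned by CloudFormation's `ListStacks` (`StackName`, `StackStatus`, `CreationTime`), and the `animl-ingest-{stage}-batch-` name prefix is a hypothetical convention used to identify Batch Stacks for a stage.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(hours=24)  # "X" in the list above; currently 24 hours


def batch_prefix(stage: str) -> str:
    # Hypothetical naming convention, assumed for illustration only.
    return f"animl-ingest-{stage}-batch-"


def is_straggler(stack: dict, stage: str, now: datetime) -> bool:
    """Apply the three filters to one CloudFormation stack summary."""
    belongs_to_stage = stack["StackName"].startswith(batch_prefix(stage))
    not_deleted = stack["StackStatus"] != "DELETE_COMPLETE"
    too_old = now - stack["CreationTime"] > MAX_AGE
    return belongs_to_stage and not_deleted and too_old


def find_stragglers(stacks: list, stage: str, now: datetime = None) -> list:
    """Return the names of stacks that pass every filter."""
    now = now or datetime.now(timezone.utc)
    return [s["StackName"] for s in stacks if is_straggler(s, stage, now)]
```

Keeping the filter as a pure function over already-listed stacks makes the age threshold easy to change and to test without touching AWS.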

cc/ @nathanielrindlaub The open question here is how long a stack should be allowed to run/exist before it is cleaned up.

nathanielrindlaub commented 1 year ago

Would the clean-up apply to stacks that may be executing as intended but are just taking a long time because they include a lot of images? Or is this strictly for failed/finished stacks that somehow didn't get deleted when their queue count dropped to zero?

ingalls commented 1 year ago

@nathanielrindlaub I added alarm checking, so it will only delete a stack that has been around for >24 hours with an alarm state that has remained in INSUFFICIENT_DATA, which should not be true for stacks that are actively processing large batches.
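
A rough sketch of that guard, for illustration only and not the code in this PR: it assumes alarm records shaped like the `MetricAlarms` entries returned by CloudWatch's `DescribeAlarms` (`StateValue`, `StateUpdatedTimestamp`). Treating "remained in INSUFFICIENT_DATA" as "currently INSUFFICIENT_DATA with no state change in the last 24 hours" is an assumption, as is refusing to delete a stack that has no alarms at all.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(hours=24)


def alarm_is_idle(alarm: dict, now: datetime) -> bool:
    """True if the alarm sits in INSUFFICIENT_DATA and has not changed state recently."""
    return (
        alarm["StateValue"] == "INSUFFICIENT_DATA"
        and now - alarm["StateUpdatedTimestamp"] > MAX_AGE
    )


def safe_to_delete(stack_created: datetime, alarms: list, now: datetime) -> bool:
    """Delete only stacks older than MAX_AGE whose alarms are all idle."""
    old_enough = now - stack_created > MAX_AGE
    # all() is True for an empty list, so explicitly require at least one alarm
    # (an assumption: a stack with no alarm data is left alone).
    return old_enough and bool(alarms) and all(alarm_is_idle(a, now) for a in alarms)
```

A stack that is busy with a large batch would have an alarm in OK or ALARM, or a recent state change, so `safe_to_delete` returns False for it regardless of age.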

ingalls commented 1 year ago

I have also updated the Batch Size to 10. Inference will work fine with this setting (sequential inferencing is supported), but it will be greatly sped up once https://github.com/tnc-ca-geo/animl-api/pull/100 is merged and deployed.

tnc-ca-geo / animl-ingest

Stack Stragglers #57
