tnc-ca-geo / animl-ingest

Lambda function for processing camera trap images

Stack Stragglers #57

Closed ingalls closed 1 year ago

ingalls commented 1 year ago

Context

Every hour an EventBridge cron fires to remove stacks that got stuck in strange error states and failed to clean up. One example would be a stack that was created but never populated with images to process.

This PR adds the actual "action" to this EventBridge cron: it lists all CloudFormation stacks and then filters them by the following criteria:

cc/ @nathanielrindlaub The open question here is how long a stack should be allowed to run/exist before it is cleaned up.

nathanielrindlaub commented 1 year ago

Would the clean-up apply to stacks that may be executing as intended but are just taking a long time because they include a lot of images? Or is this strictly for failed/finished stacks that somehow didn't get deleted when their queue count dropped to zero?

ingalls commented 1 year ago

@nathanielrindlaub I added the alarm checking, so it will only delete a stack that has been around for >24 hours and whose alarm has remained in the INSUFFICIENT_DATA state, which should not be true for stacks that are actively processing large batches.
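
For illustration only, the rule described above could be sketched roughly as follows. This is not the code in this PR. The function names and the `alarms_by_stack` mapping are hypothetical, and the dict fields (`StackName`, `CreationTime`, `StateValue`) are assumed to mirror what CloudFormation `DescribeStacks` and CloudWatch `DescribeAlarms` return. The sketch checks only the alarm's current state, and it keeps any stack that has no matching alarm.

```python
from datetime import datetime, timedelta, timezone

# 24-hour threshold taken from the comment above.
MAX_AGE = timedelta(hours=24)


def is_straggler(stack, alarm, now):
    """Return True if a stack looks abandoned under the two assumed rules."""
    # Rule 1: the stack has existed for more than 24 hours.
    old_enough = now - stack["CreationTime"] > MAX_AGE
    # Rule 2: its alarm currently reports INSUFFICIENT_DATA.
    # A stack with no known alarm is never treated as a straggler.
    idle = alarm is not None and alarm["StateValue"] == "INSUFFICIENT_DATA"
    return old_enough and idle


def find_stragglers(stacks, alarms_by_stack, now=None):
    """Return the names of stacks that would be candidates for deletion.

    `alarms_by_stack` is a hypothetical mapping of stack name to alarm dict;
    how alarms are actually associated with stacks is not shown in this thread.
    """
    now = now or datetime.now(timezone.utc)
    return [
        s["StackName"]
        for s in stacks
        if is_straggler(s, alarms_by_stack.get(s["StackName"]), now)
    ]
```

Under these assumptions, a stack that is older than 24 hours but whose alarm is in `OK` or `ALARM` is left alone, which is the case of a large batch that is still being processed.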

ingalls commented 1 year ago

I have also updated the Batch Size to 10. While this will run inference just fine (sequential inferencing is supported), it will be greatly sped up once https://github.com/tnc-ca-geo/animl-api/pull/100 is merged and deployed.