tnc-ca-geo / animl-api

Backend for https://animl.camera

Batch inference failures not getting kicked back to the queue? #132

Closed: nathanielrindlaub closed this 11 months ago

nathanielrindlaub commented 11 months ago

@ingalls, one of our users just reported an odd issue: she had set up an automation rule to run Megadetector, followed by an automation rule to run MIRAv2 (our Santa Cruz Island species classifier), and then she uploaded a few hundred images via bulk upload. She got predictions back for all of the images from Megadetector, but very few from MIRA.

I looked into this, and almost all of the images had failed inference with an error that read: `Error: ModelNotReadyException: Model for endpoint mirav2-concurrency-80 variant AllTraffic is not ready for inference yet.`

I think that particular SageMaker serverless endpoint (the high-concurrency bulk endpoint for that classifier) was extremely cold, and the extra start-up time likely led to it throwing those errors.
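For context, here is roughly what that failure looks like from the caller's side. This is a minimal sketch, not the actual animl-api code; it assumes the AWS SDK v3 SageMaker Runtime client, and the `invokeClassifier` function name and payload shape are hypothetical:

```typescript
// Minimal sketch (not the actual animl-api code) of invoking a serverless
// SageMaker endpoint and surfacing cold-start failures to the caller.
import {
  SageMakerRuntimeClient,
  InvokeEndpointCommand,
} from '@aws-sdk/client-sagemaker-runtime';

const smr = new SageMakerRuntimeClient({ region: 'us-west-2' });

export async function invokeClassifier(payload: object): Promise<unknown> {
  try {
    const res = await smr.send(new InvokeEndpointCommand({
      EndpointName: 'mirav2-concurrency-80', // endpoint named in the error above
      ContentType: 'application/json',
      Body: JSON.stringify(payload),
    }));
    return JSON.parse(new TextDecoder().decode(res.Body));
  } catch (err) {
    // While a serverless endpoint is still warming up, SageMaker returns
    // ModelNotReadyException. Rethrowing (rather than swallowing the error)
    // lets the SQS-driven handler mark the message as failed so it can be retried.
    if ((err as Error).name === 'ModelNotReadyException') {
      console.warn('Endpoint cold; message should be returned to the queue');
    }
    throw err;
  }
}
```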

That should have been fine if the messages had been returned to the queue and retried. However, it doesn't look like any of them were attempted more than once (see this log as an example: https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#logsV2:log-groups/log-group/$252Faws$252Flambda$252Faniml-api-prod-batchinference/log-events$3FfilterPattern$3D$2522island_spotted_skunks$253A8a14ff6ce1b33b931dd6141e9458b379$2522$26start$3D1699948800000$26end$3D1700121599000).

Do you have any thoughts on why that might be? Are we sure the Partial Batch Response implementation is working? I quickly scanned the best-practices docs and it looks like we're doing everything by the book, so it's puzzling.
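For reference, the pattern in the AWS best-practices docs amounts to the handler returning the IDs of only the failed messages so SQS can make them visible again for retry. A minimal sketch (not the actual animl-api handler; `runInference` is a hypothetical stand-in for the per-message work):

```typescript
// Minimal sketch of the SQS Partial Batch Response pattern. For this to take
// effect, the Lambda event source mapping must also have
// ReportBatchItemFailures enabled.
import type { SQSEvent, SQSBatchResponse, SQSBatchItemFailure } from 'aws-lambda';

export async function handler(event: SQSEvent): Promise<SQSBatchResponse> {
  const batchItemFailures: SQSBatchItemFailure[] = [];

  for (const record of event.Records) {
    try {
      // Hypothetical stand-in for the real per-message inference call.
      await runInference(JSON.parse(record.body));
    } catch (err) {
      // Only failed messages are reported back; SQS makes them visible again
      // and retries them until maxReceiveCount is hit, then moves them to the DLQ.
      batchItemFailures.push({ itemIdentifier: record.messageId });
    }
  }

  return { batchItemFailures };
}

async function runInference(msg: unknown): Promise<void> {
  /* call the model endpoint here */
}
```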

Here's an example in the logs of a whole "batch" of messages failing: https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#logsV2:log-groups/log-group/$252Faws$252Flambda$252Faniml-api-prod-batchinference/log-events/2023$252F11$252F14$252F$255B$2524LATEST$255D2fdb7e2386b843bc901b8117b7c0039d$3Fstart$3D1699997724367$26refEventId$3D37911216091420435494681423085604066914917414767427190816

ingalls commented 11 months ago

@nathanielrindlaub This one is sadly a little hard to debug. I've gone through the logs and am definitely seeing the same behavior you are; I'm not sure why yet. A few questions:

ingalls commented 11 months ago

[screenshot: CloudWatch DLQ metrics]

It doesn't appear that the DLQ reported anything other than a 0 state to CloudWatch.

Batch ID: c7ed3aed-cb0c-4063-a7b4-d61f81f57979
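For reference, messages only reach the DLQ after exhausting the source queue's redrive policy, so a DLQ stuck at 0 suggests the failed messages never used up their retries. A hypothetical sketch of that kind of policy (the actual animl-api queue settings may differ):

```typescript
// Hypothetical sketch of a redrive policy (the real animl-api settings may
// differ): a message only moves to the DLQ after it has been received
// maxReceiveCount times without being deleted, so a DLQ sitting at 0 means
// the failed messages never exhausted their retries.
import { SQSClient, SetQueueAttributesCommand } from '@aws-sdk/client-sqs';

const sqs = new SQSClient({ region: 'us-west-2' });

export async function attachDeadLetterQueue(queueUrl: string, dlqArn: string): Promise<void> {
  await sqs.send(new SetQueueAttributesCommand({
    QueueUrl: queueUrl,
    Attributes: {
      RedrivePolicy: JSON.stringify({
        deadLetterTargetArn: dlqArn,
        maxReceiveCount: '5', // matches the "5 retries" mentioned later in this thread
      }),
    },
  }));
}
```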

ingalls commented 11 months ago

Inference Queue: [screenshot]

ingalls commented 11 months ago

I'm not seeing a clear call by the ingest-delete function to clean up the stack. Was the stack deleted manually?

nathanielrindlaub commented 11 months ago

No, definitely not.


ingalls commented 11 months ago

We've traced the error down to the user calling the StopBatch function, which cleared the SQS queue before the five retries could run.

[screenshot]

We manually uploaded a new set of images that hit another cold inference Lambda and observed similar behavior. However, without the StopBatch call, the images were retried as expected and succeeded once they hit the now-warm inference endpoint.
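In other words, clearing the queue also discards any failed messages that were waiting to be redelivered. A hypothetical sketch of what a StopBatch-style cleanup amounts to (the real implementation may differ):

```typescript
// Hypothetical sketch of a StopBatch-style cleanup (the real implementation
// may differ): purging the queue deletes every pending message, including
// failed messages that were waiting to be redelivered for retry, which is
// why the retries never ran.
import { SQSClient, PurgeQueueCommand } from '@aws-sdk/client-sqs';

const sqs = new SQSClient({ region: 'us-west-2' });

export async function stopBatch(inferenceQueueUrl: string): Promise<void> {
  await sqs.send(new PurgeQueueCommand({ QueueUrl: inferenceQueueUrl }));
}
```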