Closed: nathanielrindlaub closed this issue 11 months ago
@nathanielrindlaub This one is sadly a little hard to debug. I've gone through it and am definitely seeing the same behavior that you are, but I'm not sure why yet. A few questions:
Did you use the redriveBatch functionality? It doesn't appear as though the DLQ reported anything other than a 0 state to CloudWatch.
Batch ID: c7ed3aed-cb0c-4063-a7b4-d61f81f57979
Inference Queue:
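If it helps to double-check that 0 reading outside of CloudWatch, here's a quick sketch of querying the DLQ depth directly with the AWS SDK (illustrative only, not animl-api code; the queue URL is a placeholder, not the actual batch DLQ):

```ts
import { SQSClient, GetQueueAttributesCommand } from '@aws-sdk/client-sqs';

const sqs = new SQSClient({ region: 'us-west-2' });

// Returns the approximate number of messages currently sitting in the DLQ.
export async function getDlqDepth(dlqUrl: string): Promise<number> {
  const { Attributes } = await sqs.send(
    new GetQueueAttributesCommand({
      QueueUrl: dlqUrl, // placeholder for the batch's dead-letter queue URL
      AttributeNames: ['ApproximateNumberOfMessages'],
    }),
  );
  return Number(Attributes?.ApproximateNumberOfMessages ?? 0);
}
```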
I'm not seeing a clear call by the ingest-delete function to clean up the stack. Was the stack deleted manually?
No, definitely not.
We've traced the error down to the user calling the StopBatch function, which cleared the SQS queue before the 5 retries could run.
We manually uploaded a new set of images that hit another cold inferencing Lambda and observed similar behavior. However, without the StopBatch call, the images were retried as expected and, upon hitting a now-warm inferencer, succeeded.
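To spell out the interaction (a minimal sketch, not the actual StopBatch implementation, which tears down the whole batch stack; queueUrl is a placeholder): records that have just failed are held invisible by SQS while they wait out their visibility timeout, so clearing the queue during that window discards them before any of their remaining retries can run.

```ts
import {
  SQSClient,
  GetQueueAttributesCommand,
  PurgeQueueCommand,
} from '@aws-sdk/client-sqs';

const sqs = new SQSClient({ region: 'us-west-2' });

// Hypothetical stop-batch helper to illustrate the failure mode.
export async function stopBatch(queueUrl: string): Promise<void> {
  const { Attributes } = await sqs.send(
    new GetQueueAttributesCommand({
      QueueUrl: queueUrl,
      AttributeNames: [
        'ApproximateNumberOfMessages', // waiting to be delivered
        'ApproximateNumberOfMessagesNotVisible', // in flight or awaiting retry
      ],
    }),
  );

  const inFlight = Number(Attributes?.ApproximateNumberOfMessagesNotVisible ?? 0);
  if (inFlight > 0) {
    // These include records that just failed (e.g. with ModelNotReadyException)
    // and are waiting to become visible again so they can be retried.
    console.warn(`${inFlight} in-flight messages will be dropped by this purge`);
  }

  // Clearing the queue removes everything, so those retries never happen.
  await sqs.send(new PurgeQueueCommand({ QueueUrl: queueUrl }));
}
```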
@ingalls, one of our users just reported an odd issue: she had set up an automation rule to run Megadetector, then an automation rule to run MIRAv2 (our Santa Cruz Island species classifier), and then she uploaded a few hundred images via bulk upload. She got predictions back for all of the images from Megadetector, but very few for MIRA.
I looked into this, and almost all of the images had failed inference with an error that read:
Error: ModelNotReadyException: Model for endpoint mirav2-concurrency-80 variant AllTraffic is not ready for inference yet.
I think that particular SageMaker serverless endpoint (the high-concurrency bulk endpoint for that classifier) was extremely cold and so may have had some extra start-up time that led to it throwing those errors.
That should have been fine if the messages had been returned to the queue and retried, but it doesn't look like any of them were tried more than once (see this log as an example: https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#logsV2:log-groups/log-group/$252Faws$252Flambda$252Faniml-api-prod-batchinference/log-events$3FfilterPattern$3D$2522island_spotted_skunks$253A8a14ff6ce1b33b931dd6141e9458b379$2522$26start$3D1699948800000$26end$3D1700121599000).
Do you have thoughts as to why that may be? Are we sure the Partial Batch Response implementation is working? I just quickly scanned the best practices, and it looks like we're doing it all by the book, so it's kind of puzzling.
Here's an example in the logs of a whole "batch" of messages failing: https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#logsV2:log-groups/log-group/$252Faws$252Flambda$252Faniml-api-prod-batchinference/log-events/2023$252F11$252F14$252F$255B$2524LATEST$255D2fdb7e2386b843bc901b8117b7c0039d$3Fstart$3D1699997724367$26refEventId$3D37911216091420435494681423085604066914917414767427190816
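For reference, here's the handler shape I'd expect per the Partial Batch Response docs (sketch only, not our actual batchinference code; runInference is a stand-in for the real per-record work): failed records, including ones that throw ModelNotReadyException, get returned in batchItemFailures so SQS redelivers just those messages, and ReportBatchItemFailures has to be enabled on the event source mapping for the response to be honored.

```ts
import type { SQSBatchResponse, SQSEvent, SQSRecord } from 'aws-lambda';

export async function handler(event: SQSEvent): Promise<SQSBatchResponse> {
  const batchItemFailures: SQSBatchResponse['batchItemFailures'] = [];

  for (const record of event.Records) {
    try {
      // Stand-in for the real work: parse the message and call the SageMaker
      // endpoint. A ModelNotReadyException thrown here must not be swallowed,
      // or the record will never be reported as failed and never retried.
      await runInference(record);
    } catch {
      // Report only the failed record; SQS keeps it on the queue and
      // redelivers it after the visibility timeout, up to the redrive
      // policy's maxReceiveCount (the 5 retries mentioned above) before
      // it lands in the DLQ.
      batchItemFailures.push({ itemIdentifier: record.messageId });
    }
  }

  return { batchItemFailures };
}

// Hypothetical placeholder so the sketch is self-contained.
async function runInference(record: SQSRecord): Promise<void> {
  throw new Error(`not implemented: ${record.messageId}`);
}
```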