tnc-ca-geo / animl-api

Backend for https://animl.camera

Batch inference failures not getting kicked back to the queue? #132

Closed: nathanielrindlaub closed this 11 months ago

nathanielrindlaub commented 11 months ago

@ingalls, one of our users just reported an odd issue: she had set up an automation rule to run Megadetector, followed by an automation rule to run MIRAv2 (our Santa Cruz Island species classifier), and then she uploaded a few hundred images via bulk upload. She got predictions back for all of the images from Megadetector, but very few from MIRA.

I looked into this, and almost all of the images had failed inference with an error that read: `Error: ModelNotReadyException: Model for endpoint mirav2-concurrency-80 variant AllTraffic is not ready for inference yet.`

I think that particular SageMaker serverless endpoint (the high-concurrency bulk endpoint for that classifier) was extremely cold, and the extra start-up time likely led to it throwing those errors.
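For context, here is roughly what that failure looks like from the caller's side. This is a minimal sketch, not the actual animl-api code; it assumes the AWS SDK v3 SageMaker Runtime client, and the `invokeClassifier` function name and payload shape are hypothetical:

```typescript
// Minimal sketch (not the actual animl-api code) of invoking a serverless
// SageMaker endpoint and surfacing cold-start failures to the caller.
import {
  SageMakerRuntimeClient,
  InvokeEndpointCommand,
} from '@aws-sdk/client-sagemaker-runtime';

const smr = new SageMakerRuntimeClient({ region: 'us-west-2' });

export async function invokeClassifier(payload: object): Promise<unknown> {
  try {
    const res = await smr.send(new InvokeEndpointCommand({
      EndpointName: 'mirav2-concurrency-80', // endpoint named in the error above
      ContentType: 'application/json',
      Body: JSON.stringify(payload),
    }));
    return JSON.parse(new TextDecoder().decode(res.Body));
  } catch (err) {
    // While a serverless endpoint is still warming up, SageMaker returns
    // ModelNotReadyException. Rethrowing (rather than swallowing the error)
    // lets the SQS-driven handler mark the message as failed so it can be retried.
    if ((err as Error).name === 'ModelNotReadyException') {
      console.warn('Endpoint cold; message should be returned to the queue');
    }
    throw err;
  }
}
```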

That should have been fine if the messages had been returned to the queue and retried. However, it doesn't look like any of them were attempted more than once (see this log as an example: https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#logsV2:log-groups/log-group/$252Faws$252Flambda$252Faniml-api-prod-batchinference/log-events$3FfilterPattern$3D$2522island_spotted_skunks$253A8a14ff6ce1b33b931dd6141e9458b379$2522$26start$3D1699948800000$26end$3D1700121599000).

Do you have any thoughts on why that might be? Are we sure the Partial Batch Response implementation is working? I quickly scanned the best-practices docs and it looks like we're doing everything by the book, so it's puzzling.
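For reference, the pattern in the AWS best-practices docs amounts to the handler returning the IDs of only the failed messages so SQS can make them visible again for retry. A minimal sketch (not the actual animl-api handler; `runInference` is a hypothetical stand-in for the per-message work):

```typescript
// Minimal sketch of the SQS Partial Batch Response pattern. For this to take
// effect, the Lambda event source mapping must also have
// ReportBatchItemFailures enabled.
import type { SQSEvent, SQSBatchResponse, SQSBatchItemFailure } from 'aws-lambda';

export async function handler(event: SQSEvent): Promise<SQSBatchResponse> {
  const batchItemFailures: SQSBatchItemFailure[] = [];

  for (const record of event.Records) {
    try {
      // Hypothetical stand-in for the real per-message inference call.
      await runInference(JSON.parse(record.body));
    } catch (err) {
      // Only failed messages are reported back; SQS makes them visible again
      // and retries them until maxReceiveCount is hit, then moves them to the DLQ.
      batchItemFailures.push({ itemIdentifier: record.messageId });
    }
  }

  return { batchItemFailures };
}

async function runInference(msg: unknown): Promise<void> {
  /* call the model endpoint here */
}
```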

Here's an example in the logs of a whole "batch" of messages failing: https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#logsV2:log-groups/log-group/$252Faws$252Flambda$252Faniml-api-prod-batchinference/log-events/2023$252F11$252F14$252F$255B$2524LATEST$255D2fdb7e2386b843bc901b8117b7c0039d$3Fstart$3D1699997724367$26refEventId$3D37911216091420435494681423085604066914917414767427190816

ingalls commented 11 months ago

@nathanielrindlaub This one is sadly a little hard to debug. I've gone through the logs and am definitely seeing the same behavior you are; I'm not sure why yet. A few questions:

ingalls commented 11 months ago

[screenshot: CloudWatch DLQ metrics]

It doesn't appear that the DLQ reported anything other than a 0 state to CloudWatch.

Batch ID: c7ed3aed-cb0c-4063-a7b4-d61f81f57979
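For reference, messages only reach the DLQ after exhausting the source queue's redrive policy, so a DLQ stuck at 0 suggests the failed messages never used up their retries. A hypothetical sketch of that kind of policy (the actual animl-api queue settings may differ):

```typescript
// Hypothetical sketch of a redrive policy (the real animl-api settings may
// differ): a message only moves to the DLQ after it has been received
// maxReceiveCount times without being deleted, so a DLQ sitting at 0 means
// the failed messages never exhausted their retries.
import { SQSClient, SetQueueAttributesCommand } from '@aws-sdk/client-sqs';

const sqs = new SQSClient({ region: 'us-west-2' });

export async function attachDeadLetterQueue(queueUrl: string, dlqArn: string): Promise<void> {
  await sqs.send(new SetQueueAttributesCommand({
    QueueUrl: queueUrl,
    Attributes: {
      RedrivePolicy: JSON.stringify({
        deadLetterTargetArn: dlqArn,
        maxReceiveCount: '5', // matches the "5 retries" mentioned later in this thread
      }),
    },
  }));
}
```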

ingalls commented 11 months ago

Inference Queue: [screenshot]

ingalls commented 11 months ago

I'm not seeing a clear call by the ingest-delete function to clean up the stack. Was the stack deleted manually?

nathanielrindlaub commented 11 months ago

No, definitely not.


ingalls commented 11 months ago

We've traced the error down to the user calling the StopBatch function, which cleared the SQS queue before the five retries could run.

[screenshot]

We manually uploaded a new set of images that hit another cold inference Lambda and observed similar behavior. However, without the StopBatch call, the images were retried as expected and succeeded once they hit the now-warm inference endpoint.
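In other words, clearing the queue also discards any failed messages that were waiting to be redelivered. A hypothetical sketch of what a StopBatch-style cleanup amounts to (the real implementation may differ):

```typescript
// Hypothetical sketch of a StopBatch-style cleanup (the real implementation
// may differ): purging the queue deletes every pending message, including
// failed messages that were waiting to be redelivered for retry, which is
// why the retries never ran.
import { SQSClient, PurgeQueueCommand } from '@aws-sdk/client-sqs';

const sqs = new SQSClient({ region: 'us-west-2' });

export async function stopBatch(inferenceQueueUrl: string): Promise<void> {
  await sqs.send(new PurgeQueueCommand({ QueueUrl: inferenceQueueUrl }));
}
```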