tnc-ca-geo / animl-ingest

Lambda function for processing camera trap images

Investigate issue with stack getting deleted prematurely #79

Open nathanielrindlaub opened 5 months ago

nathanielrindlaub commented 5 months ago

On 2/29/24, one of the users of the island_spotted_skunks project initiated a batch (see batch record below). All of the image records were successfully created in the DB, but after saving the image records, when the animl-api attempted to post a message to the batch's inference queue here, it returned the following error:

Image Creation Error ApolloError [Error]: QueueDoesNotExist: The specified queue does not exist.
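
For reference, the failing call is conceptually just an SQS SendMessage against the batch's queue URL. Here's a minimal sketch of that path - not the actual animl-api code; the helper name, queue URL, and message shape are made up for illustration:

```typescript
import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";

const sqs = new SQSClient({});

// Hypothetical helper: post an inference request for one image to the
// batch's SQS queue. If the batch-processing stack (and with it the queue)
// has already been torn down, SQS rejects the send with a QueueDoesNotExist
// error, which is what surfaces as the ApolloError above.
async function queueInferenceRequest(queueUrl: string, imageId: string): Promise<void> {
  try {
    await sqs.send(
      new SendMessageCommand({
        QueueUrl: queueUrl, // per-batch queue created by the batch-processing stack
        MessageBody: JSON.stringify({ imageId }),
      })
    );
  } catch (err) {
    if (err instanceof Error && err.name === "QueueDoesNotExist") {
      // "The specified queue does not exist." -- the stack was deleted mid-batch
      throw new Error(`Image Creation Error: ${err.message}`);
    }
    throw err;
  }
}
```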

Here's the timeline of what seems to have happened from what I can piece together from the logs:

  1. Zip upload is successful
  2. AWS Batch Job is initiated, and at 2024-02-29T14:33:58.636-08:00 updates the batch record with uploadComplete: "2024-02-29T22:33:58.635Z" (see Batch logs)
  3. The batch-processing stack infrastructure is successfully provisioned by CloudFormation, and the batch record is updated with processingStart: "2024-02-29T22:34:39.462Z".
  4. The first unzipped image is successfully moved to the ingestion bucket at 2024-02-29T14:35:10.627-08:00 (see Ingestion lambda logs). That image was saved and processed just fine (see image in Animl here: https://animl.camera/app/island_spotted_skunks/653bf7c68697390290086b18?img=island_spotted_skunks:8f5dc7f9eec34e29dee80628c652afdc)
  5. Images continued to get saved and processed correctly for a bit...
  6. Until 2024-02-29T14:48:21.898-08:00, at which point the CloudWatch metric alarm triggered a delete stack event (see IngestDelete logs). This alarm gets triggered when the SQS queue is empty ("10 out of the last 10 datapoints were less than or equal to the threshold (0.0)."). The stack and queue get deleted (see the alarm sketch just below this list).
  7. However, there are still many more images to save and process! They continue to get saved to MongoDB, but when we attempt to submit inference request messages to the batch SQS queue, the queue can't be found. That generates an ImageError saying as much, the error is returned to the Ingestion lambda, and the lambda saves those images to the dead-letter-bucket (see an early example of that here in the image-ingest logs).
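
To make step 6 concrete, here's a rough sketch of the kind of alarm being described, written as CDK-style TypeScript rather than the actual CloudFormation template in this repo; the construct name, the Maximum statistic, and the helper function are my assumptions:

```typescript
import { Duration } from "aws-cdk-lib";
import * as cloudwatch from "aws-cdk-lib/aws-cloudwatch";
import * as sqs from "aws-cdk-lib/aws-sqs";
import { Construct } from "constructs";

// Sketch of the "queue is empty" alarm from step 6: when the queue's
// ApproximateNumberOfMessagesVisible metric is <= 0 for 10 consecutive
// 60-second periods, the alarm fires and (via an action not shown here)
// kicks off the IngestDelete handler that tears down the batch stack.
export function addEmptyQueueAlarm(scope: Construct, queue: sqs.IQueue): cloudwatch.Alarm {
  return new cloudwatch.Alarm(scope, "BatchQueueEmptyAlarm", {
    metric: queue.metricApproximateNumberOfMessagesVisible({
      period: Duration.seconds(60), // the 60-second metric polling period
      statistic: "Maximum",
    }),
    threshold: 0,
    evaluationPeriods: 10, // "10 out of the last 10 datapoints"
    comparisonOperator: cloudwatch.ComparisonOperator.LESS_THAN_OR_EQUAL_TO_THRESHOLD,
  });
}
```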

So the question is, why was the SQS queue "empty" for long enough to trigger the alarm?? If I'm reading our CloudFormation template correctly, the metric polling period is 60 seconds and it has to return 0 messages for 10 consecutive polls, so how could the queue have been empty for 10 minutes?

I can understand the queue potentially being empty very briefly if the inference handler is reading in messages and processing them faster than they can be saved to MongoDB, but I don't have an explanation for a >10 minute period of emptiness.
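
One way to sanity-check that would be to pull the raw per-minute datapoints the alarm actually evaluated for that window. Something like the following should work (AWS SDK v3; the queue name is a placeholder, and the time window is taken from the timeline above):

```typescript
import {
  CloudWatchClient,
  GetMetricStatisticsCommand,
} from "@aws-sdk/client-cloudwatch";

const cw = new CloudWatchClient({});

// Dump the per-minute ApproximateNumberOfMessagesVisible datapoints for the
// window between processingStart (~22:34 UTC) and the delete event (~22:48
// UTC) to see whether the queue really reported 0 for 10 straight minutes.
async function dumpQueueDepth(queueName: string): Promise<void> {
  const res = await cw.send(
    new GetMetricStatisticsCommand({
      Namespace: "AWS/SQS",
      MetricName: "ApproximateNumberOfMessagesVisible",
      Dimensions: [{ Name: "QueueName", Value: queueName }],
      StartTime: new Date("2024-02-29T22:34:00Z"),
      EndTime: new Date("2024-02-29T22:49:00Z"),
      Period: 60,
      Statistics: ["Maximum"],
    })
  );
  for (const dp of res.Datapoints ?? []) {
    console.log(dp.Timestamp?.toISOString(), dp.Maximum);
  }
}
```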

Batch record in question:

{
  "_id": "batch-c2d26855-26b1-4e45-9a59-29d0cc052156",
  "projectId": "island_spotted_skunks",
  "user": "2987c264-2cdd-4f79-97c6-fbf719eb0ecb",
  "uploadedFile": "batch-c2d26855-26b1-4e45-9a59-29d0cc052156.zip",
  "originalFile": "SH4.zip",
  "created": {
    "$date": "2024-02-29T20:41:46.821Z"
  },
  "__v": 0,
  "overrideSerial": "B-297",
  "total": 22889,
  "uploadComplete": {
    "$date": "2024-02-29T22:33:58.635Z"
  },
  "processingStart": {
    "$date": "2024-02-29T22:34:39.462Z"
  },
  "processingEnd": {
    "$date": "2024-02-29T22:48:22.161Z"
  },
  "ingestionComplete": {
    "$date": "2024-02-29T22:57:39.549Z"
  }
}
nathanielrindlaub commented 5 months ago

The user ran the same batch again without issue, so it doesn't appear to be related to that zip file in particular.

ingalls commented 5 months ago

@nathanielrindlaub I've done some poking, and the only thing I can find is the edge case that we are already familiar with - namely, that on very small zip files the SQS queue clears so fast that the approximate number of messages doesn't have time to spike, resulting in the stack getting terminated while it is still processing. This is exactly why I set the alarm to use a fairly large period - to ensure small zip files had time to clear even if the stack was slated for deletion.

nathanielrindlaub commented 5 months ago

@ingalls Hmm, but it doesn't seem like that could have been the case here... this zip had like 15k images, and judging from the logs we were successfully sending messages to the queue without interruption from the moment processing began to the moment the delete alarm went off.