Implement retry policy - Githubissues

tnc-ca-geo / animl-ingest

Lambda function for processing camera trap images

There have been a few occasions in which the image metadata was successfully extracted and posted to the GraphQL API (and stored in the DB), but the transfer of the image(s) to S3 failed. Why this is happening needs more investigation (see the half-dozen images still stuck in limbo in the ingestion bucket), but I think regardless we should:

[ ] increase maximumRetryAttempts in serverless.yml from 0 to ~3
[ ] track which operations (createImage call to API, saving images to S3 buckets) were successful, and
[ ] if one fails, attempt to reverse the previous successful operations (delete image from DB and any S3 buckets it might have successfully landed in)
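The tracking-and-reversal idea in the last two items could be sketched roughly as below. This is only an illustration in Python, not code from animl-ingest: `run_with_rollback`, the step names, and the in-memory `db` and `bucket` sets standing in for the database and an S3 bucket are all hypothetical.

```python
from typing import Callable, List, Tuple

# Each step is (name, action, undo). All names here are hypothetical.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_with_rollback(steps: List[Step]) -> List[str]:
    """Run steps in order. If one raises, undo the steps that already
    succeeded (most recent first) and re-raise the original error."""
    completed: List[Step] = []
    for step in steps:
        name, action, _undo = step
        try:
            action()
        except Exception:
            for done_name, _action, undo in reversed(completed):
                try:
                    undo()
                except Exception as undo_err:
                    # A failed undo is reported but does not mask the original error.
                    print(f"rollback of {done_name} failed: {undo_err}")
            raise
        completed.append(step)
    return [name for name, _action, _undo in completed]


# In-memory stand-ins for the DB and an S3 bucket (assumptions for the demo).
db = set()
bucket = set()


def failing_copy() -> None:
    raise IOError("simulated S3 failure")


demo_steps: List[Step] = [
    ("createImage", lambda: db.add("img-1"), lambda: db.discard("img-1")),
    ("copyToS3", failing_copy, lambda: bucket.discard("img-1")),
]
```

Calling `run_with_rollback(demo_steps)` raises the simulated `IOError` and leaves `db` empty again, because the earlier `createImage` step is undone before the error propagates.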

@ingalls and I discussed this and looked at how ingestion errors are handled now that we're writing them to the ImageErrors collection (see https://github.com/tnc-ca-geo/animl-api/pull/102). We concluded that we don't want to increase maximumRetryAttempts. The first thing that should happen is that we store an image record and get an image ID back; all subsequent errors (including errors opening the image, resizing it, and copying it to S3) will get caught, written to the ImageErrors collection, and exposed to the user. This is desirable behavior because we need the image record and image ID so that we can reference them in the ImageErrors collection if need be. Any retry would therefore attempt to save the image to the DB again and would throw a duplicate image error.
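To make the reasoning concrete, here is a small Python sketch of the record-first flow described above. It is a simplification under stated assumptions: `FakeApi`, `DuplicateImageError`, `ingest`, and keying duplicates on a file hash are all hypothetical stand-ins, not the actual animl-api client or its duplicate check.

```python
class DuplicateImageError(Exception):
    """Raised when an image record already exists (hypothetical name)."""


class FakeApi:
    """In-memory stand-in for the GraphQL API; not the real animl-api client."""

    def __init__(self) -> None:
        self.images = {}        # file hash -> image ID
        self.image_errors = []  # stand-in for the ImageErrors collection

    def create_image(self, file_hash: str) -> str:
        # Assumption: duplicates are detected by some unique key, here a hash.
        if file_hash in self.images:
            raise DuplicateImageError(file_hash)
        image_id = f"img-{len(self.images) + 1}"
        self.images[file_hash] = image_id
        return image_id

    def create_image_error(self, image_id: str, message: str) -> None:
        self.image_errors.append({"image": image_id, "error": message})


def ingest(api: FakeApi, file_hash: str, process) -> str:
    # Step 1: store the record first so there is an image ID to reference.
    image_id = api.create_image(file_hash)
    # Step 2: later failures (open, resize, copy to S3) are caught and
    # recorded against that ID instead of failing the whole invocation.
    try:
        process(image_id)
    except Exception as err:
        api.create_image_error(image_id, str(err))
    return image_id
```

With this shape, a first call whose `process` step fails still returns an image ID and leaves one entry in `image_errors`. Running `ingest` again for the same file, which is what an automatic retry would do, fails immediately at `create_image` with `DuplicateImageError`.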

If an error is thrown in image-ingest, that image will still get moved to the dead-letter bucket. Such an error is most likely due to a corrupted image or a bug in our code, so retrying is unlikely to help anyhow. Closing this out.

Implement retry policy #35