tnc-ca-geo / animl-api

Backend for https://animl.camera

Make `maintenance-mode` more bulletproof #186

Open nathanielrindlaub opened 2 months ago

nathanielrindlaub commented 2 months ago

When we deploy major changes to prod and need to shut down inputs temporarily, we currently put both the ingestion Lambda and the frontend into maintenance mode. For the ingestion Lambda, setting MAINTENANCE_MODE: true pauses the creation of new image records when images are uploaded to the ingestion bucket and instead routes those images to a "parking-lot" bucket. They stay there until we've completed the updates and set maintenance mode back to false, at which point we can move them back to the ingestion bucket for processing.
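The routing decision described above can be sketched roughly as follows. This is only an illustration, not the actual ingestion Lambda code: the bucket names, the function names, and the assumption that MAINTENANCE_MODE arrives as a "true"/"false" environment string are all hypothetical.

```python
import os

# Hypothetical bucket names; the real ones are not given in this issue.
INGESTION_BUCKET = "animl-ingestion"
PARKING_LOT_BUCKET = "animl-parking-lot"


def maintenance_mode_enabled(env=None):
    """Return True if MAINTENANCE_MODE is set to 'true' (assumed string flag)."""
    env = os.environ if env is None else env
    return str(env.get("MAINTENANCE_MODE", "false")).strip().lower() == "true"


def route_upload(key, env=None):
    """Decide what to do with an object that just landed in the ingestion bucket."""
    if maintenance_mode_enabled(env):
        # Do not create an image record; park the object for later replay.
        return {"action": "park", "bucket": PARKING_LOT_BUCKET, "key": key}
    # Normal operation: create the image record and continue processing.
    return {"action": "ingest", "bucket": INGESTION_BUCKET, "key": key}
```

Keeping the decision in a pure function like this makes it easy to test separately from the S3 copy and record-creation side effects.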

When the frontend is in maintenance mode, a splash-screen is displayed that prevents users from accessing the app.

This works ok, but it's not perfect, as we learned today. There are two main problems:

  1. If the frontend is already loaded in a browser tab and the user hasn't refreshed it, they can still access and interact with it (edit labels, initiate bulk uploads) until they refresh the page and their cached files are updated with MAINTENANCE_MODE: true. So we need some way to force the user to refresh the page, perhaps by using Cognito to log out all users at once? Another idea is to add a maintenance mode to the GraphQL API, so that even if a user still has access to the frontend, any actions they take would be rejected by the API.
  2. Users may have initiated bulk uploads before we put the ingestion Lambda into maintenance mode. If the zip was received and the batch job started before we turned maintenance mode on, the batch job would validate and unzip those images, then move them to the ingestion bucket one by one. At that point the ingestion Lambda (now in maintenance mode) would move them to the parking-lot bucket, where they would sit with S3 keys that look like <batchId>/path/to/image.jpg. That is fine until we manually move them from the parking-lot bucket back to the ingestion bucket: because there's a batchId in the key, Animl assumes each image is part of a batch. However, depending on how much time has elapsed, that batch's corresponding SQS queues may already have been torn down, so inference would fail.
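As a rough illustration of the API-side idea from the first problem, a guard could run before every GraphQL operation and reject writes while the flag is on. The names below (MaintenanceModeError, guard_request) are hypothetical, and the choice to block only mutations while still allowing queries is an assumption; rejecting every request would work just as well.

```python
class MaintenanceModeError(Exception):
    """Raised when a request is rejected because the API is in maintenance mode."""


def guard_request(operation_type, maintenance_mode):
    """Reject write operations while maintenance mode is on.

    operation_type is assumed to be 'query' or 'mutation', as reported by
    whatever GraphQL server sits in front of the resolvers.
    """
    if maintenance_mode and operation_type == "mutation":
        raise MaintenanceModeError(
            "The API is in maintenance mode; please refresh and try again later."
        )
    return True
```

With a check like this in place, a stale browser tab could still render, but label edits and bulk-upload requests would fail at the API rather than reaching the database.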

For now, I think the low-tech solution to the second problem will be to add a step to our production deployment workflows to manually check batch logs and the DB and make sure there aren't any fresh uploads that are in progress but haven't yet been fully unzipped. In the DB, those batches would have a created: <date_time> property but wouldn't yet have uploadComplete, processingStart, or ingestionComplete fields. I'm not sure what a less manual approach might look like; I'd have to think some more on that.
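The DB check could eventually be scripted along these lines. This is a sketch under several assumptions: batch records are shown as plain dicts, the `_id` key is illustrative, and the 24-hour window used to decide what counts as "fresh" is an arbitrary cutoff, not a value from the codebase. Only the created, uploadComplete, processingStart, and ingestionComplete field names come from the description above.

```python
from datetime import datetime, timedelta, timezone

# Fields that appear on a batch record once it has progressed past creation.
PROGRESS_FIELDS = ("uploadComplete", "processingStart", "ingestionComplete")


def is_pending_unzip(batch):
    """True if the batch was created but has none of the progress fields yet."""
    has_progress = any(batch.get(field) for field in PROGRESS_FIELDS)
    return batch.get("created") is not None and not has_progress


def find_pending_batches(batches, now, max_age=timedelta(hours=24)):
    """Return recently created batches that have not been fully unzipped.

    max_age is an assumed cutoff for what counts as a 'fresh' upload.
    """
    return [
        batch
        for batch in batches
        if is_pending_unzip(batch) and now - batch["created"] <= max_age
    ]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    sample = [
        {"_id": "batch-a", "created": now - timedelta(minutes=5)},
        {
            "_id": "batch-b",
            "created": now - timedelta(minutes=30),
            "uploadComplete": now - timedelta(minutes=20),
        },
    ]
    for batch in find_pending_batches(sample, now):
        print("Still in progress:", batch["_id"])
```

If the list comes back non-empty, the deployment would wait before turning maintenance mode on.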