lookit / lookit-api

Codebase for Lookit v2 and Experimenter v2. Includes an API. Docs: http://lookit.readthedocs.io/
https://lookit.mit.edu/
MIT License

Video deletion backlog/maintenance #1431

Open becky-gilbert opened 1 month ago

becky-gilbert commented 1 month ago

Summary

Due to problems with our system for video deletion in S3 (#1423, #1430), we have a backlog of videos in S3 that are not in our DB and need to be deleted. We may also want to consider adding a task to check the S3 videos against those in our DB, so that any lingering S3 videos that should be deleted are cleaned up as part of regular maintenance.

Description

We recently found a problem with our S3 video deletion process, and as a result we will need to address the backlog of video files (~300) in S3 that should've been deleted. We can do this by:

  1. getting the file names from the "Video.DoesNotExist" Sentry error that is generated when a file could not be deleted, and/or
  2. comparing the video file names from S3 with those in our DB and removing any from S3 that do not exist in our DB.
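Option 2 boils down to a set difference. Here is a minimal sketch, assuming the S3 object keys and the file names stored in the DB use exactly the same string format (no bucket prefix or path differences). The function name `find_orphaned_keys` is hypothetical and not part of the codebase. In practice the key list could come from a boto3 `list_objects_v2` paginator and the file names from a query on the video model.

```python
from typing import Iterable, Set


def find_orphaned_keys(s3_keys: Iterable[str], db_filenames: Iterable[str]) -> Set[str]:
    """Return S3 keys that have no matching file name in the DB.

    Hypothetical helper: assumes keys and DB file names are directly comparable.
    """
    known = set(db_filenames)  # set lookup keeps the comparison linear in the inputs
    return {key for key in s3_keys if key not in known}


# Example with made-up file names
orphans = find_orphaned_keys(
    ["a.mp4", "b.mp4", "c.mp4"],  # keys listed from S3
    ["a.mp4", "c.mp4"],           # file names found in the DB
)
print(sorted(orphans))
```

The result is only a candidate list. Nothing is deleted here, so a dev can review it before removing anything.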

One question is whether to do this "manually" (i.e., triggered and monitored by a dev, though it could be partially automated with a script that generates a list of files and then deletes them via the AWS CLI) or via a fully automated Celery task. If we were to do this via a Celery task, we would need to put some safeguards in place to ensure that we never accidentally delete videos (e.g., if there were a database connection problem).
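To make the safeguard idea concrete, here is a rough sketch of checks an automated task could run before deleting anything. All names (`UnsafeDeletionError`, `select_keys_to_delete`, `max_orphan_fraction`) and the 10% default are assumptions for illustration, not existing code or agreed values. It aborts if the DB returned no videos at all (which is what a failed or empty query would look like) or if an unexpectedly large share of the bucket would be deleted.

```python
from typing import Iterable, List


class UnsafeDeletionError(Exception):
    """Raised when a cleanup run looks too risky to proceed (hypothetical)."""


def select_keys_to_delete(
    s3_keys: Iterable[str],
    db_filenames: Iterable[str],
    max_orphan_fraction: float = 0.1,  # assumed threshold, would need tuning
) -> List[str]:
    s3_set = set(s3_keys)
    db_set = set(db_filenames)

    # An empty DB result most likely means a connection/query problem,
    # not that every video should be deleted.
    if not db_set:
        raise UnsafeDeletionError("DB returned no video file names; aborting.")

    orphans = s3_set - db_set

    # Refuse to delete an implausibly large share of the bucket in one run.
    if s3_set and len(orphans) / len(s3_set) > max_orphan_fraction:
        raise UnsafeDeletionError(
            f"{len(orphans)} of {len(s3_set)} keys would be deleted; aborting."
        )

    return sorted(orphans)
```

For the initial backlog cleanup, a dev might need to raise the threshold for a single supervised run. The regular maintenance task could then keep a stricter limit.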

Proposal

I suggest we make this a fully-automated Celery task that does the following for all video storage buckets:

Implementation notes: