guardian / grid

The Guardian’s image management system
https://www.theguardian.com/info/developer-blog/2015/aug/12/open-sourcing-grid-image-service
Apache License 2.0

add script to batch delete files from S3 #4115

Closed twrichards closed 1 year ago

twrichards commented 1 year ago

Following on from https://github.com/guardian/grid/pull/4114 and a lot of follow-up data mashing in Athena, we now have a list of S3 paths we can safely delete: they are no longer in ES, and we have a copy of each in the replica bucket (delete marker replication is OFF). The list stands at over 50 million paths in PROD, so we need a way to bulk delete them.

Usage: BulkDeleteS3Files <bucketName> <inputFile> <auditFile>

Input file needs to be a CSV, with a heading row and a single column containing the S3 paths to delete from the specified bucket.
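To illustrate the expected input format, here is a minimal Python sketch of reading such a file. It is not the script from this PR; the function name `read_s3_paths` is hypothetical, and skipping blank lines is an assumption rather than documented behaviour.

```python
import csv
from typing import Iterator


def read_s3_paths(input_file: str) -> Iterator[str]:
    """Yield S3 paths from a single-column CSV, skipping the heading row."""
    with open(input_file, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # discard the heading row
        for row in reader:
            # Assumption: blank lines are ignored rather than treated as errors.
            if row and row[0].strip():
                yield row[0].strip()
```

Yielding paths lazily avoids loading a very large input file (tens of millions of rows) into memory at once.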

This script groups the input paths into batches of 1,000 (the maximum the S3 bulk delete API accepts per request) and reports the success or failure of each S3 path to both the console and the auditFile path provided (CSV output).
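The batching and audit flow described above could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the function names and the audit column names (`path`, `status`, `detail`) are hypothetical, and `client` is assumed to be any object exposing a boto3-style `delete_objects(Bucket=..., Delete=...)` method.

```python
import csv
from typing import Iterable, Iterator, List, Tuple

BATCH_SIZE = 1000  # S3 DeleteObjects accepts at most 1,000 keys per request


def batches(keys: Iterable[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Group keys into lists of at most `size` items."""
    batch: List[str] = []
    for key in keys:
        batch.append(key)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller, batch


def delete_batch(client, bucket: str, keys: List[str]) -> List[Tuple[str, str, str]]:
    """Delete one batch and return (path, status, detail) rows for the audit."""
    response = client.delete_objects(
        Bucket=bucket,
        Delete={"Objects": [{"Key": k} for k in keys], "Quiet": False},
    )
    rows = [(d["Key"], "deleted", "") for d in response.get("Deleted", [])]
    rows += [
        (e["Key"], "failed", f'{e.get("Code", "")}: {e.get("Message", "")}')
        for e in response.get("Errors", [])
    ]
    return rows


def bulk_delete(client, bucket: str, keys: Iterable[str], audit_file: str) -> None:
    """Delete all keys in batches, echoing each result to console and audit CSV."""
    with open(audit_file, "w", newline="") as f:
        audit = csv.writer(f)
        audit.writerow(["path", "status", "detail"])  # hypothetical column names
        for batch in batches(keys):
            for row in delete_batch(client, bucket, batch):
                audit.writerow(row)
                print(",".join(row))
```

With boto3 installed, `boto3.client("s3")` would be a suitable `client`; passing it in as a parameter keeps the sketch testable against a stub.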

Note: the bulk delete API reports 'deleted' even if the path is not found, so the script can be run multiple times without issue.

TESTED in TEST

Processed 385k deletions in a few minutes.

prout-bot commented 1 year ago

Seen on auth, image-loader, metadata-editor, thrall, cropper, collections, kahuna (merged by @twrichards 9 minutes and 24 seconds ago) Please check your changes!

prout-bot commented 1 year ago

Seen on leases, usage, media-api (merged by @twrichards 9 minutes and 30 seconds ago) Please check your changes!