Following on from https://github.com/guardian/grid/pull/4114 and lots of follow-up data mashing in Athena, we now have a list of S3 paths we can safely delete: they are no longer in ES, but we have a copy in the replica bucket (and delete marker replication is OFF). The list stands at over 50 million paths in PROD, so we need a way to bulk delete them.
Input file needs to be a CSV, with a heading row and a single column containing the S3 paths to delete from the specified bucket.
This script groups the input paths into batches of 1000 so it can use the bulk delete API, and reports the success or failure of each S3 path both to the console and to the auditFile path provided (CSV output).
Note: bulk delete API reports 'deleted' if the path is not found, so this can be run multiple times without issue.
Usage: BulkDeleteS3Files <bucketName> <inputFile> <auditFile>
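The batching and audit flow described above can be sketched roughly as follows. This is an illustrative Python sketch, not the script in this PR: the function names (`read_paths`, `batches`, `bulk_delete`) and the audit column names are hypothetical, and `delete_objects` is assumed to be a callable with the same request and response shape as boto3's S3 client `delete_objects` method.

```python
import csv
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List, TextIO

BATCH_SIZE = 1000  # the S3 bulk delete API accepts at most 1000 keys per request


def read_paths(input_csv: TextIO) -> Iterator[str]:
    """Yield the first column of each row, skipping the heading row."""
    reader = csv.reader(input_csv)
    next(reader, None)  # discard the heading row
    for row in reader:
        if row and row[0].strip():
            yield row[0].strip()


def batches(paths: Iterable[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Group paths into lists of at most `size` items."""
    iterator = iter(paths)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def bulk_delete(
    bucket: str,
    input_csv: TextIO,
    audit_csv: TextIO,
    delete_objects: Callable[..., dict],  # assumption: boto3-style delete_objects
) -> Dict[str, int]:
    """Delete every path in input_csv, logging each result to console and audit CSV."""
    audit = csv.writer(audit_csv)
    audit.writerow(["path", "result"])  # hypothetical audit columns
    totals = {"deleted": 0, "failed": 0}
    for batch in batches(read_paths(input_csv)):
        response = delete_objects(
            Bucket=bucket,
            Delete={"Objects": [{"Key": key} for key in batch], "Quiet": False},
        )
        # Keys that were already absent are also reported under "Deleted",
        # which is why re-running over the same input is harmless.
        for item in response.get("Deleted", []):
            audit.writerow([item["Key"], "deleted"])
            print(f"deleted {item['Key']}")
            totals["deleted"] += 1
        for item in response.get("Errors", []):
            result = f"failed: {item.get('Code', 'unknown')}"
            audit.writerow([item["Key"], result])
            print(f"{result} {item['Key']}")
            totals["failed"] += 1
    return totals
```

With boto3 installed, passing `boto3.client("s3").delete_objects` as the last argument should be enough to run this against a real bucket. Injecting the callable also makes it possible to dry-run the batching against a stub first.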
Tested in TEST ✅
Processed 385k deletions in a few minutes.