Once upon a time, there was a process (in the form of a lambda) called 'the reaper' which deleted images (on a regular schedule) according to a list of criteria, but it was turned off out of caution after a significant chunk of images were permanently lost some years ago. This PR rebuilds 'the reaper', this time all within thrall.
Delete the old 'reaper' lambda (and any traces from CI scripts etc.)
Add a new optional config property to `ThrallConfig` (`s3.reaper.bucket` in `thrall.conf`) to specify the bucket name where the permanent records of what was soft & hard deleted via the reaper will be stored (see https://github.com/guardian/editorial-tools-platform/pull/706 for the Guardian) - defining this property is required for the reaper to operate
Two new endpoints added to thrall, both taking a `count` query param (the batch size, max 1000) ...
`doBatchSoftReap`, which 'soft deletes' (with `deletedBy` being `reaper`) the oldest batch of `is:reapable` images which are not already soft-deleted
`doBatchHardReap`, which 'hard deletes' the oldest batch of `is:reapable` images which have been in the 'soft deleted' state for at least two weeks
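The two-week rule for hard reaping can be sketched as a small predicate. This is an illustrative sketch, not the PR's actual code; the object and method names here are hypothetical:

```scala
import java.time.{Duration, Instant}

// Hypothetical sketch of the hard-reap eligibility rule: an image only
// qualifies for hard deletion once it has been soft-deleted for at least
// two weeks.
object HardReapRule {
  val MinSoftDeletedAge: Duration = Duration.ofDays(14)

  def isHardReapable(softDeletedAt: Instant, now: Instant): Boolean =
    Duration.between(softDeletedAt, now).compareTo(MinSoftDeletedAge) >= 0
}
```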
The new `ReaperController`, which contains the above endpoints, also has a 'schedule' (every 15mins) which [IF the `s3.reaper.bucket` config property is defined, otherwise doesn't run]...
queries the number of images uploaded in the last 7 days, then divides that by the number of 15-minute intervals in 7 days (672) to get the number of images ingested per 15mins
calls `doBatchSoftReap` and `doBatchHardReap` with that per-15-minute count as the batch size - this ensures we delete at the same rate we ingest for a given environment (at the Guardian, our `TEST` environment ingests roughly 1% of what `PROD` ingests)
we report the counts of images soft and hard reaped via new CloudWatch metrics
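The batch-size arithmetic above can be sketched as follows. This is a hypothetical sketch: the object and method names are illustrative, and the exact rounding the PR uses may differ (rounding up is assumed here), but the shape of the calculation matches the description (uploads over 7 days, divided across 672 fifteen-minute intervals, capped at the endpoints' max of 1000):

```scala
// Hypothetical sketch of the schedule's batch-size calculation.
object ReapRate {
  // 15-minute intervals in 7 days: 7 days * 24 hours * 4 intervals/hour
  val IntervalsPerWeek: Int = 7 * 24 * 4 // = 672

  def batchSize(uploadsLastWeek: Long, maxBatch: Int = 1000): Int =
    math.min(
      maxBatch.toLong,
      math.ceil(uploadsLastWeek.toDouble / IntervalsPerWeek).toLong
    ).toInt
}
```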
The new reaper process can be 'paused' by the presence of a file named `PAUSED` at the root of the new `s3.reaper.bucket` bucket. This is checked on each execution of the schedule, which exits early (with a log message) if paused.
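The pause rule itself is simple enough to sketch. In this hypothetical sketch (names are illustrative, not the PR's code), the S3 key-existence check is abstracted as a function so the rule can be shown without an S3 client; in practice that function would check the configured `s3.reaper.bucket` for the `PAUSED` key:

```scala
// Hypothetical sketch of the pause check run at the start of each scheduled
// execution: the reaper is paused whenever a PAUSED object exists at the
// root of the configured bucket.
object ReaperPause {
  val PauseMarkerKey: String = "PAUSED"

  def shouldRun(keyExists: String => Boolean): Boolean =
    !keyExists(PauseMarkerKey)
}
```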
Lastly, `ReaperController` provides a few more endpoints
`POST` endpoint for pausing (creating the `PAUSED` file as described above)
`POST` endpoint for resuming from the paused state (deleting the `PAUSED` file as described above)
`GET` endpoint for reading a record file from the bucket
finally, an HTML view of
whether the Reaper is paused or not (with a button to toggle that, using the endpoints above)
a list of the last day's worth of record files
NOTE: we have the endpoints exposed (in addition to being called by the schedule) so that they can be called manually to, for example, clear a backlog if the reaper hasn't been running for whatever reason (either historically or because it was paused using the functionality above)
https://trello.com/c/DrGAH8Y0/893-turn-on-the-reaper