Since the checker runs nightly and has not notified us, it seems safe to assume running it manually will NOT find anything.
That doesn't mean no orphans are there.
Here are the errors we got: the app tried to delete files on S3 but got rate-limit errors from AWS while deleting, which suggests some files must have been left behind, right?
So the hard part will be finding those files.
Apr 13 09:46:04 scihist-digicoll-production app/worker.4: ERROR: [ActiveJob] [fa6e0d39-a098-4c42-b448-90fe11e10be6] Stopped retrying DeleteDziJob (Job ID: fa6e0d39-a098-4c42-b448-90fe11e10be6) after 2 attempts, due to a Aws::S3::Errors::SlowDown (Please reduce your request rate.).
Apr 13 09:46:06 scihist-digicoll-production app/worker.5: ERROR: [ActiveJob] [41b8bd74-2c4e-4f6c-a8ac-6938a13dca65] Stopped retrying DeleteDziJob (Job ID: 41b8bd74-2c4e-4f6c-a8ac-6938a13dca65) after 2 attempts, due to a Aws::S3::Errors::SlowDown (Please reduce your request rate.).
Apr 13 09:46:07 scihist-digicoll-production app/worker.7: ERROR: [ActiveJob] [e6c14226-919f-4837-9863-7a1ee7b0a84a] Stopped retrying DeleteDziJob (Job ID: e6c14226-919f-4837-9863-7a1ee7b0a84a) after 2 attempts, due to a Aws::S3::Errors::SlowDown (Please reduce your request rate.).
Apr 13 09:46:07 scihist-digicoll-production app/worker.4: ERROR: [ActiveJob] [1f96c524-0d81-459e-9793-f2340efc04d7] Stopped retrying DeleteDziJob (Job ID: 1f96c524-0d81-459e-9793-f2340efc04d7) after 2 attempts, due to a Aws::S3::Errors::SlowDown (Please reduce your request rate.).
Apr 13 09:46:07 scihist-digicoll-production app/worker.5: ERROR: [ActiveJob] [2528aa2d-019b-4237-9dca-d02730177208] Stopped retrying DeleteDziJob (Job ID: 2528aa2d-019b-4237-9dca-d02730177208) after 2 attempts, due to a Aws::S3::Errors::SlowDown (Please reduce your request rate.).
Apr 13 09:46:07 scihist-digicoll-production app/worker.6: ERROR: [ActiveJob] [c7fb0e57-97cf-4101-9e15-81b9ce245b7a] Stopped retrying DeleteDziJob (Job ID: c7fb0e57-97cf-4101-9e15-81b9ce245b7a) after 2 attempts, due to a Aws::S3::Errors::SlowDown (Please reduce your request rate.).
Apr 13 09:46:08 scihist-digicoll-production app/worker.5: ERROR: [ActiveJob] [39a004d1-38dc-4b3c-8a68-5519a4e54204] Stopped retrying DeleteDziJob (Job ID: 39a004d1-38dc-4b3c-8a68-5519a4e54204) after 2 attempts, due to a Aws::S3::Errors::SlowDown (Please reduce your request rate.).
Warning: Both PaperTrail and HoneyBadger only keep data for so long. You should be able to take those Job IDs and figure out from PaperTrail and/or HoneyBadger exactly which Assets the jobs were trying to delete... but if you wait until those services purge the data as too old (7 days? Not sure), that could become impossible.
So you may want to at least try to capture any context available from PaperTrail or Honeybadger now, before it's too late?
Correction: the orphan checker runs weekly on Wednesday nights.
No orphaned DZI tiles files found.
INFO: ** [Honeybadger] Initializing Honeybadger Error Tracker for Ruby. Ship it! version=4.9.0 framework=rails level=1 pid=6
Time: 00:26:20 Progress: |====================================================================================== | 34.07/s 53831/59522 90% ETA: 00:02:47
Total Asset count: 59522
Iterated through 53831 tile files on S3
Found 0 orphan files
HoneyBadger info is (currently) at https://app.honeybadger.io/projects/58989/faults/67258136 . It won't be there for much longer.
Note: PaperTrail does keep logs for a year (https://papertrailapp.com/account/archives and select April 13, then download compressed hourly log file 2022-04-13-13.tsv).
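In case anyone needs to do this again later, a minimal sketch for pulling the relevant lines out of the downloaded archive (this assumes the hourly file has already been decompressed locally; the Job IDs are just the ones from the log excerpt above):

```ruby
# Print every PaperTrail archive line that mentions one of the failed jobs.
job_ids = %w[
  fa6e0d39-a098-4c42-b448-90fe11e10be6
  41b8bd74-2c4e-4f6c-a8ac-6938a13dca65
  e6c14226-919f-4837-9863-7a1ee7b0a84a
  1f96c524-0d81-459e-9793-f2340efc04d7
  2528aa2d-019b-4237-9dca-d02730177208
  c7fb0e57-97cf-4101-9e15-81b9ce245b7a
  39a004d1-38dc-4b3c-8a68-5519a4e54204
]

File.foreach("2022-04-13-13.tsv") do |line|
  puts line if job_ids.any? { |id| line.include?(id) }
end
```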
Checking the asset IDs listed in the failed jobs' metadata for April 15th against S3 does suggest that 3 of the failed jobs left undeleted files in S3:
The assets for those IDs no longer exist, but we do have tiles for them.
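For reference, this kind of spot-check is easy to do from a console with the SDK. A sketch only; the bucket name and prefix below are placeholders, not our real values:

```ruby
require "aws-sdk-s3"

# Does anything still exist under a given tile-directory prefix in the DZI bucket?
s3 = Aws::S3::Client.new
resp = s3.list_objects_v2(
  bucket: "scihist-digicoll-production-dzi", # placeholder bucket name
  prefix: "somemd5_files/",                  # placeholder tile-directory prefix
  max_keys: 5
)
puts "#{resp.key_count} object(s) still under that prefix"
resp.contents.each { |obj| puts obj.key }
```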
I assume rerunning the DeleteDziJobs directly from the failed jobs queue would get rid of these 3 directories, but I don't want to destroy the evidence just yet ...
Not totally sure if it would, it might get confused about the works no longer existing.
But once you've identified the scene of the crime (good forensics, thanks!), it's easy enough to just delete the specific S3 directory(ies) with either the AWS console or the aws CLI; I'd just do that when you're ready!
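(If a Rails console is handier than the AWS console or CLI, the SDK's resource API can do the same thing. A sketch; bucket name and prefix are placeholders:)

```ruby
require "aws-sdk-s3"

# Deletes everything under a prefix, batching up to 1000 keys per request.
bucket = Aws::S3::Resource.new.bucket("scihist-digicoll-production-dzi") # placeholder
bucket.objects(prefix: "somemd5_files/").batch_delete!                   # placeholder prefix
```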
Sounds good.
Telling detail: in all three cases the `.dzi` file itself was removed (which makes sense -- the dzi has the shallowest depth), but the `md5_*_files` directory still has stuff in it.
This suggests 2 things to me; correct me if I'm wrong:
1) Our DZI orphan checker really only needs to look at folders at the top level of the S3 bucket; it doesn't need to care about the many, many individual files below that level.
2) I would really like the DZI orphan checker to be able to successfully detect excess top-level folders in this way, even if we treat deleting them as an "extra credit" feature.
Yes, that all makes sense!
I remember the challenge with the DZI orphan checker was keeping it fast enough and affordable enough when dealing with literally millions of files (there are currently 25 million files in the production dzi bucket).
It's possible there's a way to do that and still do what you're asking for, now that we understand the situation and possibilities better. At the time, what I did was try to do just enough to find orphans while avoiding the AWS fee expense and slowness of dealing with millions of files/requests.
Remember that on S3 there aren't really folders, just files with paths and shared path prefixes. So you can't necessarily save time/money by dealing "only with folders"; under the hood you may still have to deal with all the files, including paying for it and waiting for it. To find out "what are all the top-level folders," you may actually have to iterate over every file. But I'm not sure, there might be a way! Be careful with AWS request fee $$ though!
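One possibility worth checking: the ListObjectsV2 API accepts a delimiter parameter that rolls keys up into common_prefixes, and each rolled-up prefix counts as just one entry against the 1000-per-page limit. If that holds up for our bucket, listing all the top-level "folders" should take on the order of one LIST request per thousand top-level entries, not one per file. A rough sketch (bucket name is a placeholder), untested against the real bucket:

```ruby
require "aws-sdk-s3"

# Collect top-level "folders" (common prefixes) plus top-level files in the DZI
# bucket, without enumerating every individual tile file.
s3 = Aws::S3::Client.new
top_level = []
params = { bucket: "scihist-digicoll-production-dzi", delimiter: "/" } # placeholder bucket name

loop do
  resp = s3.list_objects_v2(params)
  top_level.concat(resp.common_prefixes.map(&:prefix)) # e.g. "somemd5_files/"
  top_level.concat(resp.contents.map(&:key))           # top-level files like "somemd5.dzi"
  break unless resp.is_truncated
  params[:continuation_token] = resp.next_continuation_token
end

puts "#{top_level.size} top-level entries"
```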
Rerunning the 3 failed jobs did in fact delete the orphaned directories, which is good to know for next time this happens.
I'll create a ticket for looking at the s3 orphan DZI checker.
Run the DZI orphan checker manually and see if it catches anything (we suspect it might, given that AWS complained about too many simultaneous deletion requests). Regardless, go over the code; Jonathan notes:
_Looks like we call delete_prefixed in shrine storage..._ _Then shrine storage... yep just calls SDK API with a prefix to make a "collection" of objects, and then batch_delete on the collection. I don't think there's any AWS API to delete files by prefix in one HTTP call._

This makes it hard to introduce a delay so we don't do too many files at once, since the actual delete happens several layers of dependency down! We would have to stop using the shrine method or PR a feature to it, I guess.
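If we do end up bypassing the shrine method, a throttled delete-by-prefix isn't much code against the SDK directly. A sketch only: the bucket name and prefix are placeholders, and the one-second pause is arbitrary:

```ruby
require "aws-sdk-s3"

s3 = Aws::S3::Client.new
bucket = "scihist-digicoll-production-dzi" # placeholder bucket name
prefix = "somemd5_files/"                  # placeholder tile-directory prefix

loop do
  # list_objects_v2 returns at most 1000 keys per call, which is also the most
  # delete_objects will accept, so one listing page == one delete batch.
  page = s3.list_objects_v2(bucket: bucket, prefix: prefix)
  keys = page.contents.map { |obj| { key: obj.key } }
  break if keys.empty?

  s3.delete_objects(bucket: bucket, delete: { objects: keys, quiet: true })
  sleep 1 # crude throttle so we don't trigger SlowDown again
end
```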