sul-dlss / preservation_catalog

Rails application to track, audit and replicate archival artifacts associated with SDR objects.
https://sul-dlss.github.io/preservation_catalog/

Develop process for bulk remediation of PartReplicationAuditJob errors #1733

Closed justinlittman closed 1 year ago

justinlittman commented 3 years ago

We encountered spates of PartReplicationAuditJob Honeybadger (HB) alerts: https://app.honeybadger.io/projects/54415/faults/64996758

The current remediation procedures document how to handle single instances of these sorts of errors. Procedures are needed for how to investigate and remediate these errors in bulk.

jmartin-sul commented 3 years ago

a bit more context from discussion on slack:

@julianmorley:

Soooo…. I ran a check over the weekend and it turns out there are 9,491 druid-versions that are not replicated to AWS east. @andrewjbtw @jmartin-sul. I expect the number to be similar on the other two endpoints, but more importantly, I expect them to be for the same druid-versions, so we have about 10K druids that are not recoverable right now.

Very rough spot check and they do seem to be clustered around April’20 and August’20, so almost certainly stragglers from the issues back then.

@jmartin-sul:

good to know. i have a couple queries that look for replication shortfall, that i've run a few times since we first started in on storage migrations last year, and 10k doesn't feel too far off from what i remember seeing after running those.

julianmorley commented 3 years ago

Here's the list for aws-us-east: aws-east-missing.log

julianmorley commented 3 years ago

FWIW the majority of the missed druid-versions so far seem to have occurred on 10/2/2020, with 4/29/20 and 9/18/20 being other common, but less popular, dates. There's also a smattering of other dates, but no other obvious patterns.

jmartin-sul commented 3 years ago

existing relevant documentation: https://github.com/sul-dlss/preservation_catalog/wiki/Investigating-a-druid-with-replication-errors https://github.com/sul-dlss/preservation_catalog/wiki/Cleaning-up-delivery-failures

both geared towards remediating small numbers (e.g. 1-5) of objects.

julianmorley commented 3 years ago

OK, I've finally finished the audit of cloud endpoints: this is all based on what was in PresCat on 7/8/21.

There's 9,365 druid-versions that are not on any cloud endpoint, and so are not currently being preserved, plus a few hundred more that are missing from 1 or more (but not all 3) cloud endpoints.

In an ideal world, we'd just pipe a csv of these missing druid-versions into the prescat rails console and zipmaker would do its thing, but I don't know if that would work, esp. not if PresCat thought it had already successfully transferred these druid-versions (which I think it does, otherwise wouldn't an audit show them as failed?).

sul-sdr-ibm-us-south-1-prod-missing.log
sul-sdr-aws-us-west-2-archive-missing.log
sul-sdr-aws-us-east-1-archive-missing.log
common-missing.log

julianmorley commented 3 years ago

Using the code from https://github.com/sul-dlss/preservation_catalog/wiki/Investigating-a-druid-with-replication-errors and a random failed druid, I got this debug out:

[4] pry(main)> zip_part_debug_info(druid)
=> [["bb064yd9457", 7, 1, "aws_s3_east_1", "ok", ".zip", 1, 158043406, Wed, 25 Sep 2019 16:53:57.028171000 UTC +00:00, Wed, 14 Jul 2021 19:27:19.250895000 UTC +00:00],
 ["bb064yd9457", 7, 1, "aws_s3_west_2", "ok", ".zip", 1, 158043406, Thu, 16 Aug 2018 03:06:16.540020000 UTC +00:00, Wed, 14 Jul 2021 14:49:14.283271000 UTC +00:00],
 ["bb064yd9457", 7, 1, "ibm_us_south", "ok", ".zip", 1, 158043406, Wed, 06 Mar 2019 01:24:48.408286000 UTC +00:00, Wed, 14 Jul 2021 16:22:10.108068000 UTC +00:00],
 ["bb064yd9457", 7, 2, "aws_s3_east_1", "ok", ".zip", 1, 34035, Wed, 25 Sep 2019 16:53:56.079124000 UTC +00:00, Wed, 14 Jul 2021 19:27:19.482165000 UTC +00:00],
 ["bb064yd9457", 7, 2, "aws_s3_west_2", "ok", ".zip", 1, 34035, Thu, 16 Aug 2018 03:06:05.695723000 UTC +00:00, Wed, 14 Jul 2021 14:49:14.434307000 UTC +00:00],
 ["bb064yd9457", 7, 2, "ibm_us_south", "ok", ".zip", 1, 34035, Wed, 06 Mar 2019 01:24:50.903194000 UTC +00:00, Wed, 14 Jul 2021 16:22:10.278724000 UTC +00:00],
 ["bb064yd9457", 7, 3, "aws_s3_east_1", "ok", ".zip", 1, 41795, Wed, 25 Sep 2019 16:53:55.987318000 UTC +00:00, Wed, 14 Jul 2021 19:27:19.780137000 UTC +00:00],
 ["bb064yd9457", 7, 3, "aws_s3_west_2", "ok", ".zip", 1, 41795, Thu, 16 Aug 2018 03:06:05.649856000 UTC +00:00, Wed, 14 Jul 2021 14:49:14.591651000 UTC +00:00],
 ["bb064yd9457", 7, 3, "ibm_us_south", "ok", ".zip", 1, 41795, Wed, 06 Mar 2019 01:24:51.169043000 UTC +00:00, Wed, 14 Jul 2021 16:22:10.438302000 UTC +00:00],
 ["bb064yd9457", 7, 4, "aws_s3_east_1", "ok", ".zip", 1, 46983, Wed, 18 Mar 2020 21:19:03.652470000 UTC +00:00, Wed, 14 Jul 2021 19:27:19.994165000 UTC +00:00],
 ["bb064yd9457", 7, 4, "aws_s3_west_2", "ok", ".zip", 1, 46983, Wed, 18 Mar 2020 21:19:03.643244000 UTC +00:00, Wed, 14 Jul 2021 14:49:14.721132000 UTC +00:00],
 ["bb064yd9457", 7, 4, "ibm_us_south", "ok", ".zip", 1, 46983, Wed, 18 Mar 2020 21:19:03.657291000 UTC +00:00, Wed, 14 Jul 2021 16:22:10.603418000 UTC +00:00],
 ["bb064yd9457", 7, 6, "aws_s3_east_1", "ok", ".zip", 1, 67265, Mon, 15 Jun 2020 18:24:44.885213000 UTC +00:00, Wed, 14 Jul 2021 19:27:20.242196000 UTC +00:00],
 ["bb064yd9457", 7, 6, "aws_s3_west_2", "ok", ".zip", 1, 67265, Mon, 15 Jun 2020 18:24:44.855513000 UTC +00:00, Wed, 14 Jul 2021 14:49:14.958757000 UTC +00:00],
 ["bb064yd9457", 7, 6, "ibm_us_south", "ok", ".zip", 1, 67265, Mon, 15 Jun 2020 18:24:44.891207000 UTC +00:00, Wed, 14 Jul 2021 16:22:10.795765000 UTC +00:00],
 ["bb064yd9457", 7, 7, "aws_s3_east_1", "ok", ".zip", 1, 71866, Wed, 01 Jul 2020 20:27:18.585915000 UTC +00:00, Wed, 14 Jul 2021 19:27:20.458724000 UTC +00:00],
 ["bb064yd9457", 7, 7, "aws_s3_west_2", "ok", ".zip", 1, 71866, Wed, 01 Jul 2020 20:27:18.578056000 UTC +00:00, Wed, 14 Jul 2021 14:49:15.120266000 UTC +00:00],
 ["bb064yd9457", 7, 7, "ibm_us_south", "ok", ".zip", 1, 71866, Wed, 01 Jul 2020 20:27:18.592453000 UTC +00:00, Wed, 14 Jul 2021 16:22:10.958016000 UTC +00:00]]

This is good! My manual audit said version 5 was missing from all endpoints, and PresCatDB agrees with me!

This is bad! PresCatDB knows there are missing druid-versions on cloud endpoints, but we have no current check that surfaces this info!

I don't know where this would fit into our existing audits (m2c maybe? or c2a?) but I think we should definitely be checking for this info and doing something about it.

EDIT: It looks like part of the problem is that the ZMV records make it look like PresCat made and distributed copies to the cloud, but they didn't actually get there.

[5] pry(main)> ZippedMoabVersion.by_druid('bb064yd9457').where(version: 5)
=> [#<ZippedMoabVersion:0x0000000006485548 id: 20706645, version: 5, zip_endpoint_id: 1, created_at: Wed, 29 Apr 2020 20:33:43.666173000 UTC +00:00, updated_at: Wed, 29 Apr 2020 20:33:43.666173000 UTC +00:00, preserved_object_id: 249623>,
 #<ZippedMoabVersion:0x00000000063ef2c8 id: 20706646, version: 5, zip_endpoint_id: 4, created_at: Wed, 29 Apr 2020 20:33:43.667806000 UTC +00:00, updated_at: Wed, 29 Apr 2020 20:33:43.667806000 UTC +00:00, preserved_object_id: 249623>,
 #<ZippedMoabVersion:0x00000000063ef200 id: 20706647, version: 5, zip_endpoint_id: 3, created_at: Wed, 29 Apr 2020 20:33:43.669177000 UTC +00:00, updated_at: Wed, 29 Apr 2020 20:33:43.669177000 UTC +00:00, preserved_object_id: 249623>]
jmartin-sul commented 3 years ago

so we actually check for this problem already: https://github.com/sul-dlss/preservation_catalog/blob/5df587a3476ec86742b62be11fc47796640bd6b8/app/lib/audit/catalog_to_archive.rb#L13-L16

but it appears that the logger is the only error reporting that captures that alert. these search results indicate that we specifically don't report this error code to e.g. honeybadger: https://github.com/sul-dlss/preservation_catalog/search?q=ZIP_PARTS_NOT_CREATED

that'd be easy enough to fix. i think we omitted that result code from louder reporting because we were worried about false positives from C2A running while some particular object was replicating, esp in the initial days of running the service, when we were backfilling for the 1.5 million or so objects (iirc) we had at that point. C2A was one of the more rushed (and is one of the least revisited) features.

i think C2A is the appropriate home for this check.

the thing to do with this moab and others like it is to delete the wedged zipped moab versions that have been left with no zip part children, and then just call PreservedObject#create_zipped_moab_versions! for the druid.

i would be hesitant to delete all ZMVs that have no zip parts, but if we queried and limited to those older than some threshold, i think we'd be fine. we could even query the replication queues, which are rarely very deep, to make sure any druids we catch are not in the process of being replicated. actually... that general idea might go some way towards getting us the bulk process desired by this ticket.
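
a rough sketch of that idea (not a vetted procedure; it assumes the associations used in the orphaned-ZMV query later in this thread, picks an arbitrary one-week threshold, and skips the suggested check of the replication queues):

cutoff = 1.week.ago
ZippedMoabVersion
  .left_outer_joins(:zip_parts)
  .where(zip_parts: { id: nil })                         # ZMVs with no zip part children
  .where('zipped_moab_versions.created_at < ?', cutoff)  # old enough that nothing should be in flight
  .find_each do |zmv|
    preserved_object = zmv.preserved_object
    zmv.destroy                                          # drop the wedged ZMV record
    preserved_object.create_zipped_moab_versions!        # backfill it, re-queueing replication
  end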

now that we have most of our content replicated, we might also consider turning on the flag to backfill unreplicated cloud copies whenever C2A comes across a druid with missing ZMVs (though that wasn't bb064yd9457's situation).

jmartin-sul commented 3 years ago

if we want a reusable process with some documentation or a rake task or whatever, i'd guess it's like 2 - 3 days of developer work, and however much ops work it is to delete from AWS a list of zip parts from partially replicated versions. without thinking a ton about it, i think the dev tasks are:

  1. build a list of un-replicated, partially replicated, and mis-replicated druids. besides your list on the ticket, there are some useful queries in the db readme
  2. write some code that, given a druid, will do some cleanup. like... delete zipped_moab_versions and child zip_parts on the druid if the zip_parts don't add up to at least the moab version directory size. use a find_each and feed that method a list of druids from step 1 (maybe filtered for stuff that hasn't been touched in at least a week to make sure nothing that's in flight is messed with). see the sketch just after this list.
  3. give ops the list of druid/version/endpoint combos that had zip parts deleted, ops deletes those zip parts from S3 buckets, if they exist.
  4. feed that list of druids to PreservedObject#create_zipped_moab_versions! to get the replication pipeline to (re-)push the missing zips to S3.
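
a loose sketch of step 2, just to make the shape concrete (hypothetical: moab_version_size stands in for however the on-disk size of the moab version would actually be looked up, the size column on zip_parts is assumed from the debug output above, and the one-week guard is arbitrary):

def prune_undersized_replication(druid, older_than: 1.week.ago)
  preserved_object = PreservedObject.find_by(druid: druid)
  preserved_object.zipped_moab_versions
                  .where('zipped_moab_versions.updated_at < ?', older_than) # leave in-flight work alone
                  .find_each do |zmv|
    # keep the ZMV if its parts account for at least the size of the moab version directory
    next if zmv.zip_parts.sum(:size) >= moab_version_size(druid, zmv.version) # hypothetical helper
    zmv.zip_parts.destroy_all
    zmv.destroy
  end
end

the list from step 1 could then be fed through something like that druid by druid, recording which druid/version/endpoint combos got pruned so ops has the list for step 3.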

i think we want a re-usable process with some documentation or code, because even though things have gotten more reliable, i think this will come up occasionally, and it'd be great to have a well documented procedure that the FR can apply without having to load too much into (human) working memory.

julianmorley commented 3 years ago

Quick raw DB check

pres=# select count(distinct zipped_moab_version_id) from zip_parts ;
count
----------
23,145,694

pres=# select count(id) from zipped_moab_versions ;
  count
----------
 23,153,112

That's 7,412 ZMVs that do not have any zip_parts, which is unfortunately way lower than I'd hoped to find since I know there's ~27K missing druid-version zips from across all the endpoints. So we must have a bunch of records where PresCat thinks parts exist in the cloud that do not, in fact, exist.

For 2. write some code that, given a druid, will do some cleanup. Given that I can generate endpoint-specific druid-versions that need attention, I'd love for this to be a way to batch process specific druid-version-endpoint triples. I don't particularly want to blow away and re-process the entire druid, not least because that will mean deleting known-good zips from cloud endpoints to get it to work.

jmartin-sul commented 3 years ago

That's 7,412 ZMVs that do not have any zip_parts, which is unfortunately way lower than I'd hoped to find since I know there's ~27K missing druid-version zips from across all the endpoints. So we must have a bunch of records where PresCat thinks parts exist in the cloud that do not, in fact, exist.

that's possible, but i wouldn't necessarily draw that conclusion solely from those data points. for instance, how many PreservedObjects don't have ZMVs at all? hopefully a significant portion of the 27K missing druid-versions, because that's the easiest situation to remediate (in fact, cloud audit currently checks for that, and we could try enabling the long dormant setting to auto-backfill any missing ZMV, since we've done the initial push to all endpoints manually, and don't afaik expect to add a new endpoint anytime soon).
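
for reference, a sketch of that "no ZMVs at all" check (assuming PreservedObject has_many :zipped_moab_versions, as create_zipped_moab_versions! implies):

PreservedObject.left_outer_joins(:zipped_moab_versions)
               .where(zipped_moab_versions: { id: nil })
               .count  # preserved objects with no ZMV rows for any endpoint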

another scenario: there is a zip part record, but audit has noticed that the copy either never made it to the cloud or was erroneously recorded as replicated, so the part is now listed as not_found. there are apparently 498 zip parts in this state (this, plus my anecdotal experience manually running cloud audit and then manually running its constituent components to get more detailed info, indicates to me that cloud audit does in fact detect the case where there's a zip part record but no actual S3 copy).
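
presumably something along these lines is behind that 498 figure (assuming not_found is one of the ZipPart status values, per the audit behavior just described):

ZipPart.where(status: 'not_found').count  # zip part records whose cloud copy audit couldn't find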

another, though this one might not show up as anomalous if your auditing doesn't incorporate size expectations or part numbering contiguousness or something: a multi-part ZMV was partially replicated, leading to some but not all zip parts being present for a ZMV. i don't think the above queries would detect that, because select count(distinct zipped_moab_version_id) from zip_parts wouldn't differentiate partially from fully replicated. audit doesn't look directly for this (yet, i ticketed it, see #1742). but it does check that all ZMVs that have parts have the same number of parts, and alerts if not, though currently there's no place to store this info in the DB (ZMV doesn't have a status field). this actually does often detect the issue, because it's unusual that the exact same number of parts fails to upload to all three endpoints, unless there's a long lasting network outage.
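
a sketch of a catalog-side check for that partial-replication case (assuming the parts_count column shown in the debug output above records the expected number of parts for each zip):

ZipPart.group(:zipped_moab_version_id)
       .having('count(zip_parts.id) <> max(zip_parts.parts_count)')
       .pluck(:zipped_moab_version_id)
       .size  # ZMVs whose stored parts don't match the expected part count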

there are some queries to get info like this in the DB readme:

sorry, i should've included those along with the wiki links earlier.

i've been meaning to try to post results of those queries for a while but hadn't gotten around to it. will try to do that now...

For 2. write some code that, given a druid, will do some cleanup. Given that I can generate endpoint-specific druid-versions that need attention, I'd love for this to be a way to batch process specific druid-version-endpoint triples. I don't particularly want to blow away and re-process the entire druid, not least because that will mean deleting known-good zips from cloud endpoints to get it to work.

yes, that was what i was imagining, and point 3 was meant to address that. i've amended point 3 to include the endpoint qualification. we absolutely don't want to delete anything we don't have to, for maximum archiviness (the bandwidth minimization is a nice bonus, but not nearly as important as not touching anything we don't have to, imho). i don't think we can avoid blowing away parts for partially replicated zips, because you can't just re-gen the missing parts, but we def won't be deleting known good ZMVs (either the catalog records or the actual cloud copies they track).

jmartin-sul commented 3 years ago

here's a file with the results of those queries from db/README.md: pres cat replication shortfall queries 2021-08-20.md

it finds just under 22000 problematic or missing ZMVs. so, still doesn't quite line up with your count of 27K missing druid-version-endpoint combos, but i think that's encouragingly close. i haven't had time to look at the lists you've posted so far.

also, could you post the script you're using to figure out what's missing? would be good to reconcile our approaches. even if we continue auditing both in pres cat and outside of it (not a bad idea for preservation), it'd be good to shore up each of our approaches in case one or both is missing logic it could use from the other.

julianmorley commented 3 years ago

That's cool! The full count was

    9491 sul-sdr-aws-us-east-1-archive-missing.log
    9500 sul-sdr-aws-us-west-2-archive-missing.log
    9370 sul-sdr-ibm-us-south-1-prod-missing.log
   28361 total

So, 28K. :(

The script is actually a set of bash scripts, and it's quite complex. But for this particular case, what it does is:

julianmorley commented 3 years ago

But WAIT! If you ADD my "I have NO zip parts" query to your "I have SOME zip parts" query, you get ... 28,436.

Co-incidence? I THINK NOT.

jmartin-sul commented 2 years ago

no zip parts in query form:

zmvs_with_no_zip_parts = ZippedMoabVersion.joins(:preserved_object, :zip_endpoint).left_outer_joins(:zip_parts).group(:druid, :endpoint_name, :version).having('count(zip_parts.id) = 0').order(:druid, :version, :endpoint_name).pluck(:druid, :version, :endpoint_name) ; nil

7427 results on prod as of this morning, which i believe aligns with the totals that julian and i got through different means? if so, 🎉

current results attached.

see also #1750 (used that query for testing that rake task), and #1754 (in which i add the query to the readme with the other useful audit queries)

pres cat replication shortfall, orphaned ZMVs, 2021-09-28.md

jmartin-sul commented 2 years ago

ah yes, and @julianmorley did specifically say before:

That's 7,412 ZMVs that do not have any zip_parts

possible the 15 ZMV discrepancy is due to accumulating a few more mis-replications in the last few weeks? 🤷

julianmorley commented 2 years ago

Checking in on this and I think we've not quite cracked it.

Using @jmartin-sul's check for empty ZMVs:

Loading production environment (Rails 6.1.4.1)
[1] pry(main)> zmvs_with_no_zip_parts = ZippedMoabVersion.joins(:preserved_object, :zip_endpoint).left_outer_joins(:zip_parts).group(:druid, :endpoint_name, :version).having('count(zip_parts.id) = 0').order(:druid, :version, :endpoint_name).pluck(:druid, :version, :endpoint_name) ; nil

=> nil

...which is good, I think, but then I spot-checked the list of missing ZMVs and found that those parts are still not replicated to the cloud. So we need to remediate them and get them onto the various endpoints.

EDIT: to add, I think this 'just' means identifying the druids that now have missing ZMV and running CompleteMoab.by_druid('dc156hp0190').find_each(&:create_zipped_moab_versions!) against them.
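
A sketch of that in batch form (the file name is hypothetical; it assumes one druid per line):

File.readlines('missing_zmv_druids.txt', chomp: true).each do |druid|
  CompleteMoab.by_druid(druid).find_each(&:create_zipped_moab_versions!)
end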

jmartin-sul commented 2 years ago

note i left in slack before winter break that i apparently forgot to leave here:

i think the result that @julianmorley got where he ran my query and got a nil result is misleading -- i tacked a ; nil onto that line of ruby so the console wouldn't dump thousands of plucked results to the screen; the query still runs, but all pry prints is nil. i suspect that tacking a .count onto the variable defined in that line of ruby would give the expected (30k-ish) count.
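
e.g.:

zmvs_with_no_zip_parts.count  # prints the number of [druid, version, endpoint_name] triples, rather than the misleading nil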

NOTE: i am totally happy to pair with any first-responder that has time to pick this up. if you don't want to read through this whole thread, i can skim it, refresh myself, find whatever the next task is for this, write a small self-contained ticket for that, and put that on the prod priorities board (so whoever's FR won't have to read through the whole thread and figure out where things are).

i think the basic next steps are

jmartin-sul commented 2 years ago

yesterday, i cleaned up a number of batches of the situation where ZMVs exist with no zip parts, using the following code from rails console on prod-02:

# all [druid, version, endpoint_name] triples for ZMVs with no child zip parts (the trailing ; nil keeps pry from printing the full list)
zmvs_with_no_zip_parts = ZippedMoabVersion.joins(:preserved_object, :zip_endpoint).left_outer_joins(:zip_parts).group(:druid, :endpoint_name, :version).having('count(zip_parts.id) = 0').order(:druid, :version, :endpoint_name).pluck(:druid, :version, :endpoint_name) ; nil
# work through the list in batches of 1,000
zmvs_with_no_zip_parts_first_1000 = zmvs_with_no_zip_parts.slice(0, 1000)
# prune the catalog records for the failed replications...
zmvs_with_no_zip_parts_first_1000.map { |druid, version, _endpoint_name| CatalogRemediator.prune_replication_failures(druid: druid, version: version) }
# ...then re-create the missing ZMVs so the replication pipeline re-pushes them
zmvs_with_no_zip_parts_first_1000.map { |druid, _version, _endpoint_name| PreservedObject.find_by(druid: druid).create_zipped_moab_versions! }

based on spot-checking some of the cleaned-up druids (by running replication audit on them) and keeping an eye on the resque console while replication jobs were running, it seems like this approach successfully remediated a couple thousand druids that were partially or wholly unreplicated.

will post log files with more detail, to allow more thorough auditing in the future if need be.

ndushay commented 1 year ago

@justinlittman I think you have done this, or are doing it under the guise of #2081. True? If so, please close this.

ndushay commented 1 year ago

I'm closing this as I think we now have a documented process for doing bulk remediation of PartReplicationAuditJob errors. If that's wrong, @justinlittman, please either address the problem or write a ticket for what's left.