sul-dlss / preservation_catalog

Rails application to track, audit and replicate archival artifacts associated with SDR objects.
https://sul-dlss.github.io/preservation_catalog/

investigate zip-making/delivery failures from ActiveJob misconfiguration #1291

Closed jmartin-sul closed 1 year ago

jmartin-sul commented 4 years ago

the rails 6 upgrade (#1270) inadvertently changed pres cat's queue adapter from :resque to rails' default (:async) adapter.

this led to two problems:

see https://github.com/sul-dlss/preservation_catalog/blob/master/app/jobs/README.md for an illustration of the replication pipeline.

we should just find all of the PreservedObjects/CompleteMoabs which aren't yet properly replicated, and trigger zip making for them. @julianmorley has an audit script that hits S3 to do this, but we could likely also identify such objects by querying pres cat.
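For reference, the adapter regression described above is the kind of thing that pinning the adapter explicitly guards against, since Rails falls back to `:async` whenever no adapter is configured. The following is a minimal sketch, assuming the setting lives in `config/application.rb`; the module name is a placeholder, and the actual fix applied in pres cat may look different.

```ruby
# config/application.rb (sketch; the module name below is a placeholder)
module PreservationCatalog
  class Application < Rails::Application
    # Pin the ActiveJob backend explicitly. If this line is absent,
    # Rails uses its default in-process :async adapter.
    config.active_job.queue_adapter = :resque
  end
end
```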

jmartin-sul commented 4 years ago

so far i have looked for:

SELECT *
FROM "complete_moabs"
WHERE NOT EXISTS (
  SELECT 1
  FROM "zipped_moab_versions"
  WHERE (complete_moabs.version = zipped_moab_versions.version)
  LIMIT 1
)
# LIMIT 1

# no results as of 2019-12-20:
#  makes sense, because AR hooks automatically call `create_zipped_moab_versions!` on create and update of CompleteMoab.
#  it's an `after_create` hook on ZMV that triggers replication work

SELECT *
FROM "zipped_moab_versions"
WHERE NOT EXISTS (
  SELECT 1
  FROM "zip_parts"
  WHERE zip_parts.id = zipped_moab_versions.id
  LIMIT 1
)
# LIMIT 1

# lots of results as of 2019-12-20

SELECT COUNT(*) FROM "zip_parts" WHERE "zip_parts"."status" = $1  [["status", 1]]

# 5911 results as of 2019-12-20

once we're back from break, i'll plan to finish up this ticket by:

i can also try to write ActiveRecord versions of the first two queries, if people would find that useful. otherwise, i'll likely dump the IDs they generate to a file, and work off of that for the remediation steps i've listed above.
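As a rough idea of what ActiveRecord versions of these queries might look like, here is a sketch meant for a Rails console in pres cat. It assumes the `CompleteMoab`, `ZippedMoabVersion` and `ZipPart` models are loaded and that the association names `:zipped_moab_versions` and `:zip_parts` exist. It also joins through those associations (that is, on the foreign keys), which is an assumption about the intent of the SQL above rather than a literal translation of it. The method names are hypothetical.

```ruby
# Sketch only: run in a Rails console where the pres cat models are loaded.
# Association names (:zipped_moab_versions, :zip_parts) are assumptions.

# CompleteMoabs that have no ZippedMoabVersion rows at all.
def complete_moabs_without_zmvs
  CompleteMoab
    .left_outer_joins(:zipped_moab_versions)
    .where(zipped_moab_versions: { id: nil })
end

# ZippedMoabVersions that have no ZipPart rows at all.
def zmvs_without_zip_parts
  ZippedMoabVersion
    .left_outer_joins(:zip_parts)
    .where(zip_parts: { id: nil })
end

# Count of zip parts whose integer status column is 1, as in the third query.
def status_one_zip_part_count
  ZipPart.where(status: 1).count
end
```

Each of the first two methods returns a relation, so something like `zmvs_without_zip_parts.pluck(:id)` would give the IDs to dump to a file.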

julianmorley commented 4 years ago

It's 5907 now ... but also as far as I can tell, there's nowhere near that many zip parts missing from all the s3 endpoints. In fact, it's pretty much zero. So I think there's an issue here with PresCat not correctly detecting valid zip parts on endpoints (or holding onto bad data).

Edit: Yeah, that's exactly what's happening. Spot-checking some of these 'status' = '1' zip_parts, I'm seeing them all on endpoints. Looks like zip_parts either didn't get correctly updated after a successful upload, or an audit that should find and fix these parts didn't work right.
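The spot check described above amounts to comparing what the catalog believes is unreplicated against what is actually on an endpoint. A minimal sketch of that comparison follows, in plain Ruby with no pres cat or S3 calls. The helper name and the keys are made up for illustration, and how the two key lists would be obtained is left out.

```ruby
require "set"

# Hypothetical helper, not part of pres cat: splits the zip part keys that
# the catalog marks as unreplicated into those actually present on an
# endpoint (stale catalog status) and those genuinely missing.
def partition_unreplicated(unreplicated_keys, keys_on_endpoint)
  present = keys_on_endpoint.to_set
  stale, missing = unreplicated_keys.partition { |key| present.include?(key) }
  { stale_status: stale, missing: missing }
end

# Made-up example data.
catalog_unreplicated = ["a.zip", "b.zip", "c.zip"]
endpoint_listing     = ["a.zip", "c.zip", "z.zip"]

result = partition_unreplicated(catalog_unreplicated, endpoint_listing)
# result[:stale_status] holds keys whose catalog status needs correcting;
# result[:missing] holds keys that really do need to be re-replicated.
```

If nearly everything lands in `:stale_status`, as the spot check suggests, the remediation is a status fix in the catalog rather than a re-upload.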

ndushay commented 1 year ago

This ticket is now 3 years old.

The new dashboard code plus the CatalogToArchive audit code seems adequate to find replication errors. I'm closing this ticket; if that's the wrong thing to do, please re-open.