investigate zip-making/delivery failures from ActiveJob misconfiguration

jmartin-sul commented 4 years ago

the rails 6 upgrade (#1270) inadvertently changed pres cat's queue adapter from :resque to rails' default (:async) adapter.
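
for reference, pinning the adapter explicitly is a one-line setting. a minimal sketch, assuming the setting lives in `config/application.rb` and that the application module is named `PreservationCatalog` (both are assumptions; this thread doesn't show the actual file):

```ruby
# config/application.rb (assumed location)
module PreservationCatalog # assumed module name
  class Application < Rails::Application
    # without an explicit setting, ActiveJob falls back to its default :async adapter
    config.active_job.queue_adapter = :resque
  end
end
```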

this led to two problems:

1. instead of resque pool managing the queue, which would result in workers on prod machines -02 through -04 picking up jobs, the jobs were worked by "an in-process thread pool" on the boxes on which they were enqueued. in practice, since prod-01 is the web interface for the robots, this means that prod-01 attempted to do all the work that gets triggered by the robots telling pres cat that a DOR object has been accessioned or versioned.
   - since prod-01 doesn't have write permission to the temp space where zip files are created (to be sent to cloud endpoints by delivery workers), zip making failed, and objects that were created or versioned while that bug was live were not successfully archived.
2. had zip creation been successful, the workers that would send the zip files to cloud endpoints would not have had the appropriate credentials to find and write to their S3 buckets, because said credentials are provisioned by feeding env vars to the resque pool start commands (but jobs were not being worked by workers that were managed by resque pool).
   - in practice, this may not have been an issue, because of the problem making zips. if zips couldn't be made, subsequent workers in the pipeline would have no work to pick up.
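
both failure modes come down to a job running on a host that lacks something the resque pool workers normally have. a hypothetical preflight check along these lines could surface that before any work is attempted. this is only a sketch: the method name and the `EXAMPLE_ENDPOINT_KEY` env var name are made up for illustration, and pres cat is not known to run a check like this:

```ruby
require 'tmpdir'

# Hypothetical preflight check: returns a list of human-readable problems,
# empty when the host looks able to make and deliver zips.
def replication_preflight_problems(zip_dir:, required_env_vars:, env: ENV)
  problems = []
  unless File.directory?(zip_dir) && File.writable?(zip_dir)
    problems << "zip temp space not writable: #{zip_dir}"
  end
  required_env_vars.each do |name|
    # treat unset and empty values the same way
    problems << "missing env var: #{name}" if env[name].to_s.empty?
  end
  problems
end

# usage with made-up values: a writable temp dir, but no credentials in env
Dir.mktmpdir do |dir|
  puts replication_preflight_problems(
    zip_dir: dir,
    required_env_vars: ['EXAMPLE_ENDPOINT_KEY'], # hypothetical name
    env: {}
  ).inspect
end
```

passing `env:` explicitly keeps the check easy to exercise without touching the real process environment.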

see https://github.com/sul-dlss/preservation_catalog/blob/master/app/jobs/README.md for an illustration of the replication pipeline.

we should just find all of the PreservedObjects/CompleteMoabs which aren't yet properly replicated, and trigger zip making for them. @julianmorley has an audit script that hits S3 to do this, but we could likely also identify such objects by querying pres cat.

jmartin-sul commented 4 years ago

so far i have looked for:

complete moabs without zipped moab versions

```
-- ended up being faster for me to write this as a plain sql query, because i was pressed for time, and i'm less facile w/ AR than plain SQL

SELECT *
FROM "complete_moabs"
WHERE NOT EXISTS (
  SELECT 1
  FROM "zipped_moab_versions"
  WHERE (complete_moabs.version = zipped_moab_versions.version)
  LIMIT 1
)
-- LIMIT 1

-- no results as of 2019-12-20:
--   makes sense, because AR hooks automatically call `create_zipped_moab_versions!` on create and update of CompleteMoab.
--   it's an `after_create` hook on ZMV that triggers replication work
```

zipped moab versions without zip parts

```
-- ended up being faster for me to write this as a plain sql query, because i was pressed for time, and i'm less facile w/ AR than plain SQL

SELECT *
FROM "zipped_moab_versions"
WHERE NOT EXISTS (
  SELECT 1
  FROM "zip_parts"
  WHERE zip_parts.id = zipped_moab_versions.id
  LIMIT 1
)
-- LIMIT 1

-- lots of results as of 2019-12-20
```
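
one thing a reviewer might look at in the query above is the correlation column. `zip_parts.id = zipped_moab_versions.id` compares the two tables' primary keys; if zip_parts instead points at its parent through a foreign key (assumed here to be `zipped_moab_version_id`, which this thread doesn't confirm), the two conditions can select different rows. a toy, in-memory sketch of the difference, with made-up data:

```ruby
# Plain-Ruby illustration of an anti-join: keep each parent row that has
# no child row pointing at it via the given key.
def rows_without_children(parents, children, key:)
  referenced_ids = children.map { |child| child[key] }.uniq
  parents.reject { |parent| referenced_ids.include?(parent[:id]) }
end

# made-up data: ZMV 3 has two parts, ZMVs 1 and 2 have none
zmvs = [{ id: 1 }, { id: 2 }, { id: 3 }]
zip_parts = [
  { id: 1, zipped_moab_version_id: 3 }, # assumed foreign key name
  { id: 2, zipped_moab_version_id: 3 }
]

by_foreign_key = rows_without_children(zmvs, zip_parts, key: :zipped_moab_version_id)
by_primary_key = rows_without_children(zmvs, zip_parts, key: :id)

puts by_foreign_key.map { |row| row[:id] }.inspect # ZMVs that truly have no parts
puts by_primary_key.map { |row| row[:id] }.inspect # what matching on zip_parts.id reports
```

with this data the foreign key version reports ZMVs 1 and 2, while the primary key version reports only ZMV 3, which is the one that does have parts.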

zip parts that are unreplicated
```
ZipPart.unreplicated.count
# SELECT COUNT(*) FROM "zip_parts" WHERE "zip_parts"."status" = $1  [["status", 1]]
# 5911 results as of 2019-12-20
```

once we're back from break, i'll plan to finish up this ticket by:

1. getting someone to code review my queries to make sure they're sensible
2. identifying druids associated with the second two queries
3. manually pushing those druids through replication
4. triggering C2A for the entirety of the catalog, so that it can run its more in-depth checks, in which it looks for part count mismatches, pings the S3 buckets to check the presence of supposedly archived parts, etc.

i can also try to write ActiveRecord versions of the first two queries, if people would find that useful. otherwise, i'll likely dump the IDs they generate to a file, and work off of that for the remediation steps i've listed above.

julianmorley commented 4 years ago

It's 5907 now ... but also as far as I can tell, there's nowhere near that many zip parts missing from all the s3 endpoints. In fact, it's pretty much zero. So I think there's an issue here with PresCat not correctly detecting valid zip parts on endpoints (or holding onto bad data).

Edit: Yeah, that's exactly what's happening. Spot-checking some of these 'status' = '1' zip_parts and I'm seeing them all on endpoints. Looks like zip_parts either didn't get correctly updated after a successful upload, or an audit that finds/fixes these parts didn't work right.
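
A rough sketch of that spot check, assuming we already have two lists in hand: the keys PresCat still marks unreplicated, and the keys actually listed on an endpoint. The method name and the example keys are made up for illustration; this isn't the audit script mentioned earlier.

```ruby
require 'set'

# Hypothetical reconciliation: split the keys the catalog calls unreplicated
# into those that are in fact on the endpoint (stale catalog status) and
# those that really are missing.
def partition_unreplicated(unreplicated_keys, keys_on_endpoint)
  on_endpoint = keys_on_endpoint.to_set
  stale, missing = unreplicated_keys.partition { |key| on_endpoint.include?(key) }
  { stale_catalog_status: stale, actually_missing: missing }
end

# made-up keys
catalog_unreplicated = ['druid-a.v0001.zip', 'druid-b.v0001.zip', 'druid-c.v0002.zip']
endpoint_listing = ['druid-a.v0001.zip', 'druid-b.v0001.zip', 'druid-z.v0001.zip']

result = partition_unreplicated(catalog_unreplicated, endpoint_listing)
puts result.inspect
```

Anything that lands in `stale_catalog_status` is a catalog bookkeeping problem rather than a replication problem, which is what the spot checks suggest for most of these rows.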

ndushay commented 1 year ago

This ticket is now 3 years old.

The new dashboard code plus the CatalogToArchive audit code seems adequate to find replication errors. I'm closing this ticket; if that's the wrong thing to do, please re-open.

investigate zip-making/delivery failures from ActiveJob misconfiguration #1291