sul-dlss / preservation_catalog

Rails application to track, audit and replicate archival artifacts associated with SDR objects.
https://sul-dlss.github.io/preservation_catalog/

Create Process to Clear Audit Errors [aids storage migration] #1325

Closed ndushay closed 4 years ago

ndushay commented 4 years ago

(perhaps this will not be needed as part of the 2020 storage migration MVP?)

possibly this should spin off separate tickets for different types of errors; possibly we should make it easy to do analysis or repairs with rake tasks or whatever for common error types (e.g. version problems, or checksums: compute the checksum 3 times and, if all 3 agree, use the new value?).
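The "compute it 3 times" idea could be sketched roughly as follows. This is a hypothetical illustration only, not preservation_catalog code: the method name `stable_checksum` and the choice of MD5 are assumptions for the example.

```ruby
require 'digest'

# Hypothetical sketch: recompute a file's checksum several times and only
# accept the value when every run agrees, guarding against transient I/O
# flakiness. Returns nil when the runs disagree (i.e. trust no value).
def stable_checksum(path, runs: 3)
  digests = Array.new(runs) { Digest::MD5.file(path).hexdigest }
  digests.uniq.one? ? digests.first : nil
end
```

A rake task for a common error type could then call something like `stable_checksum(moab_file_path)` and only record the result when it is non-nil.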

If the process to fix requires a human, can we document the process needed? If the process to fix can be automated, can we write the code, or at least make a ticket to write the code?

not clear if needed as part of migration

incestuous relationship with #1324?

jmartin-sul commented 4 years ago

self-assigning since this feels like it goes with #1324's analysis work, which is in progress.

jmartin-sul commented 4 years ago

possibly this should spin off separate tickets for diff types of errors

I think I'm going to do that, and close this one, because...

possibly we should make it easy to do analysis or repairs with rake tasks or whatever for common error types.

Since we expect checksum errors to be rare and unique relative to the size of the whole catalog, I think we'll need to remediate a number of different types of checksum errors to see if common patterns emerge, and whether any of those are amenable to automation. Until then (as discussed when clarifying some of this migration work last week w/ @ndushay), I think we'll start by documenting what we do to remediate the checksum errors we have, and by capturing formal docs outside of issues when we expect a procedure to be repeated in the future (this might happen e.g. if we create valid stub moabs to address decommissioned objects as discussed in #1192).

As a middle ground between "remediation that we can run on demand" and "no automation", I think it's likely that some remediations covering e.g. hundreds of objects will have to be scripted to some degree as one-offs (see discussion of some of the cases on #1324).

(e.g. version problems, or checksums (e.g. compute it 3 times, if all 3 agree, then use the new value?), etc.)

Unless I'm missing something, I'd be opposed to that approach to resolving checksum errors once the object is in a moab that we assume to be a valid preserved copy matching what was accessioned into DOR. I think this would be a reasonable approach when initially calculating a trustworthy checksum, as when creating a bagit bag or a moab (and we did just that when we suspected that common-accessioning was occasionally producing bad checksums on content it was ingesting late last year).

But once the object has been preserved, I think we should start from the perspective of assuming that checksum mismatches indicate content corruption, and human judgement (and maybe some file system/log/issue/etc. archaeology) should determine how we proceed with fixing the object. Otherwise the checksum strikes me as a formality instead of a meaningful safeguard.

Until we have a number of remediations like that under our belt, I'd be super hesitant to automate remediation of a checksum error that didn't already fit a very limited and well understood set of constraints (e.g. there may be 500 or so moabs that had extraneous versionMetadata.xml files added after the signature catalog and other manifests were generated; we may just be able to delete those extraneous late adds, which we'd likely automate to some degree).
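To make the distinction concrete ("mismatch means possible corruption, not a value to overwrite"), here's a minimal hypothetical sketch. The method name, the `:needs_human_review` status, and the use of MD5 are assumptions for illustration, not pcat's actual audit code.

```ruby
require 'digest'

# Hypothetical sketch: audit a preserved file against its recorded
# checksum. On mismatch we flag the object for human triage rather than
# "repairing" the catalog with the freshly computed value.
def audit_preserved_file(path, recorded_md5)
  actual = Digest::MD5.file(path).hexdigest
  actual == recorded_md5 ? :ok : :needs_human_review
end
```

The key design point is that the recorded checksum is treated as the source of truth for a preserved object; the freshly computed value is only evidence about the file's current state.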

If the process to fix requires a human, can we document the process needed?

That's definitely the plan!