jmartin-sul opened 6 years ago
@edsu also observes that there's an AWS facility for getting checksums of a thing that's already stored, which might be another way to address this: https://docs.aws.amazon.com/AmazonS3/latest/userguide/checking-object-integrity.html
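for illustration, something like this (a sketch using the aws-sdk-s3 gem; the bucket/key names are made up, and this only works for objects that were uploaded with a checksum algorithm specified) could fetch the checksum S3 computed on its own side:

```ruby
require "aws-sdk-s3"

# Illustrative sketch only: ask S3 for the SHA-256 it computed for an object,
# assuming the object was uploaded with checksum_algorithm: "SHA256".
# Bucket and key names here are hypothetical.
client = Aws::S3::Client.new(region: "us-west-2")

resp = client.head_object(
  bucket: "example-preservation-bucket",
  key: "example/archive_part.zip",
  checksum_mode: "ENABLED" # include S3's stored checksum in the response
)

puts resp.checksum_sha256 # base64-encoded SHA-256 computed by S3, if present
```

note that S3 only has such a checksum for objects written with one of its supported checksum algorithms, so older objects might need to be re-copied with a checksum algorithm before this works.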
also, possibly a Fargate task that computes checksums within AWS infra and so doesn't incur egress charges.
either way, good to keep on the radar, but likely out of scope for the 2022 maintenance work, which is more about making what's there more maintainable.
currently, our replication audit code compares the checksum we have stored for an archive part with the checksum the S3 provider stores in their metadata for the zip part (see `PreservationCatalog::S3::Audit#compare_checksum_metadata`). however, the checksum stored in AWS metadata is just the one we computed and provided to them, so we're only checking that the metadata hasn't drifted between the two sources. this check is cheap to do, since we're already reaching out to AWS to see whether the archived part is still available from the cloud as expected. but it's also not a super-meaningful check to have pass.
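roughly, that kind of metadata comparison looks like the sketch below (not the actual PreservationCatalog implementation; it assumes our computed MD5 was written to S3 user metadata under a hypothetical `checksum_md5` key at upload time):

```ruby
require "aws-sdk-s3"

# Sketch of the metadata-drift check described above. "checksum_md5" is a
# hypothetical user-metadata key, and stored_md5 stands in for whatever
# checksum our catalog has on record for this zip part.
def checksum_metadata_matches?(client, bucket:, key:, stored_md5:)
  resp = client.head_object(bucket: bucket, key: key)
  remote_md5 = resp.metadata["checksum_md5"] # metadata we supplied at upload
  remote_md5 == stored_md5
end
```

since this is just comparing our value against a copy of our value, it catches metadata drift, not bit rot in the stored content itself.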
more meaningful would be random spot-checks of archive contents for fixity. that is, randomly pull down archived copies every so often, make sure the checksums we recompute for the retrieved parts match the checksums we have stored, and make sure the internal checksums all match the content in the Moab when the zip parts are put back together and re-inflated. we don't want to do that for every zip during the course of regular replication auditing, because that'd be expensive and overkill.
but some occasional retrieval of content and re-computation of checksums would provide extra peace of mind that our replication strategy is working and that the cloud archives will be usable if needed.
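a minimal sketch of what one such spot-check might look like, assuming hypothetical stand-ins for the bucket, key, and the checksum lookup (again, not actual PreservationCatalog code):

```ruby
require "aws-sdk-s3"
require "digest"

# Hedged sketch of a random fixity spot-check: pull one archived zip part
# back down, recompute its checksum, and compare against the checksum our
# catalog has stored. All names here are illustrative assumptions.
def spot_check_fixity(client, bucket:, key:, expected_md5:)
  digest = Digest::MD5.new
  # Stream the object in chunks so a large zip part isn't held in memory.
  client.get_object(bucket: bucket, key: key) do |chunk|
    digest.update(chunk)
  end
  digest.hexdigest == expected_md5
end
```

each such retrieval incurs egress charges, which is part of why this should be occasional sampling rather than something done for every zip on every audit pass.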