sul-dlss / preservation2017

Story repo for preservation core work done summer/fall 2017

Story: Create a Trusted Checksum Repository #3

Open LynnMcRae opened 7 years ago

LynnMcRae commented 7 years ago

This is an operational metadata store, part of the overall Preservation Core Catalog. Its purpose is to securely store sufficient checksum information about all objects in Preservation Core to support a reliable content audit for unexpectedly altered or corrupted content and to effect recovery from other copies in Preservation Core.

The term "Trusted" in the name therefore has semantic implications, as this lies at the heart of maintaining the integrity of preservation objects over time. This may mean isolation from the rest of the PC metadata, with greater protections and stricter access rules.

Current policy requires an MD5, SHA1 and SHA256 checksum set whenever checksums are generated. Revisiting this policy is out of scope for this effort.
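As an illustration of the three-checksum policy, here is a minimal sketch that produces the MD5, SHA1 and SHA256 set for one file in a single read. The function name `checksum_set` and the chunk size are assumptions made for this example, not part of any existing Preservation Core code.

```python
import hashlib


def checksum_set(path, chunk_size=1024 * 1024):
    """Return MD5, SHA1 and SHA256 hex digests for one file, reading it once."""
    digests = {name: hashlib.new(name) for name in ("md5", "sha1", "sha256")}
    with open(path, "rb") as handle:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            for digest in digests.values():
                digest.update(chunk)
    return {name: digest.hexdigest() for name, digest in digests.items()}
```

Feeding all three digests from the same chunk keeps the cost at one pass over the file regardless of how many checksum types the policy requires.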

Online Moab objects: An Online Moab object is a specific hierarchical directory schema. While it can be zipped or bagged for transport, the object itself is not containerized and therefore has no object-wide checksum. Instead, the TCR should contain checksums for every version of the manifestInventory.xml file that is part of the Moab object. [need to explore this end-to-end ... is this the solution for lack of container checksums?]
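To make "checksums for every version of manifestInventory.xml" concrete, here is a rough sketch that walks a Moab directory and digests each version's manifest. It assumes version directories are named like `v0001` and hold the file at `manifests/manifestInventory.xml`; that layout, the function name, and the use of SHA256 alone (rather than the full three-checksum set) are simplifications for illustration.

```python
import hashlib
import re
from pathlib import Path

# Assumed naming convention for version directories: v0001, v0002, ...
VERSION_DIR = re.compile(r"^v(\d+)$")


def manifest_inventory_checksums(moab_root):
    """Map version number -> SHA256 hex digest of that version's manifestInventory.xml."""
    found = {}
    for child in sorted(Path(moab_root).iterdir()):
        match = VERSION_DIR.match(child.name)
        manifest = child / "manifests" / "manifestInventory.xml"
        # Skip anything that is not a version directory with a manifest in it.
        if match and manifest.is_file():
            found[int(match.group(1))] = hashlib.sha256(manifest.read_bytes()).hexdigest()
    return found
```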

Archive Moab objects: When an archive copy of a Moab object is made (i.e., it's made into a BagIt archive or just a plain old zip file of the entire Moab directory), the resulting file name should be stored, along with the checksums for it.

LynnMcRae commented 7 years ago

Is there any role for file-specific checksum information as stored in the signatureManifest.xml file? Or, more broadly, any consideration for file-level vs. whole-object recovery?

julianmorley commented 7 years ago

The TCR should be written to only by the Archive process. When Archive is processing a Moab, it performs a fixity check on the Moab. At the start of that fixity check, the Archive process queries the TCR for the checksums of all versions of that Moab's manifestInventory.xml. If there is no record in the TCR for the Moab's manifestInventory.xml checksums, the Archive process generates them and creates a new record in the TCR. The Archive process can only CREATE and READ records in the TCR. It cannot UPDATE or DELETE records. When the Archive process has bagged a Moab for replication, it generates checksums of the final file and creates the appropriate records in the TCR.
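The create-and-read-only flow described above could be sketched roughly as follows. This is an in-memory stand-in, not a proposed implementation: the class `TrustedChecksumRepository`, the helper `check_or_register`, and the status strings are all hypothetical names chosen for the example.

```python
class TrustedChecksumRepository:
    """In-memory stand-in for the TCR that exposes only CREATE and READ."""

    def __init__(self):
        self._records = {}

    def read(self, druid, version):
        """Return a copy of the stored checksums, or None if there is no record."""
        stored = self._records.get((druid, version))
        return dict(stored) if stored is not None else None

    def create(self, druid, version, checksums):
        """Add a new record; refuse to overwrite, since UPDATE is not allowed."""
        key = (druid, version)
        if key in self._records:
            raise ValueError(f"record already exists for {key}")
        self._records[key] = dict(checksums)


def check_or_register(tcr, druid, version, computed):
    """Archive-side fixity step: register unknown checksums, else compare them."""
    stored = tcr.read(druid, version)
    if stored is None:
        tcr.create(druid, version, computed)
        return "created"
    return "match" if stored == computed else "mismatch"
```

Because the class has no update or delete method at all, a mismatch can only be reported, never silently "fixed" by overwriting the trusted value.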

julianmorley commented 7 years ago

For storing this data in the TCR, consider one table per checksum type and composite primary keys.

e.g.:

table manifestInventoryMd5
columns: druid (char 11), version (char 5), checksum (char 16)
composite PK on (druid, version)

table archiveObjectMd5
columns: druid (char 11), version (char 5), checksum (char 16)
composite PK on (druid, version)

table archiveObjectFilename
columns: druid (char 11), version (char 5), filename (varchar? char 42?)
composite PK on (druid, version)

Having one table per checksum/file combination gives us better scalability and the option to deprecate checksum types without impacting other records.

We should explicitly limit field size for known column types (checksums, druids) since this is part of object verification. There can never be a 12 character druid or a 43 character MD5 checksum.
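A minimal sketch of one such table, using SQLite only because it ships with Python; the actual database is not decided in this thread. SQLite does not enforce `CHAR(n)` lengths, so the sketch uses `CHECK` constraints instead. It also assumes the MD5 is stored as a 32-character hex string and the version as a 5-character string such as `v0001`; both are assumptions of this example, and a raw 16-byte digest would need a different length check.

```python
import sqlite3

# One table per checksum type, keyed on (druid, version).
SCHEMA = """
CREATE TABLE manifestInventoryMd5 (
    druid    TEXT NOT NULL CHECK (length(druid) = 11),
    version  TEXT NOT NULL CHECK (length(version) = 5),
    checksum TEXT NOT NULL CHECK (length(checksum) = 32),  -- assumes hex encoding
    PRIMARY KEY (druid, version)
);
"""


def open_tcr(path=":memory:"):
    """Open a database and create the sketch table in it."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

With these constraints, an over-length druid or a second row for the same druid and version is rejected by the database itself rather than by application code.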

ndushay commented 7 years ago

TCR will have to contain:

Audit process will:

Archive process will:

Ingest process may (perhaps by triggering the archive process):

ndushay commented 7 years ago

from #27:

How much fixity checking do we actually need? How much for Moab object vs. archive object?

Can we trust the internal fixity of a Moab object? That is, if the Moab object's individual file fixity is good, can we just trust the overall checksum of the whole object without having to go after individual files for verification?

Does TCR need to store individual file checksums in Moab? Or just the single checksum series for the entire object?

ndushay commented 7 years ago

What is the point of the individual checksums in the Moab object if we never use them?

ndushay commented 7 years ago

What will the recovery process be if a checksum doesn't validate?

If we can define the recovery process, will that inform what we need to know from the TCR?

ndushay commented 7 years ago

Online Moab: only need to verify checksums for the manifestInventory file against the TCR.

ndushay commented 7 years ago

from sul-dlss/preservation_core_catalog/issues/60

Eventually we will want a "Trusted Checksum Repository" (TCR). But as a start, we can defer thinking about TCR, and just do fixity checks of the online moabs, using the checksums already contained in the moabs being checked.

A "Trusted Checksum Repository" is a place that can store checksums for use in fixity checking of both online moabs and archival copies of moabs. It should allow retrieval of checksums by the audit process (the audit process being whatever code actually executes the fixity check and records the result in the Object Inventory Store and/or Provenance). The TCR should safeguard against corruption and data loss, and should be designed such that a buggy or malicious actor with the ability to damage online moabs or the OIS has little or no avenue for damaging the TCR, so that it can reliably be used for integrity checking and recovery in the event of damage to moabs.
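To illustrate the audit side of this description, here is a small read-only comparison sketch. It takes checksums retrieved from the TCR and checksums observed on disk, both as plain dictionaries keyed by version, and reports a status per version. The function name and the status strings are hypothetical; how results would actually be recorded in the Object Inventory Store or Provenance is not covered here.

```python
def audit_versions(trusted, observed):
    """Compare observed digests to trusted ones, version by version."""
    report = {}
    for version in sorted(set(trusted) | set(observed)):
        if version not in trusted:
            report[version] = "not_in_tcr"       # on disk, but no trusted record
        elif version not in observed:
            report[version] = "missing_on_disk"  # trusted record, nothing to check
        elif trusted[version] == observed[version]:
            report[version] = "ok"
        else:
            report[version] = "mismatch"         # candidate for recovery
    return report
```

The audit only reads from both sides, which fits the goal that code touching online moabs has no path for altering the trusted values.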

I only note the general TCR description here, because I couldn't find a ticket that already articulates that. Feel free to link this ticket to that ticket if such a ticket already exists.

ndushay commented 7 years ago

from sul-dlss/preservation_core_catalog#60

see also: https://www.crl.edu/archiving-preservation/digital-archives/metrics-assessing-and-certifying/iso16363

from the story docs:

Trusted Checksum Repository: