LynnMcRae opened this issue 7 years ago
One thought I had for how to handle fixity results in the PCC is to have a 'last checked' list of druid + timestamp. If a druid is not in the list, or if its timestamp is too old (a configurable TTL), it's eligible for a new fixity check. So when the audit process increments through the list of online copies in the inventory, one of its tasks would be to check against this list to see if the online copy is due for a fixity check. Fixity checking of archive copies would be trickier.
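The 'last checked' idea above could be sketched roughly like this. This is illustrative only; the function and constant names are assumptions, not anything from our codebase:

```python
import time

# Configurable TTL for fixity checks; 90 days is an arbitrary example value.
FIXITY_TTL_SECONDS = 90 * 24 * 60 * 60

def due_for_fixity_check(druid, last_checked, now=None):
    """Return True if the druid is missing from the 'last checked' map
    (druid -> epoch timestamp) or its timestamp is older than the TTL."""
    now = now if now is not None else time.time()
    timestamp = last_checked.get(druid)
    return timestamp is None or (now - timestamp) > FIXITY_TTL_SECONDS
```

The audit process would call something like this for each online copy as it walks the inventory, queueing a fixity check whenever it returns True.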
I'll do a separate story on the overall management and scheduling of the audit processes -- frequency, regularity, based on activity/staleness, etc. The Inventory could carry some of the "last checked" information desired, if that is part of your suggestion.
Some of my original questions for what this catalog should be able to answer:
potential consumers, per discussion
DOR - consumer by the druid: present info to argo users and possibly to argo index (for creating a way to facet on status in prez core)
Repository manager (Ben) may want to query for the state of the objects, or the number of objects in a particular state
And the main consumer would be the audit process.
The central catalog should be able to answer these questions:
This reflects current state only. The provenance db will have the history. It needs to be an accurate inventory (and it will ultimately need to be audited for accuracy, perhaps on an ongoing basis?)
If the PCC data were lost, it would be rebuilt from a live inventory; by contrast, if the provenance data (the history) were lost, it would be restored from backup.
Possible tasks:
Of possible relevance to selecting a database/persistence technology. PC is heavily firewalled with very minimal exposure to the outside world -- it pulls objects from DOR (no one pushes content in), it provides SDR Web Services to DOR/Argo only. We should have something that is wholly contained within the Preservation Core's domain of physical servers. The mention of solr brought this to mind since we'd be wanting to leverage the solr cloud, possibly ruling that out.
Some more datastore notes:
The original design called for many storage servers, each with their own selection of Moab objects to protect and catalog, with a local catalog used for local audit/archive/recovery activities that also reported upstream to a central PCC that collated data from all the nodes for consumption by DOR/external services.
Think of it in terms of each current sdr-services mount existing on its own physical storage server, with its own series of SDR-PC processes for inventory, audit, archive and recovery and its own local shard of the PCC. This would enable horizontal scaling of our storage and allow us to effectively audit a truly large number of druids in parallel.
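One way to picture the sharding described above is a deterministic druid-to-node mapping, so every process agrees on which storage server (and which local PCC shard) owns a given druid. Purely a sketch; the node names and hash-based routing are my assumptions, not a settled design:

```python
import hashlib

# Hypothetical storage server names; each would host its own Moab objects
# and its own local PCC shard.
STORAGE_NODES = ["sdr-store-01", "sdr-store-02", "sdr-store-03"]

def shard_for(druid):
    """Deterministically map a druid to a storage node by hashing it."""
    digest = hashlib.sha256(druid.encode("utf-8")).hexdigest()
    return STORAGE_NODES[int(digest, 16) % len(STORAGE_NODES)]
```

Because the mapping is deterministic, inventory and audit on each node can run in parallel over just that node's druids.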
Two diagrams: The first gives an overall view of the object lifecycle and main components. Some peripheral items (PD, APIs elsewhere) are explicitly not shown to hopefully make this nice and clear. Preservation Core Simple Object Lifecycle.pdf The second diagram is a subset of the first, and shows the scope of this first story. Preservation Core Simple Object Lifecycle - Story 1.pdf
The Inventory process populates the preservation core catalog from the Moab Object Store. The Audit process iterates through items in the preservation core catalog and reconciles them. If it finds problems, Audit tells the Recovery or Archive processes to do work.
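The Audit loop just described could be sketched as follows. The process names mirror the diagram components, but every method name here is an assumption about their interfaces:

```python
def audit(catalog, moab_store, recovery, archive):
    """Iterate the preservation core catalog, reconcile each entry against
    the Moab object store, and dispatch work to Recovery or Archive."""
    for entry in catalog.entries():
        actual = moab_store.lookup(entry.druid)
        if actual is None:
            recovery.restore(entry.druid)        # object missing from storage
        elif actual.version < entry.version:
            recovery.restore(entry.druid)        # stale or damaged online copy
        elif not archive.has_current_copy(entry.druid, entry.version):
            archive.replicate(entry.druid)       # archive copy lags behind
```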
Per today's Naming Of The Things, this is now the Object Inventory Store, one of the three components of the Preservation Core Catalog.
Notes from story time 8/31/17 (per @SaravShah )
This is an operational metadata store, part of the overall Preservation Core Catalog. It has an entry per object (per druid) that exists in Preservation Core. There must be a one-to-one correspondence with items in the Moab object store. It will contain the current state of each object, in order to expose information about objects in Preservation Core to authorized users.
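A minimal sketch of the entry-per-druid record described above. The fields are guesses at the "current state" information we'd want, not a confirmed schema:

```python
from dataclasses import dataclass

@dataclass
class InventoryEntry:
    """One row per druid in the Object Inventory Store (hypothetical fields)."""
    druid: str
    current_version: int
    status: str                # e.g. "ok", "fixity_failed", "missing"
    last_fixity_check: float   # epoch seconds of the most recent check
```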
Desired information (wip)
Part of the overall audit strategy should be to assure the completeness and accuracy of this metadata. Note that while the inventory is a reflection of what's actually out there, and in principle could be reconstructed from the Moab directories and Archive Endpoints, the idea is to maintain this as an active component in identifying and managing Preservation Core, acting as a form of double-entry bookkeeping when auditing for irregularities.
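The double-entry bookkeeping idea amounts to diffing the inventory's view against what is actually on storage, with discrepancies in either direction treated as audit findings. A minimal sketch, assuming we can enumerate druids from both sides:

```python
def reconcile(catalog_druids, storage_druids):
    """Compare the inventory's druid set against the druids actually found
    on storage; anything in one set but not the other is an irregularity."""
    catalog_set, storage_set = set(catalog_druids), set(storage_druids)
    return {
        "missing_from_storage": sorted(catalog_set - storage_set),
        "missing_from_catalog": sorted(storage_set - catalog_set),
    }
```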