sul-dlss / preservation2017

Story repo for preservation core work done summer/fall 2017

Story: Create an Object Inventory Store #2

Open LynnMcRae opened 7 years ago

LynnMcRae commented 7 years ago

This is an operational metadata store, part of the overall Preservation Core Catalog. It has an entry per object (per druid) that exists in Preservation Core. There must be a one-to-one correspondence with items in the Moab object store. It will contain information about the current state of the object in order to provide information about objects in Preservation Core to authorized users.

Desired information (wip)

Part of the overall audit strategy should be to assure the completeness and accuracy of this metadata. Note that while the inventory is a reflection of what's actually out there, and in principle could be reconstructed from the Moab directories and Archive Endpoints, the idea is to maintain this as an active component in identifying and managing Preservation Core, acting as a form of double-entry bookkeeping when auditing for irregularities.
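As a rough illustration of the double-entry idea, the sketch below compares the druids recorded in the inventory with the druids actually found in the Moab object store and reports any that appear on only one side. The function name, the result keys, and the sample druids are hypothetical; how either list would really be obtained is not decided in this thread.

```python
def reconcile(inventory_druids, moab_druids):
    """Compare two druid listings and report entries present on only one side.

    Hypothetical helper: both arguments are plain iterables of druid strings.
    """
    inventory = set(inventory_druids)
    moab = set(moab_druids)
    return {
        # On disk but never recorded in the inventory.
        "missing_from_inventory": sorted(moab - inventory),
        # Recorded in the inventory but no longer found on disk.
        "missing_from_moab": sorted(inventory - moab),
    }


# Made-up druids, for illustration only.
report = reconcile(
    inventory_druids=["druid:aa111bb2222", "druid:cc333dd4444"],
    moab_druids=["druid:cc333dd4444", "druid:ee555ff6666"],
)
print(report)
```

An empty list on both sides would mean the one-to-one correspondence holds; anything else is an irregularity for the audit to follow up.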

julianmorley commented 7 years ago

One thought I had for how to handle fixity results in the PCC is to have a 'last checked' list of druid + timestamp. If a druid is not in the list, or if the timestamp is too old (a configurable TTL), it's eligible for a new fixity check. So when the audit process increments through the list of online copies in the inventory, one of its tasks would be to check against this list to see if the online copy is due for a fixity check. Fixity checking of archive copies would be trickier.
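A minimal sketch of that eligibility rule, assuming the 'last checked' list is a simple mapping from druid to timestamp. The 90-day TTL, the function name, and the sample druids are all made up for illustration; the thread only says the TTL should be configurable.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical default; the real TTL would come from configuration.
FIXITY_TTL = timedelta(days=90)


def due_for_fixity_check(druid, last_checked, now, ttl=FIXITY_TTL):
    """Return True if the druid was never checked or its last check is older than ttl."""
    checked_at = last_checked.get(druid)
    if checked_at is None:
        return True  # not in the list at all
    return now - checked_at > ttl


now = datetime(2017, 9, 1, tzinfo=timezone.utc)
last_checked = {
    "druid:aa111bb2222": now - timedelta(days=10),   # checked recently
    "druid:cc333dd4444": now - timedelta(days=120),  # stale
}
online_copies = ["druid:aa111bb2222", "druid:cc333dd4444", "druid:ee555ff6666"]

due = [d for d in online_copies if due_for_fixity_check(d, last_checked, now)]
print(due)
```

Passing `now` in explicitly keeps the check easy to test and lets one audit pass use a single consistent timestamp.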

LynnMcRae commented 7 years ago

I'll do a separate story on the overall management and scheduling of the audit processes -- frequency, regularity, based on activity/staleness, etc. The Inventory could carry some of the "last checked" information desired, if that is part of your suggestion.

julianmorley commented 7 years ago

Some of my original questions for what this catalog should be able to answer:

ndushay commented 7 years ago

potential consumers, per discussion

julianmorley commented 7 years ago

And the main consumer would be the audit process.

LynnMcRae commented 7 years ago

The central catalog should be able to answer these questions:

ndushay commented 7 years ago

This reflects current state only. The provenance db will have the history. It needs to be an accurate inventory (and it will ultimately need to be audited for accuracy, perhaps on an ongoing basis?)

If the PCC data were lost, it would be rebuilt from a live inventory. By contrast, if the provenance data (history) were lost, it would be restored from backup.

LynnMcRae commented 7 years ago

Possible tasks:

LynnMcRae commented 7 years ago

This may be relevant to selecting a database/persistence technology. PC is heavily firewalled with very minimal exposure to the outside world -- it pulls objects from DOR (no one pushes content in), and it provides SDR Web Services to DOR/Argo only. We should have something that is wholly contained within the Preservation Core's domain of physical servers. The mention of Solr brought this to mind, since we'd want to leverage the Solr cloud, which this constraint may rule out.

julianmorley commented 7 years ago

Some more datastore notes:

julianmorley commented 7 years ago

The original design called for many storage servers, each with its own selection of Moab objects to protect and catalog, and each with a local catalog used for local audit/archive/recovery activities that also reported upstream to a central PCC. The central PCC collated data from all the nodes for consumption by DOR/external services.

Think of it in terms of each current sdr-services mount existing on its own physical storage server, with its own series of SDR-PC processes for inventory, audit, archive and recovery and its own local shard of the PCC. This would enable horizontal scaling of our storage and allow us to effectively audit a truly large number of druids in parallel.
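To make the sharding idea concrete, here is a small sketch in which each storage node audits its own shard in parallel and a central catalog collates the results. The node names, the placeholder audit result, and the use of threads in one process are assumptions for illustration only; in the design described above each shard would live on a separate physical server.

```python
from concurrent.futures import ThreadPoolExecutor


def audit_shard(node, druids):
    """Placeholder audit of one node's shard; a real audit would verify each Moab object."""
    return node, {druid: "ok" for druid in druids}


def audit_all(shards):
    """Audit every shard in parallel and collate the results into one central view."""
    central = {}
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        for node, statuses in pool.map(lambda item: audit_shard(*item), shards.items()):
            for druid, status in statuses.items():
                central[druid] = {"node": node, "status": status}
    return central


# Hypothetical node names and druids.
shards = {
    "storage_node_01": ["druid:aa111bb2222", "druid:cc333dd4444"],
    "storage_node_02": ["druid:ee555ff6666"],
}
central_catalog = audit_all(shards)
print(central_catalog)
```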

julianmorley commented 7 years ago

Two diagrams. The first gives an overall view of the object lifecycle and main components. Some peripheral items (PD, APIs elsewhere) are explicitly not shown to hopefully make this nice and clear: Preservation Core Simple Object Lifecycle.pdf

The second diagram is a subset of the first, and shows the scope of this first story: Preservation Core Simple Object Lifecycle - Story 1.pdf

The Inventory process populates the preservation core catalog from the Moab Object Store. The Audit process iterates through items in the preservation core catalog and reconciles them. If it finds problems, Audit tells the Recovery or Archive processes to do work.
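The flow described above could be sketched roughly as follows. Everything here is hypothetical: the `CatalogEntry` fields, the rule that a bad online copy goes to Recovery while a missing archive copy goes to Archive, and the in-memory stand-in for the Moab Object Store are assumptions, not decisions made in this thread.

```python
from dataclasses import dataclass


@dataclass
class CatalogEntry:
    druid: str
    online_copy_ok: bool   # hypothetical field
    archive_copy_ok: bool  # hypothetical field


def inventory(moab_store):
    """Populate the catalog from a stand-in Moab store: druid -> (online_ok, archive_ok)."""
    return [CatalogEntry(druid, online, archive)
            for druid, (online, archive) in sorted(moab_store.items())]


def audit(catalog):
    """Walk the catalog and return (process, druid) work requests for any problems found."""
    work = []
    for entry in catalog:
        if not entry.online_copy_ok:
            work.append(("recovery", entry.druid))  # assumed routing rule
        if not entry.archive_copy_ok:
            work.append(("archive", entry.druid))   # assumed routing rule
    return work


# Made-up store contents.
moab_store = {
    "druid:aa111bb2222": (True, True),
    "druid:cc333dd4444": (False, True),
    "druid:ee555ff6666": (True, False),
}
work_requests = audit(inventory(moab_store))
print(work_requests)
```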

julianmorley commented 7 years ago

Per today's Naming Of The Things, this is now the Object Inventory Store, one of the three components of the Preservation Core Catalog.

ndushay commented 7 years ago

Notes from story time 8/31/17 (per @SaravShah )