LynnMcRae opened this issue 7 years ago
One thought I had for how to handle fixity results in the PCC is to have a 'last checked' list of druid + timestamp. If a druid is not in the list, or if its timestamp is too old (a configurable TTL), it's eligible for a new fixity check. So when the audit process increments through the list of online copies in the inventory, one of its tasks would be to check against this list to see if the online copy is due for a fixity check. Fixity checking of archive copies would be trickier.
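The 'last checked' idea above could be sketched roughly like this. This is illustrative only; the function and constant names are assumptions, not anything from our codebase:

```python
import time

# Configurable TTL for fixity checks; 90 days is an arbitrary example value.
FIXITY_TTL_SECONDS = 90 * 24 * 60 * 60

def due_for_fixity_check(druid, last_checked, now=None):
    """Return True if the druid is missing from the 'last checked' map
    (druid -> epoch timestamp) or its timestamp is older than the TTL."""
    now = now if now is not None else time.time()
    timestamp = last_checked.get(druid)
    return timestamp is None or (now - timestamp) > FIXITY_TTL_SECONDS
```

The audit process would call something like this for each online copy as it walks the inventory, queueing a fixity check whenever it returns True.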
I'll do a separate story on the overall management and scheduling of the audit processes -- frequency, regularity, based on activity/staleness, etc. The Inventory could carry some of the "last checked" information desired, if that is part of your suggestion.
Some of my original questions for what this catalog should be able to answer:
potential consumers, per discussion
DOR - consumer by the druid: present info to argo users and possibly to argo index (for creating a way to facet on status in prez core)
Repository manager (Ben) may want to query for the state of the objects, or the number of objects in a particular state
And the main consumer would be the audit process.
The central catalog should be able to answer these questions:
This reflects current state only. The provenance db will have the history. It needs to be an accurate inventory (and it will ultimately need to be audited for accuracy, perhaps on an ongoing basis?)
If the PCC data were lost, it would be rebuilt from a live inventory; by contrast, if the provenance data (the history) were lost, it would be restored from backup.
Possible tasks:
Of possible relevance to selecting a database/persistence technology. PC is heavily firewalled with very minimal exposure to the outside world -- it pulls objects from DOR (no one pushes content in), it provides SDR Web Services to DOR/Argo only. We should have something that is wholly contained within the Preservation Core's domain of physical servers. The mention of solr brought this to mind since we'd be wanting to leverage the solr cloud, possibly ruling that out.
Some more datastore notes:
The original design called for many storage servers, each with their own selection of Moab objects to protect and catalog, with a local catalog used for local audit/archive/recovery activities that also reported upstream to a central PCC that collated data from all the nodes for consumption by DOR/external services.
Think of it in terms of each current sdr-services mount existing on its own physical storage server, with its own series of SDR-PC processes for inventory, audit, archive and recovery and its own local shard of the PCC. This would enable horizontal scaling of our storage and allow us to effectively audit a truly large number of druids in parallel.
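One way to picture the sharding described above is a deterministic druid-to-node mapping, so every process agrees on which storage server (and which local PCC shard) owns a given druid. Purely a sketch; the node names and hash-based routing are my assumptions, not a settled design:

```python
import hashlib

# Hypothetical storage server names; each would host its own Moab objects
# and its own local PCC shard.
STORAGE_NODES = ["sdr-store-01", "sdr-store-02", "sdr-store-03"]

def shard_for(druid):
    """Deterministically map a druid to a storage node by hashing it."""
    digest = hashlib.sha256(druid.encode("utf-8")).hexdigest()
    return STORAGE_NODES[int(digest, 16) % len(STORAGE_NODES)]
```

Because the mapping is deterministic, inventory and audit on each node can run in parallel over just that node's druids.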
Two diagrams: The first gives an overall view of the object lifecycle and main components. Some peripheral items (PD, APIs elsewhere) are explicitly not shown to hopefully make this nice and clear. Preservation Core Simple Object Lifecycle.pdf The second diagram is a subset of the first, and shows the scope of this first story. Preservation Core Simple Object Lifecycle - Story 1.pdf
The Inventory process populates the preservation core catalog from the Moab Object Store. The Audit process iterates through items in the preservation core catalog and reconciles them. If it finds problems, Audit tells the Recovery or Archive processes to do work.
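The Audit loop just described could be sketched as follows. The process names mirror the diagram components, but every method name here is an assumption about their interfaces:

```python
def audit(catalog, moab_store, recovery, archive):
    """Iterate the preservation core catalog, reconcile each entry against
    the Moab object store, and dispatch work to Recovery or Archive."""
    for entry in catalog.entries():
        actual = moab_store.lookup(entry.druid)
        if actual is None:
            recovery.restore(entry.druid)        # object missing from storage
        elif actual.version < entry.version:
            recovery.restore(entry.druid)        # stale or damaged online copy
        elif not archive.has_current_copy(entry.druid, entry.version):
            archive.replicate(entry.druid)       # archive copy lags behind
```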
Per today's Naming Of The Things, this is now the Object Inventory Store, one of the three components of the Preservation Core Catalog.
Notes from story time 8/31/17 (per @SaravShah )
This is an operational metadata store, part of the overall Preservation Core Catalog. It has an entry per object (per druid) that exists in Preservation Core. There must be a one-to-one correspondence with items in the Moab object store. It will contain the current state of each object, in order to expose information about objects in Preservation Core to authorized users.
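A minimal sketch of the entry-per-druid record described above. The fields are guesses at the "current state" information we'd want, not a confirmed schema:

```python
from dataclasses import dataclass

@dataclass
class InventoryEntry:
    """One row per druid in the Object Inventory Store (hypothetical fields)."""
    druid: str
    current_version: int
    status: str                # e.g. "ok", "fixity_failed", "missing"
    last_fixity_check: float   # epoch seconds of the most recent check
```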
Desired information (wip)
Part of the overall audit strategy should be to assure the completeness and accuracy of this metadata. Note that while the inventory is a reflection of what's actually out there, and in principle could be reconstructed from the Moab directories and Archive Endpoints, the idea is to maintain this as an active component in identifying and managing Preservation Core, acting as a form of double-entry bookkeeping when auditing for irregularities.
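The double-entry bookkeeping idea amounts to diffing the inventory's view against what is actually on storage, with discrepancies in either direction treated as audit findings. A minimal sketch, assuming we can enumerate druids from both sides:

```python
def reconcile(catalog_druids, storage_druids):
    """Compare the inventory's druid set against the druids actually found
    on storage; anything in one set but not the other is an irregularity."""
    catalog_set, storage_set = set(catalog_druids), set(storage_druids)
    return {
        "missing_from_storage": sorted(catalog_set - storage_set),
        "missing_from_catalog": sorted(storage_set - catalog_set),
    }
```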