sul-dlss / common-accessioning

Suite of robots that handle the tasks of accessioning digital objects

[epic] A new approach to technical metadata #555

Closed andrewjbtw closed 4 years ago

andrewjbtw commented 4 years ago

This is a proposal to improve SDR's generation of technical metadata to make it faster, more granular (file-level rather than object-level), and easier to query and update.

Background

SDR currently generates technical metadata by running JHOVE during accessioning and saving whatever information JHOVE outputs for the whole group of files aggregated together. This output is stored in both a Fedora datastream and an XML file, and the XML file is stored in the moab for that object. The technicalMetadata datastream can be accessed via the Argo interface but it is not indexed for search or faceting, and as far as we know there are no users who use it and no use cases for using it in its current state.

Over the years that SDR has been in operation, this approach to technical metadata generation and storage has run into the following issues:

  1. Out-of-memory errors during accessioning (see #311)
  2. Technical metadata takes a long time to generate
  3. Technical metadata is kept at the object level rather than the file level
  4. We are storing JHOVE output regardless of what it contains
  5. We do not update technical metadata as JHOVE improves and bugs are fixed
  6. It's difficult to access technical metadata in any programmatic way

A proposal

We should take a new approach to technical metadata generation and storage rather than attempt to patch the current system. I propose that we (steps 1-4 are sketched in code after this list):

  1. Run a file format identification tool on files before running metadata extraction
  2. Based on the file format ID, select an appropriate tool to extract metadata from that file
  3. For specific sets of known formats, choose specific fields that we know we want and are likely to be present
  4. For formats not on the known list, store a minimal amount of metadata, mainly file system metadata
  5. Store this metadata in a structure that makes it easy to access individual file-level information
  6. Store the metadata somewhere that can be more easily queried and updated
  7. Implement a process for updating technical metadata for files already in preservation storage
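To make steps 1-4 concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: Siegfried (`sf`) stands in for whatever identification tool we pick, the PUID-to-tool table is hypothetical, and the downstream tools are assumed to be on the PATH.

```python
import json
import subprocess

# Hypothetical PUID-to-command table; the pairings follow the tool examples
# discussed later in this thread (exiftool, mediainfo, Apache Tika).
TOOL_FOR_FORMAT = {
    "fmt/43": ["exiftool", "-json"],                        # JPEG
    "fmt/199": ["mediainfo", "--Output=JSON"],              # MP4
    "fmt/276": ["java", "-jar", "tika-app.jar", "--json"],  # PDF 1.7
}

def identify(path):
    """Run Siegfried and return the PRONOM PUID of the best match."""
    out = subprocess.run(["sf", "-json", path], capture_output=True, check=True)
    report = json.loads(out.stdout)
    return report["files"][0]["matches"][0]["id"]

def characterize(path):
    """Select a tool based on the format ID; fall back to minimal metadata."""
    puid = identify(path)
    command = TOOL_FOR_FORMAT.get(puid)
    if command is None:
        # Unknown format: record only the format ID here; file system
        # metadata (name, size, dates) would be gathered separately.
        return {"puid": puid}
    out = subprocess.run(command + [path], capture_output=True, check=True)
    return {"puid": puid, "tool_output": json.loads(out.stdout)}
```

The point is the shape of the flow: identify once, then dispatch, with a minimal-metadata fallback for formats not on the known list.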

Possible steps forward

Fully implementing a new system for generating and storing technical metadata would be a significant undertaking. At the same time, we need to move forward with making changes because the existing process is not meeting our needs.

One possibility is to split the steps above into different stages.

Stage 1

Stage 1 would implement a new way to generate technical metadata and a new place to store it. This would involve:

  1. New file format identification
  2. Running format-appropriate tools to generate metadata
  3. Using a minimal data model (covering a small number of identified file types) and storing that data in a database for technical metadata (a sketch follows this list)
  4. No longer sending technical metadata to the moab
  5. Indexing technical metadata for Argo so there is at least one access point for that data
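As a sketch of the minimal data model in step 3: the column set below is an assumption based on the fields discussed in this ticket, and SQLite is used only to keep the example self-contained.

```python
import sqlite3

# Hypothetical file-level technical metadata table: one row per file,
# keyed by druid plus filename, with format-specific fields kept as JSON.
conn = sqlite3.connect("techmd.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS file_technical_metadata (
        druid    TEXT NOT NULL,   -- object identifier
        filename TEXT NOT NULL,   -- path within the object
        bytes    INTEGER,         -- file size
        created  TEXT,            -- file system creation date (ISO 8601)
        modified TEXT,            -- file system modification date (ISO 8601)
        puid     TEXT,            -- PRONOM format identifier, if matched
        tool     TEXT,            -- tool used for extraction, if any
        metadata TEXT,            -- JSON blob of format-specific fields
        PRIMARY KEY (druid, filename)
    )
""")
conn.commit()
```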

Stage 2

This would be the steps of Stage 1, plus:

  1. Creating a technical metadata service that's queryable on its own (a hypothetical query is sketched below)
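A hypothetical query against such a service; the endpoint URL and response shape are invented for illustration.

```python
# Purely hypothetical: neither this endpoint nor its response shape exists
# yet. It only illustrates what "queryable on its own" could mean.
import requests

resp = requests.get(
    "https://techmd.example.edu/v1/technical-metadata/druid/bb000cc1111"
)
resp.raise_for_status()
for file_md in resp.json():  # assume one JSON record per file
    print(file_md["filename"], file_md.get("puid"), file_md.get("bytes"))
```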

Stage 3

Stages 1 and 2, plus:

  1. Implementing a process for regenerating technical metadata from copies in preservation, and using that process to update the technical metadata service periodically (sketched below)
  2. Updating the technical metadata for all of the druids that were accessioned prior to the new method and still have only JHOVE metadata in their Moabs
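A rough sketch of what the periodic regeneration pass in step 1 might look like, reusing `characterize()` from the earlier sketch; all names and the endpoint are hypothetical. The same loop, fed the backlog of druids, would cover the backfill in step 2.

```python
# Sketch of a periodic regeneration pass; every name here is hypothetical.
# Reuses characterize() from the sketch above and assumes the service
# accepts PUT updates.
import requests

def regenerate(druids, files_for_druid):
    """Re-characterize preserved copies and push the results to the service.

    druids          -- iterable of druid identifiers to refresh
    files_for_druid -- callable returning local paths to the preserved files
    """
    for druid in druids:
        records = [characterize(path) for path in files_for_druid(druid)]
        requests.put(
            f"https://techmd.example.edu/v1/technical-metadata/druid/{druid}",
            json=records,
        ).raise_for_status()
```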

Possible long term needs beyond stage 3

In a meeting to discuss this proposal on 2019-02-11, Tom brought up the question of storing technical metadata in the archival package (i.e. inside the Moab) in the following situations:

  1. Export to another system (like DPN, if such a system arises again)
  2. Export to cold storage

While there are no immediate plans for either of these cases, we should keep in mind that there may be use cases for exporting technical metadata from the database and associating it with individual objects. At the meeting we concluded that this would be better done as part of the export process, not prospectively before such a process has begun.

Risks and assumptions

There is a real possibility that even getting to stage 1 requires more work than we anticipate. We may find:

For stage 2, it could turn out that:

For stage 3:

There are also the general risks and constraints of adding work to the infrastructure team's portfolio, given how much is already on the schedule. However, technical metadata does fall within the scope of SDR evolution and maintenance, and the existing bottlenecks with technical metadata are a real problem that must be overcome for Google Books and other projects already in progress.

ndushay commented 4 years ago

How about for phase 1 we start with the new approach for Google Books only? (If they have a unique format signature ... if not, start with the new approach for the 'book' format?)

jcoyne commented 4 years ago

@ndushay I don't want to have to maintain 2 ways to do anything. We already have too much complexity, so we must be consistent.

ndushay commented 4 years ago

@jcoyne I was thinking only about "how do we get from where we are to the new way" and "how does this help googlebooks" but I defer to your judgement.

mjgiarlo commented 4 years ago

Regarding:

> For specific sets of known formats, choose specific fields that we know we want and are likely to be present

Examples would be along these lines:

- images: dimensions, color space, camera settings, etc.
- moving images: codec, container, pixel format, duration, resolution, number of tracks, etc.

> For formats not on the known list, store a minimal amount of metadata, mainly file system metadata

- filename, file size, file dates (created, modified), format ID
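Put together, individual file-level records might look roughly like this; the field names are illustrative, not a settled schema.

```python
# Illustrative records only; the field names are assumptions based on the
# examples above.
known_format_record = {
    "filename": "image001.tif",
    "bytes": 52430212,
    "puid": "fmt/353",  # TIFF
    "height": 6500,
    "width": 4300,
    "color_space": "RGB",
}

unknown_format_record = {
    "filename": "data.xyz",
    "bytes": 1024,
    "created": "2020-01-15T10:22:31Z",
    "modified": "2020-01-15T10:22:31Z",
    "puid": None,  # no format match; file system metadata only
}
```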

I want to share the following schemas of terms used in the Samvera community. These are fairly mature, having been used by a wide swath of the community over the past 7-8 years:

I'm not suggesting we need to go with these, but they may be worth reviewing when working on this.

mjgiarlo commented 4 years ago

I wonder if the mapping work described here:

> Based on the file format ID, select an appropriate tool to extract metadata from that file
>
> tools are TBD but it could be something like:
>
> - for still image formats, use exiftool
> - for A/V formats, use mediainfo
> - for PDFs, use Apache Tika
> - for office documents, use Apache Tika

has been tackled anywhere out in the open? For instance, might Archivematica have something where, given an identified format (Siegfried output, PRONOM identifier, whatever), an appropriate characterization tool is applied?

mjgiarlo commented 4 years ago

I greatly appreciate you having written this issue up with such thoroughness, @andrewjbtw

andrewjbtw commented 4 years ago

Archivematica does this with the Format Policy Registry. They've set it up in the interface so that users can choose and edit the commands. The user documentation is here:

https://www.archivematica.org/en/docs/archivematica-1.10/user-manual/preservation/preservation-planning/

I'm pretty sure there used to be a separate FPR codebase, but I think it's been brought into the main Archivematica repo here: https://github.com/artefactual/archivematica/tree/qa/1.x/src/dashboard/src/fpr It's in Python, anyway.
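Conceptually, the FPR pairs a format with a purpose and a command. A toy approximation of that rule model (the rule contents here are invented, not taken from the real FPR data; `%fileFullName%` mimics its command placeholder style):

```python
# Invented rules approximating the format-plus-purpose model described
# above; not actual FPR data.
FPR_LIKE_RULES = [
    {"format": "fmt/43", "purpose": "characterization",
     "command": "exiftool -json %fileFullName%"},
    {"format": "fmt/199", "purpose": "characterization",
     "command": "mediainfo --Output=JSON %fileFullName%"},
]

def rules_for(puid, purpose="characterization"):
    """Return the commands registered for a given format and purpose."""
    return [r for r in FPR_LIKE_RULES
            if r["format"] == puid and r["purpose"] == purpose]
```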

There's a joint effort to create a common Preservation Action Registry so that different applications that do similar things - Preservica has their own functionality around this, and so do other commercial and open source products - could communicate, but it's hard to tell how far that's gone: http://parcore.org/

mjgiarlo commented 4 years ago

@justinlittman's friendly amendment for stage 1:

> This would require getting buy-in on the idea of not persisting tech MD to the moab.

Stage 2 would include making the tech MD queryable, e.g., behind an HTTP API or in Solr?

andrewjbtw commented 4 years ago

I've updated the main ticket to reflect the decisions made at yesterday's technical metadata meeting with Tom, Vivian, Hannah, and Julian. We have the go-ahead for stage 1 to:

  1. Store technical metadata in a techMD database
  2. Stop storing it in the Moab

mjgiarlo commented 4 years ago

This issue was subsumed by sul-dlss/google-books#47. Closing.