sul-dlss / common-accessioning

Suite of robots that handle the tasks of accessioning digital objects

[epic] A new approach to technical metadata #555

Closed andrewjbtw closed 4 years ago

andrewjbtw commented 4 years ago

This is a proposal to improve SDR's generation of technical metadata to make it faster, more granular (file-level rather than object-level), and easier to query and update.

Background

SDR currently generates technical metadata by running JHOVE during accessioning and saving whatever information JHOVE outputs for the whole group of files aggregated together. This output is stored in both a Fedora datastream and an XML file, and the XML file is stored in the moab for that object. The technicalMetadata datastream can be accessed via the Argo interface but it is not indexed for search or faceting, and as far as we know there are no users who use it and no use cases for using it in its current state.

Over the years that SDR has been in operation, this approach to technical metadata generation and storage has run into the following issues:

  1. Out-of-memory errors during accessioning (see #311)
  2. Technical metadata takes a long time to generate
  3. Technical metadata is kept at the object level rather than the file level
  4. We are storing JHOVE output regardless of what it contains
  5. We do not update technical metadata as JHOVE improves and bugs are fixed
  6. It's difficult to access technical metadata in any programmatic way

A proposal

We should take a new approach to technical metadata generation and storage rather than attempt to patch the current system. I propose that we (steps 1-4 are sketched in code after this list):

  1. Run a file format identification tool on files before running metadata extraction
  2. Based on the file format ID, select an appropriate tool to extract metadata from that file
  3. For specific sets of known formats, choose specific fields that we know we want and are likely to be present
  4. For formats not on the known list, store a minimal amount of metadata, mainly file system metadata
  5. Store this metadata in a structure that makes it easy to access individual file-level information
  6. Store the metadata somewhere that can be more easily queried and updated
  7. Implement a process for updating technical metadata for files already in preservation storage
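To make steps 1-4 concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: Siegfried (`sf`) stands in for whatever identification tool we pick, the PUID-to-tool table is hypothetical, and the downstream tools are assumed to be on the PATH.

```python
import json
import subprocess

# Hypothetical PUID-to-command table; the pairings follow the tool examples
# discussed later in this thread (exiftool, mediainfo, Apache Tika).
TOOL_FOR_FORMAT = {
    "fmt/43": ["exiftool", "-json"],                        # JPEG
    "fmt/199": ["mediainfo", "--Output=JSON"],              # MP4
    "fmt/276": ["java", "-jar", "tika-app.jar", "--json"],  # PDF 1.7
}

def identify(path):
    """Run Siegfried and return the PRONOM PUID of the best match."""
    out = subprocess.run(["sf", "-json", path], capture_output=True, check=True)
    report = json.loads(out.stdout)
    return report["files"][0]["matches"][0]["id"]

def characterize(path):
    """Select a tool based on the format ID; fall back to minimal metadata."""
    puid = identify(path)
    command = TOOL_FOR_FORMAT.get(puid)
    if command is None:
        # Unknown format: record only the format ID here; file system
        # metadata (name, size, dates) would be gathered separately.
        return {"puid": puid}
    out = subprocess.run(command + [path], capture_output=True, check=True)
    return {"puid": puid, "tool_output": json.loads(out.stdout)}
```

The point is the shape of the flow: identify once, then dispatch, with a minimal-metadata fallback for formats not on the known list.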

Possible steps forward

Fully implementing a new system for generating and storing technical metadata would be a significant undertaking. At the same time, we need to move forward with making changes because the existing process is not meeting our needs.

One possibility is to split the steps above into different stages.

Stage 1

Stage 1 would implement a new way to generate technical metadata and a new place to store it. This would involve:

  1. New file format identification
  2. Running format-appropriate tools to generate metadata
  3. Using a minimal data model (covering a small number of identified file types) and storing that data in a database for technical metadata (a sketch follows this list)
  4. No longer sending technical metadata to the moab
  5. Indexing technical metadata for Argo so there is at least one access point for that data
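As a sketch of the minimal data model in step 3: the column set below is an assumption based on the fields discussed in this ticket, and SQLite is used only to keep the example self-contained.

```python
import sqlite3

# Hypothetical file-level technical metadata table: one row per file,
# keyed by druid plus filename, with format-specific fields kept as JSON.
conn = sqlite3.connect("techmd.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS file_technical_metadata (
        druid    TEXT NOT NULL,   -- object identifier
        filename TEXT NOT NULL,   -- path within the object
        bytes    INTEGER,         -- file size
        created  TEXT,            -- file system creation date (ISO 8601)
        modified TEXT,            -- file system modification date (ISO 8601)
        puid     TEXT,            -- PRONOM format identifier, if matched
        tool     TEXT,            -- tool used for extraction, if any
        metadata TEXT,            -- JSON blob of format-specific fields
        PRIMARY KEY (druid, filename)
    )
""")
conn.commit()
```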

Stage 2

This would be the steps of Stage 1, plus:

  1. Creating a technical metadata service that's queryable on its own (a hypothetical query is sketched below)
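A hypothetical query against such a service; the endpoint URL and response shape are invented for illustration.

```python
# Purely hypothetical: neither this endpoint nor its response shape exists
# yet. It only illustrates what "queryable on its own" could mean.
import requests

resp = requests.get(
    "https://techmd.example.edu/v1/technical-metadata/druid/bb000cc1111"
)
resp.raise_for_status()
for file_md in resp.json():  # assume one JSON record per file
    print(file_md["filename"], file_md.get("puid"), file_md.get("bytes"))
```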

Stage 3

Stages 1 and 2, plus:

  1. Implementing a process for regenerating technical metadata from copies in preservation, and using that process to update the technical metadata service periodically (sketched below)
  2. Updating the technical metadata for all of the druids that were accessioned prior to the new method and still have only JHOVE metadata in their Moabs
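A rough sketch of what the periodic regeneration pass in step 1 might look like, reusing `characterize()` from the earlier sketch; all names and the endpoint are hypothetical. The same loop, fed the backlog of druids, would cover the backfill in step 2.

```python
# Sketch of a periodic regeneration pass; every name here is hypothetical.
# Reuses characterize() from the sketch above and assumes the service
# accepts PUT updates.
import requests

def regenerate(druids, files_for_druid):
    """Re-characterize preserved copies and push the results to the service.

    druids          -- iterable of druid identifiers to refresh
    files_for_druid -- callable returning local paths to the preserved files
    """
    for druid in druids:
        records = [characterize(path) for path in files_for_druid(druid)]
        requests.put(
            f"https://techmd.example.edu/v1/technical-metadata/druid/{druid}",
            json=records,
        ).raise_for_status()
```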

Possible long term needs beyond stage 3

In a meeting to discuss this proposal on 2019-02-11, Tom brought up the question of storing technical metadata in the archival package (i.e. inside the Moab) in the following situations:

  1. Export to another system (like DPN, if such a system arises again)
  2. Export to cold storage

While there are no immediate plans for either of these cases, we should keep in mind that there may be use cases for exporting technical metadata from the database and associating it with individual objects. At the meeting we concluded that this would be better done as part of the export process, not prospectively before such a process has begun.

Risks and assumptions

There is a real possibility that even getting to stage 1 requires more work than we anticipate. We may find:

For stage 2, it could turn out that:

For stage 3:

There are also the general risks and constraints of adding work to the infrastructure team's portfolio, given how much is already on the schedule. However, technical metadata does fall within the scope of SDR evolution and maintenance, and the existing bottlenecks with technical metadata are a real problem that must be overcome for Google Books and other projects already in progress.

ndushay commented 4 years ago

How about for phase 1 we start with the new approach for Google Books only? (If they have a unique format signature ... if not, start with the new approach for the 'book' format?)

jcoyne commented 4 years ago

@ndushay I don't want to have to maintain 2 ways to do anything. We already have too much complexity, so we must be consistent.

ndushay commented 4 years ago

@jcoyne I was thinking only about "how do we get from where we are to the new way" and "how does this help googlebooks" but I defer to your judgement.

mjgiarlo commented 4 years ago

Regarding:

> For specific sets of known formats, choose specific fields that we know we want and are likely to be present

Examples would be along these lines:

- images: dimensions, color space, camera settings, etc.
- moving images: codec, container, pixel format, duration, resolution, number of tracks, etc.

> For formats not on the known list, store a minimal amount of metadata, mainly file system metadata

- filename, file size, file dates (created, modified), format ID
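Put together, individual file-level records might look roughly like this; the field names are illustrative, not a settled schema.

```python
# Illustrative records only; the field names are assumptions based on the
# examples above.
known_format_record = {
    "filename": "image001.tif",
    "bytes": 52430212,
    "puid": "fmt/353",  # TIFF
    "height": 6500,
    "width": 4300,
    "color_space": "RGB",
}

unknown_format_record = {
    "filename": "data.xyz",
    "bytes": 1024,
    "created": "2020-01-15T10:22:31Z",
    "modified": "2020-01-15T10:22:31Z",
    "puid": None,  # no format match; file system metadata only
}
```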

I want to share the following schemas of terms used in the Samvera community. These are fairly mature, having been used by a wide swath of the community over the past 7-8 years:

I'm not suggesting we need to go with these, but they may be worth reviewing when working on this.

mjgiarlo commented 4 years ago

I wonder if the mapping work described here:

> Based on the file format ID, select an appropriate tool to extract metadata from that file
>
> tools are TBD but it could be something like:
>
> - for still image formats, use exiftool
> - for A/V formats, use mediainfo
> - for PDFs, use Apache Tika
> - for office documents, use Apache Tika

has been tackled anywhere out in the open? For instance, might Archivematica have something where, given an identified format (Siegfried output, PRONOM identifier, whatever), an appropriate characterization tool is applied?

mjgiarlo commented 4 years ago

I greatly appreciate you having written this issue up with such thoroughness, @andrewjbtw

andrewjbtw commented 4 years ago

Archivematica does this with the Format Policy Registry. They've set it up in the interface so that users can choose and edit the commands. The user documentation is here:

https://www.archivematica.org/en/docs/archivematica-1.10/user-manual/preservation/preservation-planning/

I'm pretty sure there used to be a separate FPR codebase, but I think it's been brought into the main Archivematica repo here: https://github.com/artefactual/archivematica/tree/qa/1.x/src/dashboard/src/fpr It's in Python, anyway.
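Conceptually, the FPR pairs a format with a purpose and a command. A toy approximation of that rule model (the rule contents here are invented, not taken from the real FPR data; `%fileFullName%` mimics its command placeholder style):

```python
# Invented rules approximating the format-plus-purpose model described
# above; not actual FPR data.
FPR_LIKE_RULES = [
    {"format": "fmt/43", "purpose": "characterization",
     "command": "exiftool -json %fileFullName%"},
    {"format": "fmt/199", "purpose": "characterization",
     "command": "mediainfo --Output=JSON %fileFullName%"},
]

def rules_for(puid, purpose="characterization"):
    """Return the commands registered for a given format and purpose."""
    return [r for r in FPR_LIKE_RULES
            if r["format"] == puid and r["purpose"] == purpose]
```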

There's a joint effort to create a common Preservation Action Registry so that different applications that do similar things - Preservica has their own functionality around this, and so do other commercial and open source products - could communicate, but it's hard to tell how far that's gone: http://parcore.org/

mjgiarlo commented 4 years ago

@justinlittman's friendly amendment for stage 1:

> This would require getting buy-in on the idea of not persisting tech MD to the moab.

Stage 2 would include making the tech MD queryable, e.g., behind an HTTP API or in Solr?

andrewjbtw commented 4 years ago

I've updated the main ticket to reflect the decisions made at yesterday's technical metadata meeting with Tom, Vivian, Hannah, and Julian. We have the go-ahead for stage 1 to:

  1. Store technical metadata in a techMD database
  2. Stop storing it in the Moab

mjgiarlo commented 4 years ago

This issue was subsumed by sul-dlss/google-books#47. Closing.