How about phase 1 - we start with the new approach for Google Books only (if they have a unique format signature... if not, start with the new approach for the 'book' format)?
@ndushay I don't want to have to maintain 2 ways to do anything. We already have too much complexity, so we must be consistent.
@jcoyne I was thinking only about "how do we get from where we are to the new way" and "how does this help googlebooks" but I defer to your judgement.
Regarding:

> For specific sets of known formats, choose specific fields that we know we want and are likely to be present. Examples would be along these lines:
>
> - images: dimensions, color space, camera settings, etc.
> - moving images: codec, container, pixel format, duration, resolution, number of tracks, etc.
>
> For formats not on the known list, store a minimal amount of metadata, mainly file system metadata:
>
> - filename, file size, file dates (created, modified), format ID
I want to share the following schemas of terms used in the Samvera community. These are fairly mature, having been used by a wide swath of the community over the past 7-8 years:
I'm not suggesting we need to go with these, but they may be worth reviewing when working on this.
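To make the quoted field selection concrete, here is a minimal sketch of per-format-family allowlists; the family names and field lists are illustrative assumptions, not a settled schema:

```python
# Illustrative sketch of per-format-family field selection.
# Family names and field lists are assumptions, not a proposal.
FIELDS_FOR_FAMILY = {
    "image": ["dimensions", "color_space", "camera_settings"],
    "moving_image": ["codec", "container", "pixel_format",
                     "duration", "resolution", "track_count"],
}

# Formats not on the known list get file-system metadata only.
MINIMAL_FIELDS = ["filename", "file_size", "created", "modified", "format_id"]

def fields_to_keep(family):
    """Return the metadata fields to persist for a given format family."""
    return FIELDS_FOR_FAMILY.get(family, MINIMAL_FIELDS)
```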
I wonder if the mapping work described here:

> Based on the file format ID, select an appropriate tool to extract metadata from that file. Tools are TBD, but it could be something like:
>
> - for still image formats, use exiftool
> - for A/V formats, use mediainfo
> - for PDFs, use Apache Tika
> - for office documents, use Apache Tika
has been tackled anywhere out in the open? For instance, might Archivematica have something where, given an identified format (Siegfried output, PRONOM identifier, whatever), an appropriate characterization tool is applied?
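The dispatch itself is simple to sketch. This is not Archivematica's FPR, just an illustration: the format IDs and tool invocations below are examples, and a real mapping would live in configuration rather than a hardcoded dict.

```python
import json
import subprocess

# Map format IDs (e.g. PRONOM IDs from Siegfried) to a characterization
# command. The IDs and tool choices below are illustrative examples,
# not a vetted registry.
TOOL_FOR_FORMAT = {
    "fmt/43": ["exiftool", "-json"],                        # JPEG -> exiftool
    "fmt/199": ["mediainfo", "--Output=JSON"],              # MP4 -> mediainfo
    "fmt/276": ["java", "-jar", "tika-app.jar", "--json"],  # PDF -> Tika
}

def characterize(format_id, path):
    """Run the tool mapped to format_id and return its parsed JSON output."""
    command = TOOL_FOR_FORMAT.get(format_id)
    if command is None:
        # Unknown format: fall back to minimal file-system metadata.
        return {}
    result = subprocess.run(command + [path], capture_output=True, check=True)
    return json.loads(result.stdout)
```

A production version would also need versioned tools and editable rules, which is where something like a format policy registry comes in.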
I greatly appreciate your having written this issue up with such thoroughness, @andrewjbtw.
Archivematica does this with the Format Policy Registry. They've set it up in the interface so that users can choose and edit the commands. The user documentation is here:
I'm pretty sure there used to be a separate FPR codebase, but I think it's been brought into the main Archivematica repo here: https://github.com/artefactual/archivematica/tree/qa/1.x/src/dashboard/src/fpr. It's in Python, anyway.
There's a joint effort to create a common Preservation Action Registry so that different applications that do similar things (Preservica has its own functionality around this, as do other commercial and open source products) could communicate, but it's hard to tell how far that's gone: http://parcore.org/
@justinlittman's friendly amendment for stage 1:
This would require getting buy-in on the idea of not persisting tech MD to the moab.
Stage 2 would include making the tech MD queryable, e.g., behind an HTTP API or in Solr?
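As a purely hypothetical illustration of what "queryable behind an HTTP API" might look like (no such endpoint exists; the host, path, and parameters are invented):

```python
import requests

# Hypothetical endpoint: nothing like this exists yet. The host, path,
# and parameters are invented to illustrate the stage 2 idea.
resp = requests.get(
    "https://techmd.example.edu/v1/technical-metadata/druid/bb123cd4567",
    params={"filename": "page_0001.tif"},
)
resp.raise_for_status()
print(resp.json())  # file-level technical metadata for one file
```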
I've updated the main ticket to reflect the decisions made at yesterday's technical metadata meeting with Tom, Vivian, Hannah, and Julian. We have the go-ahead for stage 1 to:
This issue was subsumed by sul-dlss/google-books#47. Closing.
This is a proposal to improve SDR's generation of technical metadata to make it:
## Background
SDR currently generates technical metadata by running JHOVE during accessioning and saving whatever JHOVE outputs, aggregated together for the whole group of files. This output is stored both in a Fedora datastream and in an XML file written to the moab for that object. The technicalMetadata datastream can be accessed via the Argo interface, but it is not indexed for search or faceting, and as far as we know there are no users of it and no use cases for it in its current state.
Over the years that SDR has been in operation, this approach to technical metadata generation and storage has run into the following issues:
## A proposal
We should take a new approach to technical metadata generation and storage rather than attempt to patch the current system. I propose that we:
- Store this metadata in a structure that makes it easy to access individual file-level information (see the sketch after this list)
- Store the metadata somewhere that can be more easily queried and updated
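As a sketch of what "easy to access individual file-level information" could look like, here is one possible per-file record shape; every field name and value is an illustrative assumption, not a settled schema:

```python
# One possible shape for a file-level technical metadata record.
# Every field name and value here is an illustrative assumption.
record = {
    "druid": "bb123cd4567",              # object identifier
    "filename": "page_0001.tif",
    "file_size": 28332072,               # bytes
    "file_created": "2019-02-11T10:32:00Z",
    "file_modified": "2019-02-11T10:32:00Z",
    "format_id": "fmt/353",              # e.g. a PRONOM ID for TIFF
    "tool": "exiftool",                  # tool that produced the fields below
    "fields": {
        "ImageWidth": 2480,
        "ImageHeight": 3508,
        "ColorSpace": "RGB",
    },
}
```

The key difference from the current JHOVE output is one record per file rather than one aggregated document per object.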
## Possible steps forward
Fully implementing a new system for generating and storing technical metadata would be a significant undertaking. At the same time, we need to move forward with making changes because the existing process is not meeting our needs.
One possibility is to split the steps above into different stages.
### Stage 1
This stage could implement a new way to generate technical metadata while continuing to send it to the moab. This would involve:
### Stage 2
This would be the steps of Stage 1, plus:
### Stage 3
Stages 1 and 2, plus:
## Possible long-term needs beyond stage 3
In a meeting to discuss this proposal on 2019-02-11, Tom brought up the question of storing technical metadata in the archival package (i.e., inside the moab) in the following situations:
While there are no immediate plans for either of these cases, we should keep in mind that there may be use cases for exporting technical metadata from the database and associating it with individual objects. At the meeting we concluded that this would be better done as part of the export process, not prospectively before such a process has begun.
## Risks and assumptions
There is a real possibility that even getting to stage 1 will require more work than we anticipate. We may find:
For stage 2, it could turn out that:
For stage 3:
There are also the general risks and constraints of adding to the infrastructure team's portfolio, given how much work is already on the schedule. However, technical metadata falls within the scope of SDR evolution and maintenance, and the existing bottlenecks around technical metadata are a real problem that must be overcome for Google Books and other projects already in progress.