slub / mets-mods2tei

Convert bibliographic meta data in MODS format to TEI headers
Apache License 2.0
8 stars 7 forks source link
conversion mets mods tei

mets-mods2tei

CircleCI codecov PyPI version

Convert bibliographic meta data in METS/MODS format to TEI headers and optionally serialize linked ALTO-encoded OCR to TEI text.

Background

MODS is the de-facto standard for encoding bibliographic meta data in libraries. It is usually included as a separate section into METS XML files. Physical and logical structure of a document are expressed in terms of structural mappings (structMap elements).

TEI is the de-facto standard for representing digital text for research purposes. It usually includes detailed bibliographic meta data in its header.

Since these standards contain a considerable amount of degrees of freedom, the conversion uses well-defined subsets. For MODS, this is the MODS Anwendungsprofil für digitalisierte Medien. For METS, the METS Anwendungsprofil für digitalisierte Medien 2.1 is consulted. For the TEI Header, the conversion is roughly based on the DTA base format.

mets-mods2tei is developed at the Saxon State and University Library in Dresden.

Installation

mets-mods2tei is implemented in Python 3. In the following, we assume a working Python 3 (tested versions 3.5, 3.6 and 3.7) installation.

Setup Python

Using virtual environments is highly recommended, although not strictly necessary for installing mets-mods2tei.

To create a virtual environement in a subdirectory of your choice (e.g. env), run

python3 -m venv env

(once) and then activate it (each time you open the shell) via

. env/bin/activate

Depending on how old the packages are which your base system provides, you might have to update pip first:

pip install -U pip setuptools

Get Python package

mets-mods2tei can be installed via pip3 directly. You can install from either the repository sources or the prebuilt distribution on PyPI:

From repository

If you have an active virtual environment, do

pip install mets-mods2tei

Otherwise, try

pip3 install --user mets-mods2tei

From source

Get the repository:

git clone https://github.com/slub/mets-mods2tei.git
cd mets-mods2tei

If you have an active virtual environment, do

pip install .

Otherwise, try

pip3 install --user .

Testing

mets-mods2tei uses pytest-based testing.

To install the prerequisites for testing, (in your venv), do

pip install -r requirements-test.txt

(once) and then run the tests via:

pytest

Code coverage

Determine code coverage by running

make coverage

Usage

mm2tei

Installing mets-mods2tei makes the command-line tool mm2tei available:

mm2tei --help

``` Usage: mm2tei [OPTIONS] METS METS: File containing or URL pointing to the METS/MODS XML to be converted Parse given METS and its meta-data, and convert it to TEI. If `--ocr` is given, then also read the ALTO full-text files from the fileGrp in `--text-group`, and convert page contents accordingly (in physical order). Decorate page boundaries with image and page numbers. Moreover, if `--add- refs` contains `page`, then reference the corresponding base image files (by file name) from `--img-group`. Likewise, if `--add-refs` contains `line`, then reference the corresponding textline segments (by XML ID) from `--text- group`. Output XML to `--output (use '-' for stdout), log to stderr.` Options: -O, --output FILENAME File path to write TEI output to -o, --ocr Serialize OCR into resulting TEI -T, --text-group TEXT File group which contains the full-text -I, --img-group TEXT File group which contains the images -r, --add-refs [page|line] -l, --log-level [DEBUG|INFO|WARN|ERROR|OFF] -h, --help Show this message and exit. ```

It reads METS XML via URL or file argument and prints the resulting TEI, including the extracted information from the MODS part of the METS.

Example:

mm2tei -O tei.xml "https://digital.slub-dresden.de/oai/?verb=GetRecord&metadataPrefix=mets&identifier=oai:de:slub-dresden:db:id-453779263"

mm-update

Installing mets-mods2tei also provides the command-line multi-cmd tool mm-update:

mm-update --help

``` Usage: mm-update [OPTIONS] COMMAND [ARGS]... Entry-point of multi-purpose CLI for DFG Viewer compliant METS updates Options: --version Show the version and exit. -l, --log-level [OFF|ERROR|WARN|INFO|DEBUG|TRACE] Log level -d, --directory WORKSPACE_DIR Changes the workspace folder location [default: METS_URL directory or .]" -m, --mets METS_URL The path/URL of the METS file [default: WORKSPACE_DIR/mets.xml] --backup Backup METS whenever it is saved. --help Show this message and exit. Commands: add-agent add agent headers, optionally from external METS add-file add a file reference, optionally as URL download download files into subdirectories, as path or URL remove-file remove all file references for a specific location,... remove-files remove all file references for a specific fileGrp / MIME... validate custom OcrdWorkspaceValidator ```

mm-update add-agent --help

``` Usage: mm-update add-agent [OPTIONS] add agent headers, optionally from external METS Options: -m, --mets TEXT copy metsHdr/agent from this file, too --help Show this message and exit. ```

mm-update add-file --help

``` Usage: mm-update add-file [OPTIONS] PATH add a file reference, optionally as URL Options: -G, --file-grp FILE_GRP fileGrp to add to [required] -m, --mimetype TYPE Media type of the file. Guessed from extension if not provided -g, --page-id PAGE_ID ID of the physical page (or empty if document- global) -u, --url-prefix TEXT URL prefix to add to path before storing references (or else keep local file refs) --help Show this message and exit. ```

mm-update remove-file --help

``` Usage: mm-update remove-file [OPTIONS] PATH remove all file references for a specific location, optionally as URL Options: -u, --url-prefix TEXT URL prefix to add to path before removing references (or else search verbatim file refs) --help Show this message and exit. ```

mm-update remove-files --help

``` Usage: mm-update remove-files [OPTIONS] remove all file references for a specific fileGrp / MIME type / page ID combination Options: -G, --file-grp FILE_GRP fileGrp to add to [required] -m, --mimetype TYPE Media type of the file. Guessed from extension if not provided -g, --page-id PAGE_ID ID of the physical page (or empty if document- global) --help Show this message and exit. ```

mm-update validate --help

``` Usage: mm-update validate [OPTIONS] custom OcrdWorkspaceValidator Options: -u, --url-prefix TEXT validate each file has this URL prefix --help Show this message and exit. ```

mm-update download --help

``` Usage: mm-update download [OPTIONS] download files into subdirectories, as path or URL Options: -G, --file-grp FILE_GRP fileGrp USE (or empty if all fileGrps) -g, --page-id PAGE_ID ID of the physical page (or empty if all pages) -p, --path-names [URL|GRP/ID.SUF] how to generate local path names (from URL or from fileGrp, file ID and suffix) [default: URL] -u, --url-prefix TEXT URL prefix to remove from path before storing downloaded files (to avoid creating host directories) -r, --reference [no-change|replace-by-local|insert-local|append-local] whether and how to update the FLocat reference in METS [default: no-change] --help Show this message and exit. ```

Example:

# dump files (without changing METS):
mm-update download -u https://digital.slub-dresden.de/data/kitodo/GottDie_453779263/
...
# add TEI
mm-update add-file -G TEI -m application/tei+xml -u https://digital.slub-dresden.de/data/kitodo/GottDie_453779263/ tei.xml
...
# remove old PDF:
mm-update remove-files -G DOWNLOAD
# add new PDF:
mm-update add-file -G DOWNLOAD -m application/pdf -u https://digital.slub-dresden.de/data/kitodo/GottDie_453779263/ -g PHYS_0001 pdf/file_0001.pdf
mm-update add-file -G DOWNLOAD -m application/pdf -u https://digital.slub-dresden.de/data/kitodo/GottDie_453779263/ -g PHYS_0002 pdf/file_0002.pdf
mm-update add-file -G DOWNLOAD -m application/pdf -u https://digital.slub-dresden.de/data/kitodo/GottDie_453779263/ -g PHYS_0003 pdf/file_0003.pdf
mm-update add-file -G DOWNLOAD -m application/pdf -u https://digital.slub-dresden.de/data/kitodo/GottDie_453779263/ pdf/all.pdf
...
# remove old ALTO:
mm-update remove-files -G FULLTEXT -g PHYS_0001
mm-update remove-files -G FULLTEXT -g PHYS_0002
mm-update remove-files -G FULLTEXT -g PHYS_0003
# add new ALTO:
mm-update add-file -G FULLTEXT -m text/xml -u https://digital.slub-dresden.de/data/kitodo/GottDie_453779263/ -g PHYS_0001 ocr/alto_0001.xml
mm-update add-file -G FULLTEXT -m text/xml -u https://digital.slub-dresden.de/data/kitodo/GottDie_453779263/ -g PHYS_0002 ocr/alto_0002.xml
mm-update add-file -G FULLTEXT -m text/xml -u https://digital.slub-dresden.de/data/kitodo/GottDie_453779263/ -g PHYS_0003 ocr/alto_0003.xml
...
# validate:
mm-update validate -u https://digital.slub-dresden.de/data/kitodo/GottDie_453779263/

Maintainers

If you have any questions or encounter any problems, please do not hesitate to contact us.