ulb-sachsen-anhalt / ocrd-odem

OCR Workflows based on OCR-D
MIT License

ULB ODEM


A project of the University and State Library Sachsen-Anhalt (ULB Sachsen-Anhalt) within OCR-D Phase III, funded by the DFG from 2021 to 2024, to implement an OCR-D-based workflow for full-text generation for existing digitized prints of "Drucke des 18. Jahrhunderts (VD18)".

Digitized prints are accessed as records via OAI-PMH from a record list which, at the start of the project, comprised about 40,000 prints (monographs and multivolume works) with about 6 million pages in total. The corresponding images are downloaded to a local worker machine, and each page is then processed individually with a complete OCR-D workflow. Afterwards, the results are transformed into ALTO-OCR, and an archive file is generated that contains a new, complete PDF of the print with a text layer. The resulting archive file complies with the SAF file format of DSpace systems such as Share_it.
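
As a rough illustration of the last step, the sketch below packs one PDF into an archive following the DSpace Simple Archive Format (SAF). It assumes the common SAF layout, i.e. an item directory holding a dublin_core.xml and a contents file that names the bitstreams. The helper name build_saf_archive, the directory name item_000, and the archive name saf_item.zip are hypothetical; the archives ODEM actually writes may contain additional files and metadata.

```python
import tempfile
import zipfile
from pathlib import Path
from xml.sax.saxutils import escape


def build_saf_archive(work_dir: Path, title: str, pdf_path: Path) -> Path:
    """Pack one PDF into a minimal SAF zip (hypothetical helper, not ODEM's code)."""
    item_dir = work_dir / "item_000"
    item_dir.mkdir(parents=True, exist_ok=True)

    # Descriptive metadata in DSpace's flat Dublin Core format.
    dublin_core = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        "<dublin_core>\n"
        f'  <dcvalue element="title" qualifier="none">{escape(title)}</dcvalue>\n'
        "</dublin_core>\n"
    )
    (item_dir / "dublin_core.xml").write_text(dublin_core, encoding="utf-8")

    # Copy the bitstream next to the metadata and list it in "contents".
    (item_dir / pdf_path.name).write_bytes(pdf_path.read_bytes())
    (item_dir / "contents").write_text(pdf_path.name + "\n", encoding="utf-8")

    # Zip the item directory so the archive can be handed to an import job.
    archive_path = work_dir / "saf_item.zip"
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for member in sorted(item_dir.iterdir()):
            archive.write(member, arcname=f"{item_dir.name}/{member.name}")
    return archive_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        tmp_dir = Path(tmp)
        pdf = tmp_dir / "print.pdf"
        pdf.write_bytes(b"%PDF-1.4\n")  # placeholder bytes, not a real PDF
        result = build_saf_archive(tmp_dir / "saf", "Example print", pdf)
        with zipfile.ZipFile(result) as archive:
            print(archive.namelist())
```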

Features

Runtime Requirements

Installation

# clone
git clone <repo-url> <local-dir>
cd <local-dir>

# setup python venv and activate it, so pip installs into the venv
python3 -m venv venv
. venv/bin/activate
pip install -U pip
pip install -r requirements.txt

# run tests
python -m pip install pytest-cov
python -m pytest --cov=lib tests/ -v

Configuration

Options can be found in the following sections:

See for example resources/odem.ocrd.tesseract.ini.
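
Because the configuration is a plain INI file, its sections and options can be inspected with Python's standard configparser module. The following is only a sketch: the section and option names (ocr, model, workers) are invented for the example and are not taken from resources/odem.ocrd.tesseract.ini.

```python
import configparser

# Stand-in INI text; the section and option names are hypothetical.
EXAMPLE_INI = """
[ocr]
model = example_model
workers = 4
"""


def load_options(ini_text: str) -> dict:
    """Parse INI text into a {section: {option: value}} dictionary."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return {section: dict(parser[section]) for section in parser.sections()}


if __name__ == "__main__":
    for section, values in load_options(EXAMPLE_INI).items():
        print(section, values)
```

To look at a real configuration file instead of the stand-in text, parser.read(path) can be used in place of parser.read_string(...). Note that configparser returns all values as strings.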

Trigger Workflow via Crontab

Usually there is a record list (a simple CSV file) in the backend, managed by the cli_record_server.py module, which needs to be started first. Please note that no authentication is included, so make sure the server runs only in closed network environments.
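
To make the idea of a CSV-backed record list more concrete, here is a minimal sketch of how the next unprocessed record could be picked from such a list. The column names (identifier, state) and the state values (open, done) are assumptions made for this example; the layout that cli_record_server.py actually expects is not described here.

```python
import csv
import io
from typing import Optional

# Hypothetical record list; the real column layout may differ.
EXAMPLE_LIST = (
    "identifier,state\n"
    "record-001,done\n"
    "record-002,open\n"
    "record-003,open\n"
)


def next_open_record(csv_text: str) -> Optional[str]:
    """Return the identifier of the first record still marked 'open', if any."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        if row["state"] == "open":
            return row["identifier"]
    return None


if __name__ == "__main__":
    print(next_open_record(EXAMPLE_LIST))
```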

ODEM client instances can be executed periodically, triggered by cron job entries on the server. Assuming there is a local installation in /home/ocr/odem and a custom configuration located in <PROJECT>/resources/, it may look like this:

Start server process:

cd /home/ocr/odem
python cli_record_server.py resources/odem.ocrd.tesseract.ini

Crontab entry for executing the actual worker:

PYTHON_BIN=/home/ocr/odem/venv/bin/python3
PROJECT=/home/ocr/odem
RECORD_LIST=oai-records-opendata-vd18-odem

*/5  08-23  * * *  ${PYTHON_BIN} ${PROJECT}/cli_record_server_client.py ${RECORD_LIST} -c ${PROJECT}/resources/odem.ocr-worker01.ini -l
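
The schedule fields of the entry above read as: every fifth minute, during the hours 08 through 23, on every day. The short sketch below spells that rule out in Python and counts how many times per day the worker would be triggered. The function name is_scheduled is made up for this illustration, and the check only models this one schedule, not cron syntax in general.

```python
from datetime import datetime


def is_scheduled(moment: datetime) -> bool:
    """True if the schedule `*/5 08-23 * * *` would fire in this minute."""
    return moment.minute % 5 == 0 and 8 <= moment.hour <= 23


# Count the matching minutes of one (arbitrary) day.
RUNS_PER_DAY = sum(
    is_scheduled(datetime(2024, 1, 1, hour, minute))
    for hour in range(24)
    for minute in range(60)
)

if __name__ == "__main__":
    print(RUNS_PER_DAY)
```

With 12 triggers per hour over 16 hours, this gives 192 worker starts per day.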

License

This project's source code is licensed under the terms of the MIT license.