slub / ocrd_manager

frontend for ocrd_controller and adapter towards ocrd_kitodo
MIT License

OCR-D Manager

OCR-D Manager is a server that mediates between Kitodo and OCR-D. It resides on the site of the Kitodo installation (so the actual OCR server can be managed independently) but runs in its own container (so Kitodo can be managed independently).

Specifically, it gets called by Kitodo.Production or Kitodo.Presentation to handle OCR for a document, and in turn calls the OCR-D Controller for workflow processing.

For an integration as a service container, orchestrated with other containers (Kitodo+Controller+Monitor), see this meta-repo.

OCR-D Manager is responsible for

It is currently implemented as an SSH login server with an installation of OCR-D core and an SSH client to connect to the Controller.

Usage

Building

Build or pull the Docker image:

make build # or docker pull ghcr.io/slub/ocrd_manager

Starting and mounting

Then run the container – providing a host-side directory for the volumes …

… but also files …

… and (optionally) some environment variables

… thus, for example:

make run DATA=/mnt/workspaces WORKFLOWS=/mnt/workflows KEYS=~/.ssh/id_rsa.pub PORT=9022 PRIVATE=~/.ssh/id_rsa

(You can also run the service via docker-compose manually – just cp .env.example .env and edit to your needs.)

General management

Then you can log in remotely as user ocrd (the following examples use manager as the host name, without loss of generality):

ssh -p 9022 ocrd@manager bash -i

(Typically though, you will run a non-interactive script, see next section.)

Processing

In the Manager, you can run shell scripts that do

The data management will depend on which Kitodo context you want to integrate into (Production 2 / 3 or Presentation).

From image to ALTO files

For Kitodo.Production, there is a preconfigured script process_images.sh (or for_production.sh) which takes the following arguments:

SYNOPSIS:

process_images.sh [OPTIONS] DIRECTORY

where OPTIONS can be any/all of:
 --lang LANGUAGE    overall language of the material to process via OCR
 --script SCRIPT    overall script of the material to process via OCR
 --workflow FILE    workflow file to use for processing, default:
                    ocr-workflow-default.sh
 --no-validate      skip comprehensive validation of workflow results
 --img-subdir IMG   name of the subdirectory to read images from, default:
                    images
 --ocr-subdir OCR   name of the subdirectory to write OCR results to, default:
                    ocr/alto
 --proc-id ID       process ID to communicate in ActiveMQ callback
 --task-id ID       task ID to communicate in ActiveMQ callback
 --help             show this message and exit

and DIRECTORY is the local path to process. The script will import
the images from DIRECTORY/IMG into a new (temporary) METS and
transfer this to the Controller for processing. After resyncing back
to the Manager, it will then extract OCR results and export them to
DIRECTORY/OCR.

If ActiveMQ is used, the script will exit directly after initialization,
and run processing in the background. Completion will then be signalled
via ActiveMQ network protocol (using the proc and task ID as message).

ENVIRONMENT VARIABLES:

 CONTROLLER: host name and port of OCR-D Controller for processing
 ACTIVEMQ: URL of ActiveMQ server for result callback (optional)
 ACTIVEMQ_CLIENT: path to ActiveMQ client library JAR file (optional)

The workflow parameter is optional and defaults to the preconfigured script ocr-workflow-default.sh which contains a trivial workflow:

It can be replaced with the (path) name of any workflow script mounted under /workflows or /data.

For example (assuming testdata is a directory with image files mounted under /data):

ssh -T -p 9022 ocrd@manager process_images.sh --proc-id 1 --task-id 3 --lang deu --script Fraktur --workflow myocr.sh testdata
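When the call is issued from a script rather than typed by hand, it can help to assemble the command line in one place. The following is a minimal sketch, not part of the Manager itself: the function name run_ocr_for_process and the variables MANAGER_HOST, MANAGER_PORT and DRY_RUN are hypothetical and only serve the illustration; the options are the ones listed in the synopsis above.

```sh
#!/bin/sh
# Sketch only: MANAGER_HOST, MANAGER_PORT and DRY_RUN are hypothetical names,
# not variables read by the Manager.

run_ocr_for_process() {
    # usage: run_ocr_for_process PROC_ID TASK_ID LANG SCRIPT DIRECTORY
    if [ "$#" -ne 5 ]; then
        echo "usage: run_ocr_for_process PROC_ID TASK_ID LANG SCRIPT DIRECTORY" >&2
        return 2
    fi
    # Rebuild the positional parameters as the full ssh command line.
    # Note: ssh joins the remote arguments with spaces and the remote shell
    # re-parses them, so DIRECTORY must not contain whitespace here.
    set -- ssh -T -p "${MANAGER_PORT:-9022}" "ocrd@${MANAGER_HOST:-manager}" \
        process_images.sh --proc-id "$1" --task-id "$2" \
        --lang "$3" --script "$4" "$5"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"    # print the command instead of running it
    else
        "$@"
    fi
}
```

With DRY_RUN=1, a call such as run_ocr_for_process 1 3 deu Fraktur testdata only prints the ssh command, which is a convenient way to check the arguments before anything is sent to the Manager.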

From METS to METS file

For Kitodo.Presentation, there is a preconfigured script process_mets.sh (or for_presentation.sh) which takes the following arguments:

SYNOPSIS:

process_mets.sh [OPTIONS] METS

where OPTIONS can be any/all of:
 --workflow FILE    workflow file to use for processing, default:
                    ocr-workflow-default.sh
 --no-validate      skip comprehensive validation of workflow results
 --pages RANGE      selection of physical page range to process
 --img-grp GRP      fileGrp to read input images from, default:
                    DEFAULT
 --ocr-grp GRP      fileGrp to write output OCR text to, default:
                    FULLTEXT
 --url-prefix URL   convert result text file refs from local to URL
                    and prefix them
 --help             show this message and exit

and METS is the path of the METS file to process. The script will copy
the METS into a new (temporary) workspace and transfer this to the
Controller for processing. After resyncing back, it will then extract
OCR results and copy them to METS (adding file references to the file
and copying files to the parent directory).

ENVIRONMENT VARIABLES:

 CONTROLLER: host name and port of OCR-D Controller for processing

For the workflow parameter, the same goes here as above.

For example (assuming testdata is a directory mounted under /data that contains the METS file mets.xml):

ssh -T -p 9022 ocrd@manager process_mets.sh --workflow myocr.sh testdata/mets.xml

Data transfer

For sharing data between the Manager and Controller, it is recommended to transfer files explicitly (as this will make the costs more measurable and controllable).

(This is currently implemented via rsync.)
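To illustrate what an explicit transfer could look like, here is a rough sketch of a push/pull helper around rsync. It is not the Manager's actual implementation. It assumes that CONTROLLER has the form host:port, that the remote user is ocrd, and that the workspace lives under the same relative path on both sides; the function name transfer_workspace and the DRY_RUN switch are hypothetical.

```sh
#!/bin/sh
# Sketch only. Assumptions: CONTROLLER looks like "host:port", the remote
# user is "ocrd", and the workspace path is the same on both sides.

transfer_workspace() {
    # usage: transfer_workspace push|pull WORKDIR
    local ctrl="${CONTROLLER:-controller:22}"
    local host="${ctrl%:*}"      # part before the last colon
    local port="${ctrl##*:}"     # part after the last colon
    local remote="ocrd@$host:$2/"
    case "$1" in
        push) set -- "$2/" "$remote" ;;   # Manager -> Controller
        pull) set -- "$remote" "$2/" ;;   # Controller -> Manager
        *) echo "usage: transfer_workspace push|pull WORKDIR" >&2; return 2 ;;
    esac
    # -a preserves file attributes, -z compresses in transit,
    # -e selects the remote shell (here: ssh on the Controller's port).
    set -- rsync -a -z -e "ssh -p $port" "$@"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"    # print the command instead of running it
    else
        "$@"
    fi
}
```

A round trip would then be transfer_workspace push ws1 before processing and transfer_workspace pull ws1 afterwards. Because each transfer is a separate command, its duration and volume can be measured individually.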

The data lifecycle should be:

(This is currently not managed.)

Logging

All logs are accumulated on standard output, which can be inspected via Docker:

docker logs ocrd_manager
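Since all jobs share one log stream, it is often useful to narrow the output down. The sketch below wraps docker logs (using its standard --since option) and filters for a fixed string, for example the name of the directory being processed. The function name manager_logs_grep and the MANAGER_CONTAINER variable are hypothetical; the container name ocrd_manager is taken from the command above.

```sh
#!/bin/sh
# Sketch only: manager_logs_grep and MANAGER_CONTAINER are hypothetical names.

manager_logs_grep() {
    # usage: manager_logs_grep STRING [SINCE]
    # SINCE is passed to "docker logs --since" (e.g. 30m, 2h), default: 1h
    local pattern="$1" since="${2:-1h}"
    # docker logs replays the container's stderr on stderr, so merge both
    # streams before filtering; -F treats STRING literally, not as a regex.
    docker logs --since "$since" "${MANAGER_CONTAINER:-ocrd_manager}" 2>&1 |
        grep -F -- "$pattern"
}
```

For example, manager_logs_grep testdata 2h would show the lines from the last two hours that mention testdata. For live output, docker logs --follow ocrd_manager can be used directly.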

Logs for all services can also be viewed on the Monitor web server.

Testing

After building and starting, you can use the test target for a round-trip:

make test DATA=/mnt/workspaces

This will download sample data and run the default workflow on it. (All logs still accumulate on the Docker output, so the shell itself will not print anything; see the Logging section above.)

(If the Manager has been started externally already, make sure to pass the correct value for the NETWORK variable – the makefile will then attempt to use docker exec instead of ssh ocrd@localhost to connect.)

To clean up the results, use:

make clean-testdata

Maintainers

If you have any questions or encounter any problems, please do not hesitate to contact us.