Path to network implementation of OCR-D
In the final form, the controller will implement (most parts of) the OCR-D Web API.
Build or pull the Docker image:
make build # or docker pull ghcr.io/slub/ocrd_controller
Then run the container – providing host-side directories for the volumes …
DATA
: directory for data processing (including images or existing workspaces)

MODELS
: directory for persistent storage of processor resource files, defaults to ~/.local/share; models will be under ./ocrd-resources/*

CONFIG
: directory for persistent storage of the processor resource list, defaults to ~/.config; the file will be under ./ocrd/resources.yml
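For orientation, the `make run` target essentially wires the volumes above into a `docker run` invocation. The following is only a sketch of what that expands to – the container-side mount points, port mapping, and environment defaults shown here are assumptions; the Makefile is authoritative:

```shell
# Hypothetical expansion of `make run` — mount points and
# option values are assumptions, check the Makefile:
docker run -d --name ocrd_controller \
  --network bridge -p 8022:22 \
  -v /mnt/workspaces:/data \
  -v ~/.local/share:/models \
  -v ~/.config:/config \
  -e UID=$(id -u) -e GID=$(id -g) -e WORKERS=3 \
  ghcr.io/slub/ocrd_controller
```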
… but also a file KEYS with public key credentials for login to the controller, and (optionally) some environment variables …
WORKERS
: number of parallel jobs (i.e. concurrent login sessions for ocrd); should be set to match the available computing resources

UID
: numerical user identifier to be used by programs in the container

GID
: numerical group identifier to be used by programs in the container

UMASK
: numerical user mask to be used by programs in the container

PORT
: numerical TCP port to expose the SSH server on the host side

NETWORK
: name of the Docker network to use, defaults to bridge (the default Docker network)

… thus, for example:
make run DATA=/mnt/workspaces MODELS=~/.local/share KEYS=~/.ssh/id_rsa.pub PORT=8022 WORKERS=3
Then you can log in as user ocrd from remote (but let's use controller in the following – without loss of generality):
ssh -p 8022 ocrd@controller bash -i
Unless your data is already organized in workspaces, you need to create workspaces prior to processing. For example:
ssh -p 8022 ocrd@controller "ocrd-import -P some-document"
For actual processing, you will first need to download some models into your MODELS volume:
ssh -p 8022 ocrd@controller "ocrd resmgr download ocrd-tesserocr-recognize *"
Subsequently, you can use these models on your DATA files:
ssh -p 8022 ocrd@controller "ocrd process -m some-document/mets.xml 'tesserocr-recognize -P segmentation_level region -P model Fraktur'"
# or equivalently:
ssh -p 8022 ocrd@controller "ocrd-tesserocr-recognize -m some-document/mets.xml -P segmentation_level region -P model Fraktur"
If your data files cannot be directly mounted on the host (not even as a network share), then you can use rsync, scp or sftp to transfer them to the server:
rsync --port 8022 -av some-directory ocrd@controller:/data
scp -P 8022 -r some-directory ocrd@controller:/data
echo put some-directory /data | sftp -P 8022 ocrd@controller
Analogously, to transfer the results back:
rsync --port 8022 -av ocrd@controller:/data/some-directory .
scp -P 8022 -r ocrd@controller:/data/some-directory .
echo get /data/some-directory | sftp -P 8022 ocrd@controller
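Putting the pieces together, a complete round trip (upload, import, process, download) can be scripted. This is only a sketch built from the individual examples above – the hostname, port, and document name are the same placeholders used throughout:

```shell
#!/bin/sh
# End-to-end sketch: transfer data in, create a workspace,
# run recognition, and fetch the results back.
set -e
rsync --port 8022 -av some-document ocrd@controller:/data
ssh -p 8022 ocrd@controller "ocrd-import -P some-document"
ssh -p 8022 ocrd@controller "ocrd process -m some-document/mets.xml \
  'tesserocr-recognize -P segmentation_level region -P model Fraktur'"
rsync --port 8022 -av ocrd@controller:/data/some-document .
```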
For parallel processing, you can simply issue several such commands concurrently – up to WORKERS jobs will run simultaneously, and the rest will be queued.
Note: internally, WORKERS is implemented as a (GNU parallel-based) semaphore wrapping the SSH sessions inside blocking sem --fg calls within .ssh/rc. Thus, commands will get queued but not processed until a worker is free.
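To illustrate that mechanism (this is not the actual .ssh/rc shipped in the image), GNU parallel's sem can bound any set of concurrent commands to a fixed number of slots, queueing the rest:

```shell
# Illustration of the sem-based queueing — 5 jobs, but at most
# WORKERS of them run at any one time; the others block for a slot.
WORKERS=3
for i in 1 2 3 4 5; do
  sem --id ocrd_workers -j "$WORKERS" --fg "echo job $i; sleep 1" &
done
wait
```

The --fg flag makes each sem call block until its job finishes, which is what turns surplus login sessions into waiting queue entries rather than rejected connections.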
All logs are accumulated on standard output, which can be inspected via Docker:
docker logs ocrd_controller
If you have any questions or encounter any problems, please do not hesitate to contact me.