poke1024 / origami

A suite of batches and tools for OCR tasks.

🛑🛑🛑 IMPORTANT NOTE FROM 2022/04/24: ORIGAMI IS NOW LEGACY 🛑🛑🛑

Origami's segmentation model was trained on an old version of TensorFlow. To run it, you need a working installation of TensorFlow no newer than 2.1.x, which has become impractical on essentially all current OS / CPU / GPU configurations.

Origami's OCR uses Calamari v1, which is now similarly difficult to install.

Therefore, Origami has been retired. The repository status has been changed to "archive".

Origami

Origami is a self-contained suite of batches and tools for OCR processing of historical newspapers. It covers many essential steps in a digitization pipeline, including (1) building training data for models, and (2) generating Page-XML OCR output from pages using trained models.

Apart from its specific features, Origami is

Origami's current default implementation features:

Origami also provides additional tools for:

Installing Origami

We provide two options for installing Origami: with Docker, or locally.

Make sure you take a look at the scripts under quickstart.

Installing with Docker

  1. Download and install Docker.

  2. Install the NVIDIA container toolkit (necessary for GPU usage). See here for installation instructions.

  3. Build the Docker image (NOTE: this can take 20 minutes or more, as the build compiles scikit-geometry from source):

    cd docker
    docker buildx build -t "origami:origami-gpu" .

    This creates a docker image origami:origami-gpu.

  4. Launch the container. You must specify the location of your local copy of the Origami repo, as shown below:

    docker run --gpus all -it --rm -v /the/local/path/to/origami/:/origami origami:origami-gpu bash

    This runs the container and presents you with an interactive shell, ready to run Origami (located in /origami/).

    NOTE: Origami requires some additional set-up to run (e.g., downloading the segmentation models). See below for details.

Installing Locally

With conda (recommended)

If you have access to conda, it is easiest to use the conda environment descriptions under requirements, as in, for example, conda env create -f requirements/origami_cpu.yml.

Note that the requirements have been split into a GPU part (necessary for the Origami segment and ocr stages) and a CPU part (suitable for all other Origami stages). This simplifies dependency management with TensorFlow. It is also the split you would usually choose when running this system on a cluster that is separated into GPU and CPU nodes.

Without conda

Take a look at requirements/legacy and try the following:

cd origami
conda create --name origami python=3.7 -c defaults -c conda-forge --file requirements/legacy/conda.txt
conda activate origami
pip install -r requirements/legacy/pip.txt

Troubleshooting scikit-geometry

On some systems (e.g. macOS 10.15.7) the conda installation of scikit-geometry is broken. In these cases, you can build scikit-geometry from source:

conda activate origami
git clone https://github.com/scikit-geometry/scikit-geometry
cd scikit-geometry
python setup.py install

General Usage

cd /path/to/origami
python -m origami.batch.detect.segment

All command line tools will give you help information on their arguments when called as above.

The given data path should contain processed pages as images. Generated data is put into the same path. Images may be structured into any hierarchy of sub folders.
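
To check which files a batch would find under a data path, a minimal sketch like the following may help. It only walks the folder tree; the list of image extensions is an assumption for illustration, not a statement about which formats Origami accepts, and find_page_images is a hypothetical helper name.

```python
from pathlib import Path

# Assumption: these extensions are illustrative only; check which image
# formats your Origami installation actually accepts.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def find_page_images(data_path):
    """Return all image files below data_path, sorted, at any folder depth."""
    root = Path(data_path)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
```

Calling find_page_images("/path/to/data") returns pathlib.Path objects for images in the top-level folder and in all nested sub folders.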

Make sure you take a look at the scripts under quickstart for an example of a complete pipeline.

Batches

Artifacts

Origami's processing happens in separate stages, with batches that read information from and write information to well-defined files (also called artifacts). Each batch creates and depends upon various artifacts, as shown in the following table. Rows depict artifacts, columns depict detection batches (i.e. the batches found under origami.batch.detect). Blank circles indicate a read, filled circles indicate a write. As illustrated here, later batches depend on information provided by earlier batches.

Click on the names of the artifacts (left column) or batches (top row) below to get more information.

Batches (top row): segment, contours, flow, dewarp, layout, lines, order, ocr, compose

Artifacts (left column):
page image
segment.zip
contours.0.zip
flow.zip
lines.0.zip
contours.1.zip
dewarp.zip
contours.2.zip
tables.json
contours.3.zip
lines.3.zip
order.json
ocr.zip
compose.zip

Running Batches

Order

Given an OCR model, and as illustrated in the table in the previous section, the necessary order of detection batches for performing OCR on a folder of documents is:

  1. segment
  2. contours
  3. flow
  4. dewarp
  5. layout
  6. lines
  7. order
  8. ocr
  9. compose
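
The order above can be sketched as a small driver script. This is only an illustration under stated assumptions: it assumes each batch takes the data path as its positional argument (as in the compose example near the end of this document), and build_commands, run_pipeline, and extra_args are hypothetical helper names. The segment batch needs a --model argument; the ocr batch also needs a model, and since the name of that argument is not given here, check its command line help.

```python
import subprocess
import sys

# Order of detection batches as listed above.
BATCH_ORDER = [
    "segment", "contours", "flow", "dewarp", "layout",
    "lines", "order", "ocr", "compose",
]


def build_commands(data_path, extra_args=None):
    """Build one `python -m origami.batch.detect.<batch>` command per stage.

    extra_args (hypothetical parameter) maps a batch name to a list of
    additional arguments, e.g. {"segment": ["--model", "/path/to/model"]}.
    """
    extra_args = extra_args or {}
    commands = []
    for name in BATCH_ORDER:
        cmd = [sys.executable, "-m", f"origami.batch.detect.{name}", str(data_path)]
        cmd.extend(extra_args.get(name, []))
        commands.append(cmd)
    return commands


def run_pipeline(data_path, extra_args=None):
    """Run the stages one after another and stop at the first failure."""
    for cmd in build_commands(data_path, extra_args):
        subprocess.run(cmd, check=True)
```

Building the commands separately from running them makes it easy to print and review them before starting a long run.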

Concurrency

Batch processes can be run concurrently. Origami supports locking either through files or through a database (see --lock-strategy). The database strategy is more compatible and is the default. Use --lock-database to specify the path to a lock database (if none is specified, Origami will create one in your data folder).
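
To illustrate why concurrent workers need a lock at all, here is a minimal sketch of the general idea behind file-based locking: a worker claims a page by atomically creating a lock file, and any other worker that finds the file skips the page. This is a generic illustration and not Origami's implementation; try_claim and release are hypothetical names.

```python
import os


def try_claim(lock_path):
    """Try to claim a work item by creating its lock file atomically.

    Returns True if this process created the file (and so owns the item),
    False if another process already holds it.
    """
    try:
        # O_CREAT | O_EXCL fails if the file already exists, so only one
        # process can succeed for a given path.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))
    return True


def release(lock_path):
    """Give up a claim so that other workers may pick up the item."""
    os.remove(lock_path)
```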

Modifying Results

You can replace individual Origami pipeline stages (batches) with custom implementations that read and write Origami's artifacts in the documented file formats.

It is also possible to run Origami stages and then postprocess the generated artifacts before continuing with later stages.
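
Before writing a custom stage or a postprocessing step, it can help to look inside an artifact. The sketch below only lists the entries of a zip artifact using Python's standard zipfile module; it assumes nothing about the entries themselves, whose formats are described in Origami's own documentation. list_artifact_members is a hypothetical helper name, and where an artifact is stored relative to its page image is not covered here.

```python
import zipfile


def list_artifact_members(artifact_path):
    """Return (name, size in bytes) for every entry in a zip artifact."""
    with zipfile.ZipFile(artifact_path) as zf:
        return [(info.filename, info.file_size) for info in zf.infolist()]
```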

The Detection Batches

segment

origami.batch.detect.segment
Performs segmentation (e.g. separation into text and background) on all images using a neural network model.
If you have not trained a custom model, you should download and use Origami's default model. You need to specify the path to that downloaded model via the `--model` argument when calling `origami.batch.detect.segment`.
The predicted classes and labels are embedded in the specified model.

contours

origami.batch.detect.contours
From the pixelwise segmentation information, detects connected components to produce vectorized polygonal contours for blocks and separator lines.
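
As a rough illustration of the connected-components step, the sketch below labels 4-connected regions in a small binary mask and computes their bounding boxes. It is a textbook flood fill in plain Python, not Origami's code, and it stops at bounding boxes rather than producing polygonal contours; the function names are hypothetical.

```python
from collections import deque


def connected_components(mask):
    """Label 4-connected regions of truthy cells in a 2D list.

    Returns a list of components, each a list of (row, col) pixels.
    """
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill starting at an unvisited pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append(pixels)
    return components


def bounding_box(pixels):
    """Axis-aligned box (min_row, min_col, max_row, max_col) of a component."""
    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]
    return min(ys), min(xs), max(ys), max(xs)
```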

flow

origami.batch.detect.flow
Detects baselines and warping in separators to produce an overall description of page curvature.

dewarp

origami.batch.detect.dewarp
Creates a dewarping transformation that is used in subsequent stages.

layout

origami.batch.detect.layout
Refines regions by fixing over- and under-segmentation via heuristic rules.

lines

origami.batch.detect.lines
Detects baselines and line boundaries for each text line.

order

origami.batch.detect.order
Finds a reading order using a variant of the XY Cut algorithm.
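
For readers unfamiliar with XY cut, here is a minimal sketch of the textbook recursive version on axis-aligned boxes. Origami uses a variant, so this is not its implementation. Boxes are assumed to be (x0, y0, x1, y1) tuples in image coordinates with y growing downward, and trying vertical cuts (columns) before horizontal cuts is a choice made for this sketch.

```python
def _split(boxes, lo, hi):
    """Group boxes separated by empty gaps along one axis.

    lo and hi are the tuple indices of a box's start and end coordinate
    on that axis. Returns a list of groups ordered along the axis.
    """
    ordered = sorted(boxes, key=lambda b: b[lo])
    groups = [[ordered[0]]]
    reach = ordered[0][hi]
    for box in ordered[1:]:
        if box[lo] > reach:  # empty gap found: start a new group
            groups.append([box])
        else:
            groups[-1].append(box)
        reach = max(reach, box[hi])
    return groups


def xy_cut(boxes):
    """Return boxes (x0, y0, x1, y1) in reading order via recursive XY cut."""
    if len(boxes) <= 1:
        return list(boxes)
    # Try vertical cuts first (columns, left to right), then horizontal
    # cuts (rows, top to bottom).
    for lo, hi in ((0, 2), (1, 3)):
        groups = _split(boxes, lo, hi)
        if len(groups) > 1:
            return [b for g in groups for b in xy_cut(g)]
    # No cut possible: fall back to top-to-bottom, left-to-right order.
    return sorted(boxes, key=lambda b: (b[1], b[0]))
```

For a page with a full-width header above two columns, no vertical cut exists at the top level, so the header is split off by a horizontal cut first and the columns are then read left to right.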

ocr

origami.batch.detect.ocr
Performs OCR on each detected line using the specified Calamari OCR model. For more details on OCR models, see the section on Origami OCR models.

compose

origami.batch.detect.compose
Composes text into one file using the detected reading order. Can also produce PageXML output.

Debugging

origami.batch.detect.stats
Prints out statistics on computed artifacts and errors. This is useful for understanding how many pages were processed, and for which stages this processing is finished.
origami.batch.annotate.contours
Produces debug images for understanding the result of the contours batch stage.
origami.batch.annotate.lines
Produces debug images for understanding the line detection stage.
origami.batch.annotate.layout
Produces debug images for understanding the result of the layout and order batch stage.

Tools for Ground Truth and Evaluation

Tools

origami.tool.annotate
Tool for annotating, viewing and searching for ground truth.
origami.tool.pick
Tool for adding or removing single lines from the ground truth for fine tuning.
origami.tool.sample
Create a new annotation database by randomly sampling lines from a corpus. The details of sampling (numbers of items for each segmentation label type per page) can be specified. Allows import of transcriptions stored in accompanying PageXML. See command line help for more details.
origami.tool.schema
Run an annotation normalization schema on the given ground truth text files.
origami.tool.export
From the given annotation database, export line images of the specified height and binarization together with accompanying ground truth text files. Annotation normalization through a schema is supported. Use this command to generate training data for Calamari. See command line help for details.
origami.tool.xycut
Debug internal X-Y cut implementation.
origami.batch.export.lines (debugging only)
Export images of lines detected during lines batch.
origami.batch.export.pagexml (debugging only)
Export polygons of lines detected during lines batch as PageXML.

How to create ground truth

For generating ground truth for training an OCR engine from a corpus, we suggest this general process:

Origami Models

For line-based OCR, Origami uses Calamari internally and therefore can be used with any Calamari model.

However, Origami's way of segmenting lines is slightly different from other pipelines: lines are not binarized and they are not scaled horizontally (therefore they might be wider than what some models are trained on).

One model specifically trained for Origami is the model used to perform OCR on the Berliner Börsen-Zeitung. The model (and more context on its training) is available under https://github.com/poke1024/origami_models

Another suitable model is the GT4HistOCR model for Calamari. Note that you need to enable binarization in the OCR for the latter.

Evaluation via Dinglehopper

To evaluate performance using Dinglehopper, you probably want to use:

python -m origami.batch.utils.evaluate DATA_PATH

Alternatively, you can create the PageXML files manually:

python -m origami.batch.detect.compose DATA_PATH \
    --page-xml --only-page-xml-regions \
    --regions regions/TEXT \
    --ignore-letters "{}[]"
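
As background for reading the evaluation results, the sketch below shows how character and word error rates are commonly defined: the edit distance between ground truth and OCR output, divided by the length of the ground truth. It is a plain illustration and not Dinglehopper's code; Dinglehopper's own text normalization and alignment are more involved, so its numbers can differ from what this sketch produces.

```python
def levenshtein(a, b):
    """Edit distance between sequences a and b (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (x != y),  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Character-level edit distance divided by the reference length."""
    if not reference:
        return 0.0 if not hypothesis else float("inf")
    return levenshtein(reference, hypothesis) / len(reference)


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if not ref_words:
        return 0.0 if not hyp_words else float("inf")
    return levenshtein(ref_words, hyp_words) / len(ref_words)
```

Here reference is the ground truth text and hypothesis is the OCR output; both functions compare Python strings as they are, without any Unicode normalization.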