qurator-spk / ocrd_repair_inconsistencies

Automatically re-order lines, words and glyphs to become textually consistent with their parents.
Apache License 2.0

[!CAUTION] This was a one-off script, written to solve a specific problem. We no longer maintain it, but if you want to use it, we would appreciate an e-mail to mike.gerber@sbb.spk-berlin.de 🕸

ocrd_repair_inconsistencies

Introduction

PAGE-XML elements with textual annotation are re-ordered by their centroid coordinates if and only if such re-ordering resolves the inconsistency between the appropriate concatenation of their TextEquiv texts and their parent's TextEquiv text.

If TextEquiv is missing, skip the respective elements.
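The rule above can be illustrated with a minimal sketch. It is not the processor's actual code: the `Element` class, the `centroid` and `repair_order` helpers, and the vertex-mean centroid are hypothetical stand-ins, and the caller is assumed to supply the joiner string (for example a space between words) and the sort axis.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Element:
    """Hypothetical stand-in for a PAGE-XML line, word or glyph."""
    text: Optional[str]                # TextEquiv text, None if missing
    points: List[Tuple[float, float]]  # polygon vertices of the element


def centroid(points):
    """Mean of the polygon vertices (a simple approximation of the centroid)."""
    xs, ys = zip(*points)
    return sum(xs) / len(xs), sum(ys) / len(ys)


def repair_order(parent_text, children, joiner, axis):
    """Return the children sorted along `axis` (0 = x, 1 = y) if and only if
    that makes the joined child texts equal to `parent_text`.
    In every other case the original list is returned unchanged."""
    if parent_text is None or any(c.text is None for c in children):
        return children  # missing TextEquiv: skip
    if joiner.join(c.text for c in children) == parent_text:
        return children  # already consistent
    candidate = sorted(children, key=lambda c: centroid(c.points)[axis])
    if joiner.join(c.text for c in candidate) == parent_text:
        return candidate  # sorting fixes the inconsistency
    return children  # sorting does not help: leave untouched


# Two words stored in the wrong order under a line that reads "Hello world".
words = [
    Element("world", [(60, 0), (100, 0), (100, 10), (60, 10)]),
    Element("Hello", [(0, 0), (50, 0), (50, 10), (0, 10)]),
]
print([w.text for w in repair_order("Hello world", words, " ", 0)])
# prints ['Hello', 'world']
```

Returning the input unchanged whenever sorting does not reproduce the parent text keeps the sketch conservative: an element order is only ever changed when the result is provably consistent.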

Where available, respect the annotated visual order:

This processor does not affect the ReadingOrder between regions, only the order of the XML elements below the region level, and only where that does not contradict the annotated textLineOrder/readingDirection.
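One way to picture how an annotated direction could constrain the sorting is the sketch below. Everything in it is an assumption made for illustration: the `sort_spec` helper, the two lookup tables, the defaults (`top-to-bottom` for lines, `left-to-right` for words and glyphs) and the choice to skip any direction that runs across the sort axis are not taken from the processor itself.

```python
from typing import Optional, Tuple

# (axis, reverse): axis 0 sorts by centroid x, axis 1 sorts by centroid y.
LINE_ORDERS = {"top-to-bottom": (1, False), "bottom-to-top": (1, True)}
WORD_DIRECTIONS = {"left-to-right": (0, False), "right-to-left": (0, True)}


def sort_spec(level: str, annotated: Optional[str]) -> Optional[Tuple[int, bool]]:
    """Return (axis, reverse) for sorting the children at `level`,
    or None if the annotated direction means the element should be skipped.

    `level` is "line" for lines within a region; any other value is treated
    as words within a line or glyphs within a word.
    """
    if level == "line":
        # Assumed default when no textLineOrder is annotated.
        return LINE_ORDERS.get(annotated or "top-to-bottom")
    # Assumed default when no readingDirection is annotated.
    return WORD_DIRECTIONS.get(annotated or "left-to-right")


print(sort_spec("line", None))             # prints (1, False)
print(sort_spec("line", "left-to-right"))  # prints None
```

A `None` result stands for "leave this element alone", so an annotation the sketch cannot honour never leads to a re-ordering.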

We wrote this as a one-shot script to fix some files. Use with caution.

Installation

(In your venv, run:)

make deps     # or pip install -r requirements.txt
make install  # or pip install .

Usage

Offers the following user interfaces:

OCR-D processor CLI ocrd-repair-inconsistencies

To be used with PageXML documents in an OCR-D annotation workflow.

Example

Use the following script to repair the OCR-D-GT-PAGE annotation in a workspace and, on success, replace it with the repaired output:

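#!/bin/bash
# Abort on the first failing command, so the originals are only overwritten on success.
set -e

# Temporary output file group with a random suffix.
tmp_fg=FIXED_$RANDOM

# Write the repaired PAGE-XML files into the temporary file group.
ocrd-repair-inconsistencies -I OCR-D-GT-PAGE -O "$tmp_fg"

# Copy each repaired file over its original, swapping the file name prefix back to OCR-D-GT-PAGE.
for f in "$tmp_fg"/*; do
  g="OCR-D-GT-PAGE/OCR-D-GT-PAGE_${f#${tmp_fg}/${tmp_fg}_}"
  cp "$f" "$g"
done

# Remove the temporary file group from the workspace.
ocrd workspace remove-group -rf "$tmp_fg"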