Download GT, process with tesseract and calamari, evaluate with dinglehopper and ocrd-cor-asv-ann-evaluate
NOTE: This demo only shows how to run the evaluation. The choice of OCR engines, evaluation processors and models is entirely arbitrary and should not be construed as approval or disapproval.
wget https://ocr-d-repo.scc.kit.edu/api/v1/dataresources/dda89351-7596-46eb-9736-593a5e9593d3/data/luz_blitz_1784.ocrd.zip
Extract the data subdirectory of the ZIP (which contains the workspace):
unzip luz_blitz_1784.ocrd.zip 'data/*'
This workflow uses ocrd-olena-binarize (with the sauvola-ms-split algorithm) to binarize the images. The images are then recognized in two runs with tesseract (once with the Fraktur_GT4HistOCR model, once with the deu model) and in one run with calamari (with the qurator-gt4histocr-1.0 model).
ocrd process -m data/mets.xml \
"olena-binarize -I OCR-D-GT-SEG-LINE -O BIN" \
"tesserocr-recognize -P segmentation_level word -P textequiv_level line -P find_tables true -P model Fraktur_GT4HistOCR -I BIN -O TESS-GT4HIST" \
"tesserocr-recognize -P segmentation_level word -P textequiv_level line -P find_tables true -P model deu -I BIN -O TESS-DEU" \
"calamari-recognize -P checkpoint_dir qurator-gt4histocr-1.0 -I BIN -O CALA-GT4HIST"
This allows us to compare the files in TESS-GT4HIST, TESS-DEU and CALA-GT4HIST with each other and with the GT in OCR-D-GT-SEG-LINE.
ocrd-cor-asv-ann-evaluate -m data/mets.xml -I OCR-D-GT-SEG-LINE,TESS-GT4HIST,TESS-DEU,CALA-GT4HIST -O EVAL-ASV
The results are JSON files in the EVAL-ASV filegroup with workspace-wise and page-wise line-by-line distance and variance between the GT and each of the engines.
data/EVAL-ASV/EVAL-ASV.json contains the metrics (mean CER and variance) for the full workspace:
{
  "OCR-D-GT-SEG-LINE,TESS-GT4HIST": {
    "length": 110,
    "distance-mean": 0.032638315863287554,
    "distance-varia": 0.010120613640730372
  },
  "OCR-D-GT-SEG-LINE,TESS-DEU": {
    "length": 110,
    "distance-mean": 0.17414861150552538,
    "distance-varia": 0.030377095996637286
  },
  "OCR-D-GT-SEG-LINE,CALA-GT4HIST": {
    "length": 110,
    "distance-mean": 0.044792427193718676,
    "distance-varia": 0.01339440642349274
  }
}
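To compare the engines at a glance, the workspace-wide report can be loaded and sorted by mean CER. The following is a minimal sketch and not part of ocrd-cor-asv-ann-evaluate: the helper name rank_engines is hypothetical, and the report is inlined from the listing above instead of being read from data/EVAL-ASV/EVAL-ASV.json, so that the example is self-contained. It assumes that every key has the form "GT group,OCR group", as in the listing.

```python
import json
import math

# Workspace-wide report, copied from the listing above.
# In practice it could be read with:
#   with open("data/EVAL-ASV/EVAL-ASV.json") as f: REPORT = json.load(f)
REPORT = json.loads("""
{
  "OCR-D-GT-SEG-LINE,TESS-GT4HIST": {
    "length": 110,
    "distance-mean": 0.032638315863287554,
    "distance-varia": 0.010120613640730372
  },
  "OCR-D-GT-SEG-LINE,TESS-DEU": {
    "length": 110,
    "distance-mean": 0.17414861150552538,
    "distance-varia": 0.030377095996637286
  },
  "OCR-D-GT-SEG-LINE,CALA-GT4HIST": {
    "length": 110,
    "distance-mean": 0.044792427193718676,
    "distance-varia": 0.01339440642349274
  }
}
""")


def rank_engines(report):
    """Return (ocr_group, mean_cer, std_dev) tuples, lowest mean CER first."""
    rows = []
    for pair, metrics in report.items():
        # Assumption: keys look like "<GT group>,<OCR group>".
        _gt_group, ocr_group = pair.split(",", 1)
        mean_cer = metrics["distance-mean"]
        # The standard deviation is easier to read next to the mean
        # than the variance, since it is on the same scale.
        std_dev = math.sqrt(metrics["distance-varia"])
        rows.append((ocr_group, mean_cer, std_dev))
    return sorted(rows, key=lambda row: row[1])


if __name__ == "__main__":
    for ocr_group, mean_cer, std_dev in rank_engines(REPORT):
        print(f"{ocr_group:14s} mean CER {mean_cer:.2%} (std dev {std_dev:.2%})")
```

On this workspace, TESS-GT4HIST has the lowest mean CER, followed by CALA-GT4HIST and then TESS-DEU.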
data/EVAL-ASV/EVAL-ASV_0003.json contains the metrics for page 3.
ocrd-dinglehopper -m data/mets.xml -P textequiv_level line -I OCR-D-GT-SEG-LINE,CALA-GT4HIST -O EVAL-DINGLE
The results are HTML files (diff view) and JSON files (with CER and WER) in the EVAL-DINGLE filegroup.
HTML for page 3:
JSON for page 3:
{
  "gt": "OCR-D-GT-SEG-LINE/OCR-D-GT-SEG-LINE_0003.xml",
  "ocr": "CALA-GT4HIST/CALA-GT4HIST_0003.xml",
  "cer": 0.07770472205618649,
  "wer": 0.1320754716981132,
  "n_characters": 1673,
  "n_words": 265
}
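Since each JSON report carries both the error rates and the reference lengths, the absolute error counts can be recovered from it, and several page reports can be combined into one figure weighted by page length. The following is a minimal sketch under the assumption that CER and WER are the edit distance divided by n_characters and n_words respectively. The helper names error_counts and micro_average are hypothetical, and the page report is inlined from the listing above.

```python
# Page report, copied from the listing above.
# Assumption: in practice the page reports would be loaded from the JSON
# files that ocrd-dinglehopper wrote to the EVAL-DINGLE filegroup.
PAGE_3 = {
    "gt": "OCR-D-GT-SEG-LINE/OCR-D-GT-SEG-LINE_0003.xml",
    "ocr": "CALA-GT4HIST/CALA-GT4HIST_0003.xml",
    "cer": 0.07770472205618649,
    "wer": 0.1320754716981132,
    "n_characters": 1673,
    "n_words": 265,
}


def error_counts(report):
    """Recover absolute character and word error counts from the rates."""
    char_errors = round(report["cer"] * report["n_characters"])
    word_errors = round(report["wer"] * report["n_words"])
    return char_errors, word_errors


def micro_average(reports):
    """Combine page reports into one (CER, WER), weighting pages by length.

    Raises ZeroDivisionError if the reports contain no characters or words.
    """
    total_chars = sum(r["n_characters"] for r in reports)
    total_words = sum(r["n_words"] for r in reports)
    char_errors = sum(error_counts(r)[0] for r in reports)
    word_errors = sum(error_counts(r)[1] for r in reports)
    return char_errors / total_chars, word_errors / total_words


if __name__ == "__main__":
    print(error_counts(PAGE_3))      # (130, 35)
    print(micro_average([PAGE_3]))
```

For page 3 this yields 130 character errors in 1673 characters and 35 word errors in 265 words. Weighting by length keeps short pages from dominating the result, which a plain mean of the page-wise rates would not do.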
Show diff view in browse-ocrd (https://github.com/hnesk/browse-ocrd/tree/diff-view)
The whole workflow, including both evaluation steps, can also be run as a single ocrd process call:
ocrd process -m data/mets.xml \
"olena-binarize -I OCR-D-GT-SEG-LINE -O BIN" \
"tesserocr-recognize -P segmentation_level word -P textequiv_level line -P find_tables true -P model Fraktur_GT4HistOCR -I BIN -O TESS-GT4HIST" \
"tesserocr-recognize -P segmentation_level word -P textequiv_level line -P find_tables true -P model deu -I BIN -O TESS-DEU" \
"calamari-recognize -P checkpoint_dir qurator-gt4histocr-1.0 -I BIN -O CALA-GT4HIST" \
"cor-asv-ann-evaluate -I OCR-D-GT-SEG-LINE,TESS-GT4HIST,TESS-DEU,CALA-GT4HIST -O EVAL-ASV" \
"dinglehopper -P textequiv_level line -I OCR-D-GT-SEG-LINE,CALA-GT4HIST -O EVAL-DINGLE"