kba / vdhd-2021-05-12


vDHd Demo Evaluation

Download GT, process with tesseract and calamari, evaluate with dinglehopper and ocrd-cor-asv-ann-evaluate

NOTE: This demo only shows how to run the evaluation. The choice of OCR engines, evaluation processors and models is entirely arbitrary and should not be construed as approval or disapproval of any of them.

Browse to and download from OCR-D GT Repo

wget https://ocr-d-repo.scc.kit.edu/api/v1/dataresources/dda89351-7596-46eb-9736-593a5e9593d3/data/luz_blitz_1784.ocrd.zip

Extract the OCRD-ZIP

Extract the data subdirectory of the ZIP (which contains the workspace)

unzip luz_blitz_1784.ocrd.zip 'data/*'
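For scripted setups, the same extraction can be sketched in Python with the standard-library `zipfile` module. The helper name `extract_data_dir` is hypothetical, and the sketch assumes only that the workspace sits under `data/` inside the OCRD-ZIP, as described above.

```python
import zipfile
from pathlib import Path


def extract_data_dir(zip_path, dest="."):
    """Extract only the members under data/ from an OCRD-ZIP.

    Hypothetical helper; mirrors `unzip <zip> 'data/*'`.
    Returns the paths of the extracted members.
    """
    with zipfile.ZipFile(zip_path) as zf:
        # Keep only entries inside the data/ subdirectory (the workspace).
        members = [name for name in zf.namelist() if name.startswith("data/")]
        zf.extractall(path=dest, members=members)
    return [Path(dest) / name for name in members]


# Usage (assumes the ZIP has been downloaded to the current directory):
# extract_data_dir("luz_blitz_1784.ocrd.zip")
```

After this call, `data/mets.xml` should be available relative to `dest`, just as after the `unzip` command.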

Run small workflows for OCR results with tesseract and calamari, compare output

This workflow uses ocrd-olena-binarize (with the sauvola-ms-split algorithm) to binarize the images. The binarized images are then recognized three times: twice with tesseract (once with the Fraktur_GT4HistOCR model, once with deu) and once with calamari (with the qurator-gt4histocr-1.0 model).

ocrd process -m data/mets.xml \
  "olena-binarize -I OCR-D-GT-SEG-LINE -O BIN" \
  "tesserocr-recognize -P segmentation_level word -P textequiv_level line -P find_tables true -P model Fraktur_GT4HistOCR -I BIN -O TESS-GT4HIST" \
  "tesserocr-recognize -P segmentation_level word -P textequiv_level line -P find_tables true -P model deu -I BIN -O TESS-DEU" \
  "calamari-recognize -P checkpoint_dir qurator-gt4histocr-1.0 -I BIN -O CALA-GT4HIST"

This allows us to compare the files in TESS-GT4HIST, TESS-DEU and CALA-GT4HIST with each other and with the GT in OCR-D-GT-SEG-LINE.
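Before comparing, it can help to confirm that the workflow actually wrote all expected filegroups. The following is a minimal sketch that reads `mets.xml` with the standard-library XML parser rather than the OCR-D API; the helper name `list_file_groups` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace of the METS standard.
METS_NS = "http://www.loc.gov/METS/"


def list_file_groups(mets_path):
    """Map each mets:fileGrp USE value to the number of files it contains.

    Hypothetical helper for a quick sanity check of a workspace.
    """
    root = ET.parse(mets_path).getroot()
    groups = {}
    for grp in root.iter(f"{{{METS_NS}}}fileGrp"):
        # Count only the mets:file elements directly inside this group.
        groups[grp.get("USE")] = len(grp.findall(f"{{{METS_NS}}}file"))
    return groups


# Usage (after running the workflow above):
# groups = list_file_groups("data/mets.xml")
# expected = ["OCR-D-GT-SEG-LINE", "TESS-GT4HIST", "TESS-DEU", "CALA-GT4HIST"]
# print([name for name in expected if name not in groups])  # missing groups
```

If the workflow ran through, each of the three OCR filegroups should hold as many files as the GT filegroup.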

Compare all the OCR results with the GT using ocrd-cor-asv-ann-evaluate

ocrd-cor-asv-ann-evaluate -m data/mets.xml -I OCR-D-GT-SEG-LINE,TESS-GT4HIST,TESS-DEU,CALA-GT4HIST -O EVAL-ASV

The results are JSON files in the EVAL-ASV filegroup. They report, for the whole workspace and for each page, the mean and variance of the line-by-line distance between the GT and each engine's output.

data/EVAL-ASV/EVAL-ASV.json contains the metrics (mean CER and variance) for the full workspace:

{
  "OCR-D-GT-SEG-LINE,TESS-GT4HIST": {
    "length": 110,
    "distance-mean": 0.032638315863287554,
    "distance-varia": 0.010120613640730372
  },
  "OCR-D-GT-SEG-LINE,TESS-DEU": {
    "length": 110,
    "distance-mean": 0.17414861150552538,
    "distance-varia": 0.030377095996637286
  },
  "OCR-D-GT-SEG-LINE,CALA-GT4HIST": {
    "length": 110,
    "distance-mean": 0.044792427193718676,
    "distance-varia": 0.01339440642349274
  }
}
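To read the report at a glance, the engines can be ranked by mean CER. The sketch below embeds the values shown above and additionally derives a standard deviation as the square root of `distance-varia`, assuming that key holds the variance as described. The function name `rank_by_mean_cer` is hypothetical.

```python
import json
import math

# Values copied from data/EVAL-ASV/EVAL-ASV.json as shown above.
REPORT = json.loads("""
{
  "OCR-D-GT-SEG-LINE,TESS-GT4HIST": {
    "length": 110,
    "distance-mean": 0.032638315863287554,
    "distance-varia": 0.010120613640730372
  },
  "OCR-D-GT-SEG-LINE,TESS-DEU": {
    "length": 110,
    "distance-mean": 0.17414861150552538,
    "distance-varia": 0.030377095996637286
  },
  "OCR-D-GT-SEG-LINE,CALA-GT4HIST": {
    "length": 110,
    "distance-mean": 0.044792427193718676,
    "distance-varia": 0.01339440642349274
  }
}
""")


def rank_by_mean_cer(report):
    """Return (engine, mean CER, std dev) tuples, best engine first."""
    rows = []
    for pair, metrics in report.items():
        # Keys have the form "<GT filegroup>,<OCR filegroup>".
        engine = pair.split(",", 1)[1]
        std_dev = math.sqrt(metrics["distance-varia"])
        rows.append((engine, metrics["distance-mean"], std_dev))
    return sorted(rows, key=lambda row: row[1])


for engine, mean, std_dev in rank_by_mean_cer(REPORT):
    print(f"{engine:14s} mean CER {mean:.2%}  std dev {std_dev:.2%}")
```

For this workspace the order is TESS-GT4HIST, CALA-GT4HIST, TESS-DEU. In practice the report would be loaded from the file with `json.load` instead of being embedded.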

data/EVAL-ASV/EVAL-ASV_0003.json contains the metrics for page 3.

Compare Calamari output with GT using dinglehopper

ocrd-dinglehopper -m data/mets.xml -P textequiv_level line -I OCR-D-GT-SEG-LINE,CALA-GT4HIST -O EVAL-DINGLE

The results are HTML files (diff view) and JSON files (with CER and WER).

HTML for page 3:

JSON for page 3:

{
    "gt": "OCR-D-GT-SEG-LINE/OCR-D-GT-SEG-LINE_0003.xml",
    "ocr": "CALA-GT4HIST/CALA-GT4HIST_0003.xml",

    "cer": 0.07770472205618649,
    "wer": 0.1320754716981132,

    "n_characters": 1673,
    "n_words": 265
}
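The rates in this report can be turned back into absolute error counts. The sketch below assumes that `cer` and `wer` are normalized by `n_characters` and `n_words` respectively; the function name `absolute_errors` is hypothetical.

```python
import json

# Values copied from the dinglehopper JSON report for page 3 shown above.
PAGE_REPORT = json.loads("""
{
    "gt": "OCR-D-GT-SEG-LINE/OCR-D-GT-SEG-LINE_0003.xml",
    "ocr": "CALA-GT4HIST/CALA-GT4HIST_0003.xml",
    "cer": 0.07770472205618649,
    "wer": 0.1320754716981132,
    "n_characters": 1673,
    "n_words": 265
}
""")


def absolute_errors(report):
    """Recover approximate error counts from the rates in a report.

    Assumes cer = character errors / n_characters and
    wer = word errors / n_words.
    """
    char_errors = round(report["cer"] * report["n_characters"])
    word_errors = round(report["wer"] * report["n_words"])
    return char_errors, word_errors


char_errors, word_errors = absolute_errors(PAGE_REPORT)
print(f"{char_errors} character errors, {word_errors} word errors")
```

Under that assumption, page 3 has 130 character errors in 1673 characters and 35 word errors in 265 words.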

Visualize with browse-ocrd

Show diff view in browse-ocrd (https://github.com/hnesk/browse-ocrd/tree/diff-view)

Full workflow

ocrd process -m data/mets.xml \
  "olena-binarize -I OCR-D-GT-SEG-LINE -O BIN" \
  "tesserocr-recognize -P segmentation_level word -P textequiv_level line -P find_tables true -P model Fraktur_GT4HistOCR -I BIN -O TESS-GT4HIST" \
  "tesserocr-recognize -P segmentation_level word -P textequiv_level line -P find_tables true -P model deu -I BIN -O TESS-DEU" \
  "calamari-recognize -P checkpoint_dir qurator-gt4histocr-1.0 -I BIN -O CALA-GT4HIST" \
  "cor-asv-ann-evaluate -I OCR-D-GT-SEG-LINE,TESS-GT4HIST,TESS-DEU,CALA-GT4HIST -O EVAL-ASV" \
  "dinglehopper -P textequiv_level line -I OCR-D-GT-SEG-LINE,CALA-GT4HIST -O EVAL-DINGLE"