cneud closed this issue 4 years ago
Preprocessing.md was created in 564a9ee851d56bd9060d36368b9ffd510ebe59df; the content now lives in Provenance.md.
The processing pipeline developed at the Berlin State Library comprises the following steps:
Layout analysis and text line extraction are performed with sbb_textline_detector.
OCR is based on OCR-D's ocrd_tesserocr, which requires Tesseract >= 4.1.0. The GT4HistOCR_2000000 model, trained on the GT4HistOCR corpus, is used. Further details are available in the paper.
A simple Python tool transforms the OCR results from PAGE-XML to TSV.
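The PAGE-XML to TSV step can be sketched as follows. This is a minimal illustration using only the standard library, not the Stabi tool itself; the namespace URL and the two-column layout (`line_id`, `text`) are assumptions.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Assumed PAGE-XML namespace (2019-07-15 schema version).
PAGE_NS = "{http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15}"

def page_to_tsv(page_xml: str) -> str:
    """Emit one TSV row (line id, recognized text) per PAGE TextLine."""
    root = ET.fromstring(page_xml)
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["line_id", "text"])
    for line in root.iter(f"{PAGE_NS}TextLine"):
        unicode_el = line.find(f"{PAGE_NS}TextEquiv/{PAGE_NS}Unicode")
        text = unicode_el.text if unicode_el is not None else ""
        writer.writerow([line.get("id", ""), text or ""])
    return out.getvalue()

# Tiny hand-written PAGE-XML fragment for demonstration.
sample = (
    '<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/'
    'pagecontent/2019-07-15"><Page><TextRegion>'
    '<TextLine id="l1"><TextEquiv><Unicode>Erster Satz.</Unicode>'
    '</TextEquiv></TextLine></TextRegion></Page></PcGts>'
)
print(page_to_tsv(sample))
```

The real tool will of course carry more columns (e.g. coordinates), but the traversal pattern is the same.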
For tokenization, SoMaJo is used.
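To illustrate what this step produces, here is a deliberately simplified stand-in tokenizer: a single regex that separates words, numbers, and punctuation. The actual SoMaJo tokenizer handles abbreviations, camel case, emoticons, and German-specific conventions far more carefully; this sketch only mimics the shape of its output.

```python
import re

# Simplified stand-in for SoMaJo: words/numbers vs. single
# punctuation characters. Not a replacement for the real library.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenize(text: str) -> list[str]:
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)

print(tokenize("Die Königin reiste 1789 nach Berlin."))
```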
For Named Entity Recognition, a BERT-Base model was trained for noisy OCR texts with historical spelling variation. sbb_ner combines unsupervised pre-training on a large (~2.3m pages) corpus of German OCR with supervised training on a small (47k tokens) annotated corpus. Further details are available in the paper.
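A downstream consumer of this step typically has to turn per-token BIO tags into entity spans. The sketch below shows that decoding; the tag names (`B-PER`, `I-PER`, `B-LOC`) are illustrative assumptions, not necessarily the exact label set of sbb_ner.

```python
def bio_to_spans(tokens, tags):
    """Collect (label, text) entity spans from BIO-tagged tokens."""
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity begins; flush any open one first.
            if current:
                spans.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            # Continuation of the current entity.
            current[1].append(token)
        else:
            # "O" tag or inconsistent continuation: close the entity.
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(label, " ".join(toks)) for label, toks in spans]

print(bio_to_spans(
    ["Wilhelm", "von", "Humboldt", "reiste", "nach", "Berlin"],
    ["B-PER", "I-PER", "I-PER", "O", "O", "B-LOC"],
))
```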
Document our processing pipeline: