Find and fix systematic transcription and data issues

stweil commented 2 months ago

While trying to qualify the GT for a training with Tesseract I already noticed a number of issues. See https://github.com/ulb-sachsen-anhalt/ulb-groundtruth-eval-odem-ger/pull/2#issuecomment-2039080148 and https://github.com/stweil/ulb-groundtruth-eval-odem-ger/wiki for some of them.

I compared the GT with results from Tesseract OCR. 20915 of 39823 lines were different (LER = 52.1%). Many of the differences are either OCR errors or transcription errors, but a lot are also caused by different transcription rules. Here are some typical examples (in each one, the first text line is the OCR result and the second is the GT):

# OCR was trained to recognize I or J (depending on the context), but GT mostly uses J for uppercase I/J.
data/ger/GT-PAGE/urn+nbn+de+gbv+3+1-649008-p0275-1_ger.jpg_region0004_line0001.gt.txt
Im Nebenzimmer.
Jm Nebenzimmer.

# GT uses blanks in composed word with ⸗ while OCR only uses ⸗.
data/ger/GT-PAGE/urn+nbn+de+gbv+3+1-706939-p0084-4_ger.jpg_region0008_line0005.gt.txt
Suͤnden⸗Buͤrde anjetzo hier im heiligen
Suͤnden ⸗ Buͤrde anjetzo hier im heiligen

# Different usage of blanks.
data/ger/GT-PAGE/urn+nbn+de+gbv+3+1-707087-p0182-9_ger.jpg_region0016_line0003.gt.txt
Retractus, welchem ſelbiger zuſtehet / allezeit
Retractus, welchem ſelbiger zuſtehet/ allezeit

# Should it be "26. Ich will ſagen: Wo ſind ſie? Ich werde*"?
data/ger/GT-PAGE/urn+nbn+de+gbv+3+1-147967-p0245-7_ger.jpg_region0009_line0056.gt.txt
26. Ich will ſagen: Wo ſind ſie Ich werde“
26. Jch will jagen:Wo ſind ſie?Jch werde*
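
A minimal sketch of how such a line-level comparison could be scripted. The issue does not say how the lines were compared, so exact string equality is an assumption here, and `normalize` is a hypothetical helper that folds only the three rule differences listed above (I/J, blanks around ⸗, blanks before /):

```python
def line_error_rate(ocr_lines, gt_lines):
    """Return the fraction of line pairs whose text differs."""
    if len(ocr_lines) != len(gt_lines):
        raise ValueError("need the same number of OCR and GT lines")
    if not gt_lines:
        return 0.0
    differing = sum(1 for ocr, gt in zip(ocr_lines, gt_lines) if ocr != gt)
    return differing / len(gt_lines)


def normalize(line):
    """Hypothetical normalization that hides pure rule differences."""
    line = line.replace("J", "I")      # GT mostly uses J for uppercase I/J
    line = line.replace(" ⸗ ", "⸗")    # blanks around the double hyphen
    line = line.replace(" /", "/")     # blank before the virgule
    return line


def normalized_line_error_rate(ocr_lines, gt_lines):
    """Line error rate after applying normalize() to both sides."""
    return line_error_rate(
        [normalize(line) for line in ocr_lines],
        [normalize(line) for line in gt_lines],
    )
```

Comparing the raw and the normalized rate gives a rough idea of how many differences come from transcription rules rather than from real OCR or transcription errors.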

OCR for historic German text often produces umlauts with both a diaeresis and a superscript small letter e. The GT is rather good in this respect, but also shows two such cases: häͤt (ä and ͤ) and Rebhüͤner (ü and ͤ):

# OCR is correct, GT is wrong here.
data/ger/GT-PAGE/urn+nbn+de+gbv+3+1-744636-p0123-5_ger.jpg_region0001_line0011.gt.txt
verfertiget hat.
verfertiget hat. häͤt. ⸗
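
Such doubled umlauts can be searched for mechanically. The following is a sketch, assuming the two marks are U+0308 (combining diaeresis, possibly precomposed as in ä) and U+0364 (combining Latin small letter e); `has_double_umlaut` is a hypothetical name:

```python
import unicodedata

DIAERESIS = "\u0308"      # COMBINING DIAERESIS
SMALL_E_ABOVE = "\u0364"  # COMBINING LATIN SMALL LETTER E


def has_double_umlaut(text):
    """Return True if one base letter carries both a diaeresis and a small e."""
    # NFD splits precomposed letters such as U+00E4 into base letter + U+0308.
    decomposed = unicodedata.normalize("NFD", text)
    marks = set()
    for ch in decomposed:
        if unicodedata.combining(ch):
            marks.add(ch)
            if DIAERESIS in marks and SMALL_E_ABOVE in marks:
                return True
        else:
            marks = set()  # a new base character starts a new mark sequence
    return False
```

Running this over every `.gt.txt` line would list candidates like the two mentioned above.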

The GT also contains a small number of wrong transcriptions of the long s (ſ), which was transcribed as a round 's':

data/ger/GT-PAGE/urn+nbn+de+gbv+3+1-743803-p0456-1_ger.jpg_line_1668599220596_631.gt.txt
Ich hatte beſchloſſen, meine Betrachtungen auszuſtreichen; allein es ha⸗
Jch hatte beſchlossen, meine Betrachtungen auszuſtreichen; allein es ha⸗

data/ger/GT-PAGE/urn+nbn+de+gbv+3+1-832298-p0072-5_ger.jpg_line_1663246370270_1218.gt.txt
erkennen, und zuvorderſt um wahre Bekehrung beken. Biſt du aber verſichert
erkennen, und zuvorderſt um wahre Bekehrung beten. Bist du aber verſichert,

data/ger/GT-PAGE/urn+nbn+de+gbv+3+1-514143-p0301-1_ger.jpg_line_1668763752616_2801.gt.txt
ten von jeher mit den Thebanern in Feindſchaft gelebt,
ten von jeher mit den Thebanern in Feindschaft gelebt,

data/ger/GT-PAGE/urn+nbn+de+gbv+3+1-167098-p0314-6_ger.jpg_line_1648115498371_1474.gt.txt
ſung von dem Geſetz zu unterſcheiden, ſo
ſung von dem Geſetz zu unterscheiden, ſo
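
Candidates for this kind of error can be found with a simple heuristic. This is only a sketch under the assumption that a round s directly followed by s, ch or t is suspicious in a line that otherwise uses ſ; it will produce false positives where a round s correctly ends a word component, so every hit needs a human check. The name `suspicious_round_s` is hypothetical:

```python
import re

LONG_S = "\u017f"  # LATIN SMALL LETTER LONG S
SUSPICIOUS = re.compile(r"ss|sch|st")


def suspicious_round_s(line):
    """Return the words of a line in which a round s looks out of place."""
    if LONG_S not in line:
        # The line may follow a modernized transcription; do not flag it.
        return []
    words = (word.strip(".,;:!?") for word in line.split())
    return [word for word in words if SUSPICIOUS.search(word)]
```

For the four GT lines above, this flags "beſchlossen", "Bist", "Feindschaft" and "unterscheiden".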

stweil commented 2 months ago

Another issue is that the image heights of the JPEG files don't match the heights given in the PAGE XML files.
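
A sketch of how this mismatch could be detected: read the `imageWidth` and `imageHeight` attributes of the PAGE `Page` element and compare them with the real pixel size. The function names are hypothetical, and the real size has to come from an image library (for example Pillow's `Image.open(path).size`, assuming Pillow is installed):

```python
import xml.etree.ElementTree as ET


def declared_size(page_xml):
    """Return (width, height) as declared in a PAGE XML string."""
    root = ET.fromstring(page_xml)
    # "{*}" matches any namespace, so PAGE 2013 and 2019 are both accepted
    # (needs Python 3.8 or newer).
    page = root.find("{*}Page")
    if page is None:
        raise ValueError("no Page element found")
    return int(page.get("imageWidth")), int(page.get("imageHeight"))


def size_mismatch(page_xml, actual_size):
    """Return None if the sizes agree, else (declared, actual)."""
    declared = declared_size(page_xml)
    actual = tuple(actual_size)
    return None if declared == actual else (declared, actual)
```

For files on disk, `ET.parse(path).getroot()` can replace `ET.fromstring`.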

stweil commented 2 months ago

Empty Unicode (after pull request #2 was applied):

git grep 'Unicode.>'
data/ger/GT-PAGE/urn+nbn+de+gbv+3+1-227388-p0809-1_ger.gt.xml:        <Unicode/>

Unicode with leading blanks:

git grep '<Unicode> '|head
data/ger/GT-PAGE/urn+nbn+de+gbv+3+1-119243-p0243-3_ger.gt.xml:          <Unicode> und Leben der Engel wiſſen,</Unicode>
data/ger/GT-PAGE/urn+nbn+de+gbv+3+1-147967-p0245-7_ger.gt.xml:            <Unicode> untreue</Unicode>
data/ger/GT-PAGE/urn+nbn+de+gbv+3+1-159647-p0133-7_ger.gt.xml:            <Unicode> Von</Unicode>
[...]

Unicode with trailing blanks:

git grep ' </Unicode>'
data/ger/GT-PAGE/urn+nbn+de+gbv+3+1-136375-p0463-3_ger.gt.xml:          <Unicode>Schul⸗Meiſtere, bey Trauungen derer geſchwaͤngerten Perſonen, </Unicode>
data/ger/GT-PAGE/urn+nbn+de+gbv+3+1-209491-p0105-2_ger.gt.xml:          <Unicode>demſelben </Unicode>
data/ger/GT-PAGE/urn+nbn+de+gbv+3+1-626953-p1616-8_ger.gt.xml:            <Unicode>gewidmeter </Unicode>
data/ger/GT-PAGE/urn+nbn+de+gbv+3+1-645249-p0405-4_ger.gt.xml:            <Unicode>εκεινον </Unicode>
data/ger/GT-PAGE/urn+nbn+de+gbv+3+1-717543-p0245-9_ger.gt.xml:          <Unicode>8, 15. </Unicode>
data/ger/GT-PAGE/urn+nbn+de+gbv+3+1-719184-p0138-4_ger.gt.xml:            <Unicode>Morgen ⸗ </Unicode>
data/ger/GT-PAGE/urn+nbn+de+gbv+3+1-822890-p0724-1_ger.gt.xml:            <Unicode>⅔ </Unicode>
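
The same checks can be done on the parsed XML, which also allows fixing the blanks. A minimal sketch, assuming that stripping all leading and trailing whitespace is the desired fix; `clean_unicode` is a hypothetical helper and only changes the tree in memory:

```python
import xml.etree.ElementTree as ET


def clean_unicode(root):
    """Strip blanks in all Unicode elements; return (stripped, empty) counts."""
    stripped = empty = 0
    # "{*}" matches any PAGE namespace version (needs Python 3.8 or newer).
    for elem in root.iterfind(".//{*}Unicode"):
        text = elem.text or ""
        cleaned = text.strip()
        if cleaned != text:
            elem.text = cleaned
            stripped += 1
        if not cleaned:
            empty += 1  # empty elements are only counted, not repaired
    return stripped, empty
```

Empty elements are reported but left alone, because they need a new transcription rather than an automatic fix.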

stweil commented 2 months ago

The GT contains very short baselines which are not usable for a training with kraken (example with only 3 pixels for the text "die Papiſten wegen ihres vorgegebenen Fegefeuers, ſondern").

Other baselines have weird coordinates and are also unusable, for example <Baseline points="1202,828 1268,827 1202,866 1268,865"/>, where the x values jump back from 1268 to 1202.
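
Both kinds of baseline problems can be detected from the `points` attribute alone. A sketch, assuming horizontal left-to-right text; the 10 pixel threshold and the function names are hypothetical and not taken from kraken:

```python
def parse_points(points):
    """Parse a PAGE points string like '10,20 30,40' into (x, y) tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def baseline_problems(points, min_width=10):
    """Return a list of problem descriptions for one baseline."""
    coords = parse_points(points)
    if len(coords) < 2:
        return ["fewer than 2 points"]
    xs = [x for x, _ in coords]
    problems = []
    if max(xs) - min(xs) < min_width:
        problems.append("too short")
    # For left-to-right text the x values should never decrease.
    if any(b < a for a, b in zip(xs, xs[1:])):
        problems.append("x coordinates go backwards")
    return problems
```

For the baseline quoted above, the sketch reports that the x coordinates go backwards.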

stweil commented 2 months ago

Many image files have metadata which indicates an image resolution of 1x1 which obviously does not make sense.
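
One possible place to look is the JFIF header of the JPEG files. This is a sketch which assumes that the resolution comes from the JFIF APP0 segment directly after the start-of-image marker; the issue does not say which tool reported 1x1, and the value might as well come from Exif data. In JFIF, a units value of 0 means that the two density fields only give a pixel aspect ratio, so units 0 with density 1 and 1 would read as "1x1". The function names are hypothetical:

```python
import struct


def jfif_density(data):
    """Return (units, x_density, y_density) from a JFIF header, or None."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG stream")
    if data[2:4] != b"\xff\xe0" or data[6:11] != b"JFIF\x00":
        return None  # no JFIF APP0 segment right after the SOI marker
    # Byte 13: units (0 = aspect ratio only, 1 = dots per inch,
    # 2 = dots per cm); bytes 14-17: x and y density, big-endian.
    return struct.unpack(">BHH", data[13:18])


def has_plausible_resolution(data):
    """Return True only if the header states a real resolution above 1."""
    density = jfif_density(data)
    if density is None:
        return False
    units, x_density, y_density = density
    return units != 0 and x_density > 1 and y_density > 1
```

Only the first 18 bytes of a file are needed, for example `jfif_density(open(path, "rb").read(18))`.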

M3ssman commented 2 months ago

Thanks for reporting these issues!

The overall data was first corrected using https://github.com/ulb-sachsen-anhalt/transkribus-swt-gui. Then there were some attempts to improve the GT with internal scripts, mostly to detect geometric anomalies like fewer than 4 coordinate points. Finally we used https://github.com/kba/transkribus-to-prima to convert the format to PAGE 2019, which also revealed additional schema errors in Transkribus' PAGE 2013 flavor, the details of which have unfortunately already slipped my mind.

Concerning the transcription irregularities - they shouldn't have happened. The data was reviewed at least two times by human eyes.

I'll try to strip the spaces and prepare a new version tag with both your and my changes included.

M3ssman commented 2 months ago

I'm afraid all these problems (also the carriage carnage) apply to the other GT-repositories as well:

https://github.com/ulb-sachsen-anhalt/ulb-groundtruth-eval-odem-lat
https://github.com/ulb-sachsen-anhalt/ulb-groundtruth-eval-odem-other

So we need to check them too.

M3ssman commented 1 month ago

@stweil Did you use the latest working BagIt release for inspection?

I wonder where the issue with resolution 1x1 arises. I checked the local images used to create the original GT data for German and they all contained reasonable ImageSize, XResolution and YResolution values (300x300 DPI). Probably this information was lost when the image data got stored on our repositories' asset store or when these images got pulled to create the OCR-D-Bag.

ulb-sachsen-anhalt / ulb-groundtruth-eval-odem-ger

Find and fix systematic transcription and data issues #4