Open stweil opened 2 months ago
Another issue is that the image heights of the JPEG files don't match the heights given in the PAGE XML files.
Empty Unicode (after pull request #2 was applied):
git grep 'Unicode.>'
data/ger/GT-PAGE/urn+nbn+de+gbv+3+1-227388-p0809-1_ger.gt.xml: <Unicode/>
Unicode with leading blanks:
git grep '<Unicode> '|head
data/ger/GT-PAGE/urn+nbn+de+gbv+3+1-119243-p0243-3_ger.gt.xml: <Unicode> und Leben der Engel wiſſen,</Unicode>
data/ger/GT-PAGE/urn+nbn+de+gbv+3+1-147967-p0245-7_ger.gt.xml: <Unicode> untreue</Unicode>
data/ger/GT-PAGE/urn+nbn+de+gbv+3+1-159647-p0133-7_ger.gt.xml: <Unicode> Von</Unicode>
[...]
Unicode with trailing blanks:
git grep ' </Unicode>'
data/ger/GT-PAGE/urn+nbn+de+gbv+3+1-136375-p0463-3_ger.gt.xml: <Unicode>Schul⸗Meiſtere, bey Trauungen derer geſchwaͤngerten Perſonen, </Unicode>
data/ger/GT-PAGE/urn+nbn+de+gbv+3+1-209491-p0105-2_ger.gt.xml: <Unicode>demſelben </Unicode>
data/ger/GT-PAGE/urn+nbn+de+gbv+3+1-626953-p1616-8_ger.gt.xml: <Unicode>gewidmeter </Unicode>
data/ger/GT-PAGE/urn+nbn+de+gbv+3+1-645249-p0405-4_ger.gt.xml: <Unicode>εκεινον </Unicode>
data/ger/GT-PAGE/urn+nbn+de+gbv+3+1-717543-p0245-9_ger.gt.xml: <Unicode>8, 15. </Unicode>
data/ger/GT-PAGE/urn+nbn+de+gbv+3+1-719184-p0138-4_ger.gt.xml: <Unicode>Morgen ⸗ </Unicode>
data/ger/GT-PAGE/urn+nbn+de+gbv+3+1-822890-p0724-1_ger.gt.xml: <Unicode>⅔ </Unicode>
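The same three checks can also be run on the parsed XML instead of with grep. The following is only a minimal sketch in Python: the function name `find_unicode_whitespace_issues` is made up for illustration, and it assumes that every element whose local name is `Unicode` should be checked, regardless of the PAGE namespace version.

```python
import xml.etree.ElementTree as ET


def find_unicode_whitespace_issues(page_xml):
    """Return (issue, text) pairs for empty or blank-padded Unicode elements."""
    root = ET.fromstring(page_xml)
    findings = []
    for element in root.iter():
        # Compare the local name only, so PAGE 2013 and PAGE 2019 both match.
        if element.tag.rsplit("}", 1)[-1] != "Unicode":
            continue
        text = element.text or ""
        if text == "":
            findings.append(("empty", text))
            continue
        if text != text.lstrip():
            findings.append(("leading blank", text))
        if text != text.rstrip():
            findings.append(("trailing blank", text))
    return findings
```

Reading a file with `open(path, "rb").read()` and passing the bytes to the function should work as well, since `ET.fromstring` accepts both bytes and text.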
The GT contains very short baselines which are not usable for a training with kraken (example with only 3 pixels for the text "die Papiſten wegen ihres vorgegebenen Fegefeuers, ſondern").
Other baselines show weird coordinates, for example <Baseline points="1202,828 1268,827 1202,866 1268,865"/>, which are also unusable.
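Both kinds of baseline problems can be flagged automatically. This is a rough sketch, not the check used by kraken: the function names and the 10 pixel threshold are assumptions, and it assumes left-to-right text, so x coordinates should never decrease along a baseline.

```python
import math


def parse_points(points):
    """Turn a PAGE points attribute like '10,20 30,40' into [(10, 20), (30, 40)]."""
    return [tuple(int(value) for value in pair.split(",")) for pair in points.split()]


def baseline_problems(points, min_length=10.0):
    """Return a list of problems found in one baseline (empty if none)."""
    coords = parse_points(points)
    problems = []
    if len(coords) < 2:
        return ["fewer than two points"]
    # Sum of the segment lengths along the polyline.
    length = sum(math.dist(a, b) for a, b in zip(coords, coords[1:]))
    if length < min_length:
        problems.append("too short")
    xs = [x for x, _ in coords]
    if any(later < earlier for earlier, later in zip(xs, xs[1:])):
        problems.append("x coordinates go backwards")
    return problems
```

For the example above, `baseline_problems("1202,828 1268,827 1202,866 1268,865")` reports that the x coordinates go backwards, because the third point jumps back from 1268 to 1202.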
Many image files have metadata which indicates an image resolution of 1x1 which obviously does not make sense.
Thanks for reporting these issues!
The overall data was first corrected using https://github.com/ulb-sachsen-anhalt/transkribus-swt-gui. Then we tried to improve the GT with some internal scripts, mostly to detect geometric anomalies such as fewer than 4 coordinate points. Finally we used https://github.com/kba/transkribus-to-prima to convert the format to PAGE 2019, which also revealed additional schema errors in Transkribus' PAGE 2013 flavor; unfortunately I no longer remember the details.
Concerning the transcription irregularities - they shouldn't have happened. The data was reviewed at least two times by human eyes.
I'll try to strip the spaces and prepare a new version tag with both your and my changes included.
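Stripping the blanks could look roughly like this. It is only a sketch under assumptions: the function name `strip_unicode_blanks` is hypothetical, it works on the raw text with a regular expression so that the rest of each file stays unchanged, and it only handles `<Unicode>` elements without a namespace prefix whose content sits on a single line.

```python
import re

# Leading and trailing spaces or tabs directly inside <Unicode>...</Unicode>.
UNICODE_BLANKS = re.compile(r"<Unicode>[ \t]*(.*?)[ \t]*</Unicode>")


def strip_unicode_blanks(xml_text):
    """Remove leading and trailing blanks inside single-line Unicode elements."""
    return UNICODE_BLANKS.sub(r"<Unicode>\1</Unicode>", xml_text)
```

Blanks inside the text are kept, and self-closing `<Unicode/>` elements are not touched.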
I'm afraid all these problems (also the carriage carnage) apply to the other GT-repositories as well:
@stweil Did you use the latest working BagIt release for inspection?
I ask because I wonder where the resolution of 1x1 comes from. I checked the local images used to create the original GT data for German, and they all contained reasonable ImageSize, XResolution and YResolution values (300x300 DPI). This information was probably lost when the image data was stored in our repository's asset store or when the images were pulled to create the OCR-D-Bag.
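One possible explanation, offered as a guess rather than a diagnosis: in a JFIF header, a density unit of 0 means that the two density values only describe the pixel aspect ratio, so 1x1 with unit 0 carries no absolute resolution at all. A minimal sketch that reads these fields with the Python standard library (the function name `jfif_density` is made up for illustration, and files that only carry EXIF metadata are not handled):

```python
import struct


def jfif_density(data):
    """Return (units, x_density, y_density) from a JFIF APP0 segment, or None.

    units: 0 = aspect ratio only, 1 = dots per inch, 2 = dots per cm.
    """
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG stream")
    pos = 2
    while pos + 4 <= len(data) and data[pos] == 0xFF:
        marker = data[pos + 1]
        if marker == 0xDA:  # start of scan: no more header segments
            break
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        segment = data[pos + 4:pos + 2 + length]
        if marker == 0xE0 and segment[:5] == b"JFIF\x00" and len(segment) >= 12:
            # Identifier (5 bytes) and version (2 bytes) precede the density fields.
            return struct.unpack(">BHH", segment[7:12])
        pos += 2 + length
    return None
```

Calling it as `jfif_density(open(path, "rb").read())` on an affected image and on one of the local originals would show whether the density fields differ.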
While trying to qualify the GT for a training with Tesseract I already noticed a number of issues. See https://github.com/ulb-sachsen-anhalt/ulb-groundtruth-eval-odem-ger/pull/2#issuecomment-2039080148 and https://github.com/stweil/ulb-groundtruth-eval-odem-ger/wiki for some of them.
I compared the GT with results from Tesseract OCR. 20915 of 39823 lines were different (LER = 52.5%). Many of the differences are either OCR errors or transcription errors, but a lot are also caused by different transcription rules. Here are some typical examples:
OCR for historic German text often creates umlauts with both diaeresis and small letter e. The GT is rather good, but also shows two such cases: häͤt (ä and ͤ) and Rebhüͤner (ü and ͤ):
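Such doubled umlauts can be searched for mechanically. The sketch below is one way to do it, with made-up function names: it decomposes each word (NFD) and reports words in which a, o or u carries both a combining diaeresis (U+0308) and a combining small letter e (U+0364).

```python
import re
import unicodedata

DIAERESIS = "\u0308"         # combining diaeresis
SMALL_E_ABOVE = "\u0364"     # combining Latin small letter e

# A vowel followed by one or more combining marks (after NFD decomposition).
VOWEL_WITH_MARKS = re.compile(r"[aouAOU]([\u0300-\u036f]+)")


def has_doubled_umlaut(word):
    """True if a vowel carries both a diaeresis and a small e above."""
    decomposed = unicodedata.normalize("NFD", word)
    return any(
        DIAERESIS in match.group(1) and SMALL_E_ABOVE in match.group(1)
        for match in VOWEL_WITH_MARKS.finditer(decomposed)
    )


def find_doubled_umlauts(line):
    """Return the whitespace-separated words of a line that have a doubled umlaut."""
    return [word for word in line.split() if has_doubled_umlaut(word)]
```

Words with only one of the two marks, such as plain ä or aͤ, are not reported.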
The GT also contains only a small number of wrong transcriptions for the long s (which was transcribed as 's'):