With our latest model file and current Tesseract the median CER is reduced to 9 %. The execution time is now 635 s, which is much more than twice as fast as the old 4.0 results, with much better accuracy.
The bad news is that the latest Tesseract does not detect any text on two of the 189 pages. That requires a closer examination.
Version 5.0.0-alpha-20201224 was still fine.
@stweil
The bad news is that the latest Tesseract does not detect any text on two of the 189 pages.
What do these pages look like?
I guess 470875348_0010 is a title page, as page 0012 is the preface. Then it's the "title page problem": maybe large letters in a special design, extreme letterspacing, etc. Results get better if I cut title pages into line images and the font style is similar to something that was trained.
Title pages seem to be underrepresented in training sets. It's a sort of selection bias.
What do these pages look like?
See original JPEG files 470875348_0010 and 452117542_0250.
What do these pages look like?
See original JPEG files 470875348_0010 and 452117542_0250.
Tried it with
$ tesseract --version
tesseract 5.0.0-alpha-773-gd33ed
leptonica-1.79.0
$ tesseract 452117542_0250.jpg 452117542_0250.GT4 -l GT4Hist2M \
-c tessedit_write_images=true --tessdata-dir /usr/local/share/tessdata makebox hocr txt
$ tesseract 452117542_0250.jpg 452117542_0250.frak -l ubma/frak2021_0.905_1587027_9141630 \
-c tessedit_write_images=true --tessdata-dir /usr/local/share/tessdata makebox hocr txt
and get for GT4Hist2M
lines words chars
items ocr: 58 204 1047 matches + inserts + substitutions
items grt: 56 198 1039 matches + deletions + substitutions
matches: 36 165 1000 matches
edits: 22 39 55 inserts + deletions + substitutions
subss: 20 33 31 substitutions
inserts: 2 6 16 inserts
deletions: 0 0 8 deletions
precision: 0.6207 0.8088 0.9551 matches / (matches + substitutions + inserts)
recall: 0.6429 0.8333 0.9625 matches / (matches + substitutions + deletions)
accuracy: 0.6207 0.8088 0.9479 matches / (matches + substitutions + inserts + deletions)
f-score: 0.6316 0.8209 0.9588 ( 2 * recall * precision ) / ( recall + precision )
error: 0.3929 0.1970 0.0529 ( inserts + deletions + substitutions ) / (items grt )
and for ubma/frak2021_0.905_1587027_9141630
lines words chars
items ocr: 58 202 1052 matches + inserts + substitutions
items grt: 56 198 1039 matches + deletions + substitutions
matches: 39 173 1014 matches
edits: 19 29 43 inserts + deletions + substitutions
subss: 17 25 20 substitutions
inserts: 2 4 18 inserts
deletions: 0 0 5 deletions
precision: 0.6724 0.8564 0.9639 matches / (matches + substitutions + inserts)
recall: 0.6964 0.8737 0.9759 matches / (matches + substitutions + deletions)
accuracy: 0.6724 0.8564 0.9593 matches / (matches + substitutions + inserts + deletions)
f-score: 0.6842 0.8650 0.9699 ( 2 * recall * precision ) / ( recall + precision )
error: 0.3393 0.1465 0.0414 ( inserts + deletions + substitutions ) / (items grt )
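For reference, here is a minimal Python sketch that recomputes the character-level figures of the frak2021 run from the raw edit counts, using the formulas shown in the rightmost column. It is only an illustration, not the evaluation tool that produced the tables above:
# Recompute the character-level scores of the frak2021 run above.
# The four counts are taken directly from the "chars" column.
matches, substitutions, inserts, deletions = 1014, 20, 18, 5
items_grt = matches + deletions + substitutions   # 1039 ground truth characters

precision = matches / (matches + substitutions + inserts)              # 0.9639
recall    = matches / (matches + substitutions + deletions)            # 0.9759
accuracy  = matches / (matches + substitutions + inserts + deletions)  # 0.9593
f_score   = 2 * recall * precision / (recall + precision)              # 0.9699
error     = (inserts + deletions + substitutions) / items_grt          # 0.0414

print(f"precision {precision:.4f}  recall {recall:.4f}  accuracy {accuracy:.4f}  "
      f"f-score {f_score:.4f}  error {error:.4f}")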
Page 470875348_0010.jpg also looks fine, but I didn't spend time on a GT.txt.
Preprocessing issue?
The remaining CER of ~4 % is still high given the good image quality and a very common typeface (Garamond-like).
The regression (empty page) was introduced by commit 5db92b26aa4cab45f3da6714328c2fcd80891441, in particular by the modified declaration of PartSetVector.
Our GT data is available online: https://digi.bib.uni-mannheim.de/fileadmin/digi/452117542/gt/ and https://digi.bib.uni-mannheim.de/fileadmin/digi/470875348/gt/
Tesseract output includes long s and some other characters which must be normalized before the comparison with that ground truth.
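For illustration, a minimal normalization sketch in Python. The replacement table is an assumption and covers only long s and a few common ligatures, not necessarily the full set used for the comparison:
# Fold historic glyphs in the OCR output to their modern equivalents
# before comparing with the ground truth. The table below is only an
# assumed example; extend it as needed for the actual data.
REPLACEMENTS = {
    "\u017f": "s",   # long s
    "\ua75b": "r",   # r rotunda
    "\ufb01": "fi",  # fi ligature
    "\ufb02": "fl",  # fl ligature
    "\ufb05": "st",  # long s + t ligature
}

def normalize(text):
    for old, new in REPLACEMENTS.items():
        text = text.replace(old, new)
    return text

# Example: normalize the text file produced by the second command above.
with open("452117542_0250.frak.txt", encoding="utf-8") as f:
    print(normalize(f.read()))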
The remaining CER of ~4 % is still high given the good image quality and a very common typeface (Garamond-like).
That's not surprising, because our training data had no focus on such typefaces. It used GT4HistOCR (a mix from early prints up to the 19th century), Austrian newspapers (mainly Fraktur), and German primers from the 19th century (Fraktur and handwritten script).
Another and maybe even more important reason is the current quality of the ground truth texts. They were produced by a single person (no second proofreader), and I just noticed that they also include comments added by that person. In addition, there are numerous violations of our transcription guidelines, for example blanks before commas and similar issues. So ground truth errors also contribute to the CER.
Ray wants to get rid of the legacy OCR engine, so that the final 4.00 version will only have one OCR engine based on LSTM.
From #518:
@stweil commented:
@theraysmith commented: