tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

RFC: Remove the legacy OCR Engine #707

Closed by amitdo 6 years ago

amitdo commented 7 years ago

Ray wants to get rid of the legacy OCR engine, so that the final 4.00 version will only have one OCR engine based on LSTM.

From #518:

@stweil commented:

I strongly vote against removing non-LSTM as we currently still get better results with it in some cases.

@theraysmith commented:

Please provide examples of where you get better results with the old engine. Right now I'm trying to work on getting rid of redundant code, rather than spending time fighting needless changes that generate a lot of work. I have recently tested an LSTM-based OSD, and it works a lot better than the old, so that is one more use of the old classifier that can go. AFAICT, apart from the equation detector, the old classifier is now redundant.

stweil commented 3 years ago

With our latest model file and current Tesseract the median CER is reduced to 9 %. The execution time is now 635 s, which is more than twice as fast as the old 4.0 results, with much better accuracy.

The bad news is that latest Tesseract does not detect any text on two of the 189 pages. That requires a closer examination.

5.0.0-alpha-20201224 still was fine.

wollmers commented 3 years ago

@stweil

The bad news is that latest Tesseract does not detect any text on two of the 189 pages.

What do these pages look like?

I guess 470875348_0010.txt is a title page, since page 0012 is the preface. Then it's the "title page problem": large letters in a special design, extreme letterspacing, etc. Results get better if I cut title pages into line images and the font style is similar to something in the training data.

Title pages seem to be underrepresented in training sets. It's a kind of selection bias.

stweil commented 3 years ago

What do these pages look like?

See original JPEG files 470875348_0010 and 452117542_0250.

wollmers commented 3 years ago

What do these pages look like?

See original JPEG files 470875348_0010 and 452117542_0250.

Tried it with

$ tesseract --version
tesseract 5.0.0-alpha-773-gd33ed
 leptonica-1.79.0

$ tesseract 452117542_0250.jpg 452117542_0250.GT4 -l GT4Hist2M \
    -c tessedit_write_images=true --tessdata-dir /usr/local/share/tessdata makebox hocr txt

$ tesseract 452117542_0250.jpg 452117542_0250.frak -l ubma/frak2021_0.905_1587027_9141630 \
    -c tessedit_write_images=true --tessdata-dir /usr/local/share/tessdata makebox hocr txt

and get for GT4Hist2M

              lines   words   chars
items ocr:       58     204    1047 matches + inserts + substitutions
items grt:       56     198    1039 matches + deletions + substitutions
matches:         36     165    1000 matches
edits:           22      39      55 inserts + deletions + substitutions
 subss:          20      33      31 substitutions
 inserts:         2       6      16 inserts
 deletions:       0       0       8 deletions
precision:   0.6207  0.8088  0.9551 matches / (matches + substitutions + inserts)
recall:      0.6429  0.8333  0.9625 matches / (matches + substitutions + deletions)
accuracy:    0.6207  0.8088  0.9479 matches / (matches + substitutions + inserts + deletions)
f-score:     0.6316  0.8209  0.9588 ( 2 * recall * precision ) / ( recall + precision )
error:       0.3929  0.1970  0.0529 ( inserts + deletions + substitutions ) / (items grt )

and for ubma/frak2021_0.905_1587027_9141630

              lines   words   chars
items ocr:       58     202    1052 matches + inserts + substitutions
items grt:       56     198    1039 matches + deletions + substitutions
matches:         39     173    1014 matches
edits:           19      29      43 inserts + deletions + substitutions
 subss:          17      25      20 substitutions
 inserts:         2       4      18 inserts
 deletions:       0       0       5 deletions
precision:   0.6724  0.8564  0.9639 matches / (matches + substitutions + inserts)
recall:      0.6964  0.8737  0.9759 matches / (matches + substitutions + deletions)
accuracy:    0.6724  0.8564  0.9593 matches / (matches + substitutions + inserts + deletions)
f-score:     0.6842  0.8650  0.9699 ( 2 * recall * precision ) / ( recall + precision )
error:       0.3393  0.1465  0.0414 ( inserts + deletions + substitutions ) / (items grt )
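The formulas in the right-hand column of the tables can be checked with a small script. A minimal sketch (the function name is illustrative and not part of any Tesseract tooling; the counts are the character-level numbers from the GT4Hist2M run above):

```python
# Recompute the evaluation metrics from raw alignment counts, using the
# formula definitions given in the tables:
#   precision = matches / (matches + substitutions + inserts)
#   recall    = matches / (matches + substitutions + deletions)
#   accuracy  = matches / (matches + substitutions + inserts + deletions)
#   f-score   = 2 * precision * recall / (precision + recall)
#   error     = (inserts + deletions + substitutions) / items_grt

def metrics(matches, subs, inserts, deletions):
    items_grt = matches + deletions + subs
    precision = matches / (matches + subs + inserts)
    recall = matches / (matches + subs + deletions)
    accuracy = matches / (matches + subs + inserts + deletions)
    f_score = 2 * precision * recall / (precision + recall)
    error = (inserts + deletions + subs) / items_grt
    return precision, recall, accuracy, f_score, error

# Character-level counts from the GT4Hist2M run above
p, r, a, f, e = metrics(matches=1000, subs=31, inserts=16, deletions=8)
print(f"precision {p:.4f}  recall {r:.4f}  accuracy {a:.4f}  "
      f"f-score {f:.4f}  error {e:.4f}")
# -> precision 0.9551  recall 0.9625  accuracy 0.9479  f-score 0.9588  error 0.0529
```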

Page 470875348_0010.jpg also looks fine, but I didn't spend time creating a GT.txt for it.

Preprocessing issue?

The remaining CER of ~4 % is still high given the good image quality and a very common typeface (Garamond-like).

stweil commented 3 years ago

The regression (empty page) was introduced by commit 5db92b26aa4cab45f3da6714328c2fcd80891441 and especially the modified declaration for PartSetVector.

Our GT data is available online: https://digi.bib.uni-mannheim.de/fileadmin/digi/452117542/gt/ https://digi.bib.uni-mannheim.de/fileadmin/digi/470875348/gt/

Tesseract output includes long s and some other characters which must be normalized before the comparison with that ground truth.
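Such a normalization step can be sketched as follows. This is only an illustration: the replacement table below covers long s plus a couple of example historical glyphs, and the actual rules used for these GT sets may differ.

```python
import unicodedata

# Map historical glyphs that Tesseract may emit to their modern equivalents
# before comparing against the ground truth. The table is illustrative;
# extend it to match the transcription guidelines of the GT set.
REPLACEMENTS = str.maketrans({
    "\u017f": "s",   # long s (ſ)
    "\ua75b": "r",   # r rotunda (ꝛ)
    "\u00e6": "ae",  # ligature æ
})

def normalize(text: str) -> str:
    # Compose to NFC first, then apply the character replacements.
    return unicodedata.normalize("NFC", text).translate(REPLACEMENTS)

print(normalize("Ge\u017fchichte"))  # -> Geschichte
```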

stweil commented 3 years ago

The remaining CER of ~4 % is still high given the good image quality and a very common typeface (Garamond-like).

That's not surprising because our training data had no focus on such typefaces. It used GT4HistOCR (a mix from early prints up to the 19th century), Austrian newspapers (mainly Fraktur), and German primers from the 19th century (Fraktur and handwritten script).

Another, maybe even more dominant, reason is the current quality of the ground truth texts. They were produced by a single person (no second proofreader), and I just noticed that they also include comments added by that person. In addition, they contain numerous violations of our transcription guidelines, for example blanks before commas and similar issues. So ground truth errors contribute to the CER.