Closed AdamWill closed 1 year ago
For the record, I couldn't use the old engine because the Fedora package uses the tessdata_fast data which doesn't work with the old engine. If I put in the tessdata data instead, I can use the old engine, and indeed it parses the text correctly.
It seems that 'Vite [ote Cols' is the result you get with tessdata_fast. With tessdata (and the new engine rather than the old), I get '(VEELRY' instead. That's...not better. :) With tessdata_best, the result is correct: "Video Mode".
Still, it seems that tessdata_fast is expected to work - its README says "Most users will want to use these traineddata files to do OCR and these will be shipped as part of Linux distributions eg. Ubuntu 18.04." - so this still seems to be a bug.
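In case it helps anyone comparing the model sets, here is a minimal sketch for putting a rough number on how far each result is from the expected text. It uses Python's standard `difflib`; the `similarity` helper and the `results` dictionary are my own illustration (the strings are the three results quoted above), not anything Tesseract provides.

```python
from difflib import SequenceMatcher


def similarity(expected: str, actual: str) -> float:
    """Return a 0..1 similarity ratio between expected and OCRed text."""
    # Collapse whitespace runs so that line-wrapping differences
    # are not counted as recognition errors.
    norm_expected = " ".join(expected.split())
    norm_actual = " ".join(actual.split())
    return SequenceMatcher(None, norm_expected, norm_actual).ratio()


expected = "Video Mode"

# OCR results quoted in this thread, keyed by the model set that produced them.
results = {
    "tessdata_fast": "Vite [ote Cols",
    "tessdata": "(VEELRY",
    "tessdata_best": "Video Mode",
}

for model, text in results.items():
    print(f"{model}: {similarity(expected, text):.2f}")
```

A ratio of 1.0 means an exact match after whitespace normalization; anything lower is only a rough indication, since `difflib` is a generic string matcher and not an OCR-specific metric such as character error rate.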
I've noticed similar results, below.
The one thing I have to add is that if I use the get.images config with fast, then process the resulting tessinput.tif with fast, I get a better result than best (with an extra "Page 1" line):
$ tesseract --oem 1 --tessdata-dir /usr/local/share/tessdata/fast tessinput.tif stdout
Page 1
RED CAPS
WHITE CAPS
WHITE CAPS
Red light sans italic
Light white. Light white. Light white. Light white. Light white Light white. Light
white. Light white Light white. Light white Light white Light white. Light white
Bold white: light write
Bold white: light write
Bold white: light write
www.boldserif.com
$ tesseract --oem 1 --tessdata-dir /usr/local/share/tessdata/best lightdark.png stdout
=D
WHITE CAPS
WHITE CAPS
Red light sans italic
Light white. Light white Light white Light white Light white Light white Light
white Light white Light white Light white Light white Light white Light white
Bold white: light white
Bold white: light white
Bold white: light white
www.boldserif.com
$ tesseract --oem 1 --tessdata-dir /usr/local/share/tessdata/fast lightdark.png stdout
RED CAPS
WHITE CAPS
WHITE CAPS
(cre OAT @Xel AMC G
Mea NTs Lae a Me ALC ACC
aca Me aie Mi MAUS Mee a a Me UC One nan Me OT
De eT Se acd
De TT Sate aed
ett RTT sed
DADA GRow lay
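For anyone wanting to repeat the comparison above, here is a rough Python sketch that builds the same command lines for different model directories. It assumes the directory layout from my commands (`/usr/local/share/tessdata/<model>`) and that the `get.images` config can be found from the chosen tessdata directory; `ocr_command` and `run_ocr` are hypothetical helper names, and the config is passed after the output base, which is my assumption about the usual argument order.

```python
import subprocess
from pathlib import Path

# Assumed layout, taken from the commands in this comment.
TESSDATA_ROOT = Path("/usr/local/share/tessdata")


def ocr_command(image, model, oem=1, configs=()):
    """Build the tesseract argument list for one model directory."""
    return [
        "tesseract",
        "--oem", str(oem),
        "--tessdata-dir", str(TESSDATA_ROOT / model),
        str(image),
        "stdout",      # write recognized text to standard output
        *configs,      # optional config names, e.g. "get.images"
    ]


def run_ocr(image, model, oem=1, configs=(), cwd=None):
    """Run tesseract and return its standard output as text."""
    proc = subprocess.run(
        ocr_command(image, model, oem=oem, configs=configs),
        capture_output=True,
        text=True,
        check=True,    # raise CalledProcessError if tesseract fails
        cwd=cwd,
    )
    return proc.stdout
```

The two-step run would then be something like `run_ocr("lightdark.png", "fast", configs=("get.images",))` followed by `run_ocr("tessinput.tif", "fast")`, and the legacy run would be `run_ocr("lightdark.png", "legacy", oem=0)`.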
Legacy is slightly better than fast:
$ tesseract --oem 0 --tessdata-dir /usr/local/share/tessdata/legacy lightdark.png stdout
RED CAPS
WHITE CAPS
WHITE CAPS
Red fight sans Mia/ff
um wmm um wmc um wmc um wmm um wmm um wmm um
wmc um wmc um wmc um wmm um wmm um wmm um wmm
Bold white hgrt wmc
Bold white hgrt wmc
Buld white hgrt wmc
wwwbolrlserifmom
With 5.3.0, the result with fast is on a par with best:
$ tesseract --oem 1 --tessdata-dir /usr/local/share/tessdata/fast lightdark.png stdout
RED CAPS
WHITE CAPS
WHITE CAPS
Red light sans italic
Light white. Light white.Light white Light white Light white Light white Light
white.Light white. Light white. Light white Light white Light white Light white
Bold white: light white
Bold white: light white
Bold white: light white
www.boldserif.com
That sounds promising. I'll have to retry the openQA tests we disabled for this reason...
Wahay, indeed, with 5.3.1 the openQA tests pass again, and manually OCRing the test image above gives the right text. Thanks a lot!
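If a test suite needs to decide whether to re-enable OCR checks like these, one option is to gate them on the installed version. Below is a minimal sketch that parses the first line of `tesseract --version` output. The 5.3.0 threshold is only what this thread reports for the fast models, and `parse_tesseract_version` and `at_least_5_3_0` are hypothetical names for illustration.

```python
import re

# First release reported in this thread as giving good results with fast.
# This is an assumption drawn from the thread, not an official statement.
THRESHOLD = (5, 3, 0)


def parse_tesseract_version(banner: str):
    """Extract (major, minor, patch) from a 'tesseract X.Y.Z' banner."""
    match = re.search(r"tesseract\s+v?(\d+)\.(\d+)\.(\d+)", banner)
    if match is None:
        raise ValueError(f"no version found in: {banner!r}")
    return tuple(int(part) for part in match.groups())


def at_least_5_3_0(banner: str) -> bool:
    """Return True if the banner reports version 5.3.0 or later."""
    return parse_tesseract_version(banner) >= THRESHOLD
```

The banner could come from `subprocess.run(["tesseract", "--version"], capture_output=True, text=True)`. Some older releases may print it to stderr instead of stdout, so checking both streams is probably safest.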
Environment
Current Behavior:
Text from attached image ('Video Mode', from a SUSE bootloader screen) is OCRed as 'Vite [ote Cols' by the new default engine in Tesseract 4.0.0 (Fedora Rawhide packaged version). Tesseract 3.05.02 recognizes it correctly. I assume the old engine in 4.0.0 would get it right too, but can't prove it as the Fedora package seems to have some sort of bug preventing it from working.
Expected Behavior:
Obviously, the text should be recognized accurately.
Suggested Fix:
I'm not an OCR developer, I'm afraid :)