[accuracy] 4.0.0 sees white text "Video Mode" on dark grey background as "Vite [ote Cols"

AdamWill commented 5 years ago

Environment

Tesseract Version: 4.0.0
Commit Number: N/A
Platform: Fedora Rawhide

Current Behavior:

Text from attached image ('Video Mode', from a SUSE bootloader screen) is OCRed as 'Vite [ote Cols' by the new default engine in Tesseract 4.0.0 (Fedora Rawhide packaged version). Tesseract 3.05.02 recognizes it correctly. I assume the old engine in 4.0.0 would get it right too, but can't prove it as the Fedora package seems to have some sort of bug preventing it from working.

Expected Behavior:

Obviously, the text should be recognized accurately.

Suggested Fix:

I'm not an OCR developer, I'm afraid :)

AdamWill commented 5 years ago

[Attached image: ocr]

AdamWill commented 5 years ago

For the record, I couldn't use the old engine because the Fedora package ships the tessdata_fast data, which doesn't work with the old engine. If I put in the tessdata data instead, I can use the old engine, and indeed it parses the text correctly.

AdamWill commented 5 years ago

It seems that 'Vite [ote Cols' is the result you get with tessdata_fast. With tessdata (and the new engine rather than the old), I get '(VEELRY' instead. That's...not better. :) With tessdata_best, the result is correct: "Video Mode".

Still, it seems that tessdata_fast is expected to work - its README says "Most users will want to use these traineddata files to do OCR and these will be shipped as part of Linux distributions eg. Ubuntu 18.04." - so this still seems to be a bug.
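To make the comparison above repeatable, here is a minimal sketch that runs the new (LSTM) engine over one image with several traineddata sets in turn. The function name compare_models, the TESSERACT override variable, and the directory layout (one subdirectory per data set under a common base) are assumptions made for illustration; only the --oem and --tessdata-dir options come from the tesseract command line itself.

```shell
#!/bin/sh
# Hypothetical helper, not part of Tesseract.
# TESSERACT can be overridden to point at a specific binary.
TESSERACT="${TESSERACT:-tesseract}"

# compare_models IMAGE BASE_DIR MODEL...
# Assumes each MODEL is a subdirectory of BASE_DIR holding eng.traineddata.
compare_models() {
  image="$1"
  base="$2"
  shift 2
  for model in "$@"; do
    printf '== %s ==\n' "$model"
    # --oem 1 selects the LSTM engine; recognized text goes to stdout.
    "$TESSERACT" --oem 1 --tessdata-dir "$base/$model" "$image" stdout \
      || printf '(tesseract failed for %s)\n' "$model"
  done
}
```

For example, `compare_models ocr.png /usr/local/share/tessdata fast best` would print one labelled result per data set, assuming those directories exist on your system.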

cjmayo commented 5 years ago

I've noticed similar results, below.

The one thing I have to add is that if I run with the get.images config and the fast data, and then process the resulting tessinput.tif with fast, I get a better result than with best (apart from an extra "Page 1" line):

$ tesseract --oem 1 --tessdata-dir /usr/local/share/tessdata/fast tessinput.tif stdout
Page 1
RED CAPS
WHITE CAPS
WHITE CAPS

Red light sans italic

Light white. Light white. Light white. Light white. Light white Light white. Light
white. Light white Light white. Light white Light white Light white. Light white

Bold white: light write

Bold white: light write
Bold white: light write

www.boldserif.com

$ tesseract --oem 1 --tessdata-dir /usr/local/share/tessdata/best lightdark.png stdout
=D
WHITE CAPS
WHITE CAPS

Red light sans italic

Light white. Light white Light white Light white Light white Light white Light
white Light white Light white Light white Light white Light white Light white

Bold white: light white

Bold white: light white
Bold white: light white

www.boldserif.com

$ tesseract --oem 1 --tessdata-dir /usr/local/share/tessdata/fast lightdark.png stdout
RED CAPS
WHITE CAPS
WHITE CAPS

(cre OAT @Xel AMC G

Mea NTs Lae a Me ALC ACC
aca Me aie Mi MAUS Mee a a Me UC One nan Me OT

De eT Se acd

De TT Sate aed
ett RTT sed

DADA GRow lay

legacy is slightly better than fast:

$ tesseract --oem 0 --tessdata-dir /usr/local/share/tessdata/legacy lightdark.png stdout
RED CAPS
WHITE CAPS
WHITE CAPS

Red ﬁght sans Mia/ff

um wmm um wmc um wmc um wmm um wmm um wmm um
wmc um wmc um wmc um wmm um wmm um wmm um wmm

Bold white hgrt wmc

Bold white hgrt wmc
Buld white hgrt wmc

wwwbolrlserifmom

[Attached image: lightdark]
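The two-pass approach described at the top of this comment can be sketched as a small shell function. This is only an illustration: the name ocr_via_tessinput and the TESSERACT override variable are made up here, and it assumes that the get.images config writes tessinput.tif into the current working directory, which is what the run above suggests.

```shell
#!/bin/sh
# Hypothetical helper, not part of Tesseract.
TESSERACT="${TESSERACT:-tesseract}"

# ocr_via_tessinput IMAGE TESSDATA_DIR
ocr_via_tessinput() {
  image="$1"
  datadir="$2"
  # Pass 1: run with the get.images config so the preprocessed page is
  # dumped as tessinput.tif (assumed: in the current directory).
  # The text from this pass is discarded.
  "$TESSERACT" --oem 1 --tessdata-dir "$datadir" "$image" stdout get.images \
    >/dev/null || return 1
  # Pass 2: OCR the dumped image with the same data set.
  "$TESSERACT" --oem 1 --tessdata-dir "$datadir" tessinput.tif stdout
}
```

A call such as `ocr_via_tessinput lightdark.png /usr/local/share/tessdata/fast` would then print the second-pass text. Note that it overwrites any existing tessinput.tif in the current directory.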

cjmayo commented 1 year ago

With 5.3.0, the result with fast is on a par with best:

$ tesseract --oem 1 --tessdata-dir /usr/local/share/tessdata/fast lightdark.png stdout
RED CAPS
WHITE CAPS
WHITE CAPS

Red light sans italic

Light white. Light white.Light white Light white Light white Light white Light
white.Light white. Light white. Light white Light white Light white Light white

Bold white: light white

Bold white: light white
Bold white: light white

www.boldserif.com

AdamWill commented 1 year ago

that sounds promising, I'll have to retry the openQA tests we disabled for this reason...

AdamWill commented 1 year ago

Wahay, indeed, with 5.3.1 the openQA tests pass again, and manually OCRing the test image above gives the right text. Thanks a lot!
