tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)

https://tesseract-ocr.github.io/

Apache License 2.0

62.46k stars 9.53k forks source link

Individual character analysis is rotated #253

Closed matthill closed 8 years ago

matthill commented 8 years ago

I put together a simple test program that demonstrates the issue: https://github.com/matthill/tesseract_z_issue/tree/master

Given an input image that is a "N", Tesseract seems to rotate this single character to produce high confidence for "Z" and "2" characters. Only the "N" and "M" characters would be expected.

tesseract_z_issue

Z Character

Input training data is here: tif box

After running the program, the output is:

[mhill@mhill-linux z]$ ./tesseract_z_test 
Z : 95.4505
symbol Z, conf: 95.450462 font: netherlands (index 1) size 53px - Z conf: 95.450462
            - N conf: 91.517166
            - 2 conf: 86.062859
            - M conf: 81.259239
---------------------------------------------

Z and 2 characters are not expected, it makes me wonder if the character is rotated when analyzed.

zdenop commented 8 years ago

Problem is that you did not followed instruction so you are alone with your problem.

Also it looks like your font seem to be common. In such case training is useless (from community experience) - nobody was able to reach quality provided by google traineddata.

matthill commented 8 years ago

Are you sure that you're not closing a real issue here? I assume that Tesseract is trying to rotate an individual character crop 90 degrees and getting better recognition. I assume that changing the training data would not affect this. Do you believe that changing the training input data would resolve this issue?

I did not use a font to train this language, these are actual binarized letter samples from real data (license plates). I'm not sure I understand your comment "font seem to be common." Can you help me understand what you mean?

My use case is individual character recognition, rather than words/lines of characters. The recognition is only used for one character at a time (no segmentation) so I don't believe the order of the characters in the tif/box matters. Am I mistaken? Do you expect that changing the order of the characters could affect individual character recognition? My experimentation with Tesseract leads me to believe that this is not true.

zdenop commented 8 years ago

First I all: "you assume" and you ask as to prove it ;-). If you want to create acceptable issue, please use google traineddata and not modified tesseract.

Next: tesseract is known that it requires to follow training procedure strictly. If you decide to not follow it, please do not create issue, but use tesseract user forum instead.

Next: Community experience is that if your font looks common (e.g. it is very similar to fonts listed e.g. in https://github.com/tesseract-ocr/tessdata/blob/master/eng.cube.size) training is waste of time (you will get worse results). You should focus on image preprocessing instead.

amitdo commented 8 years ago

With the default eng.traineddata I get:

tesseract z.png - -psm 10
N

So use the default traineddata and Improve the input image if needed.

amitdo commented 8 years ago

... fonts listed e.g. in https://github.com/tesseract-ocr/tessdata/blob/master/eng.cube.size

A better link: https://github.com/tesseract-ocr/tesseract/blob/master/training/language-specific.sh

List of fonts to train on LATIN_FONTS=( "Arial Bold" \ "Arial Bold Italic" \ "Arial Italic" \ "Arial" \ "Courier New Bold" \ "Courier New Bold Italic" \ "Courier New Italic" \ "Courier New" \ "Times New Roman, Bold" \ "Times New Roman, Bold Italic" \ "Times New Roman, Italic" \ "Times New Roman," \ "Georgia Bold" \ "Georgia Italic" \ "Georgia" \ "Georgia Bold Italic" \ "Trebuchet MS Bold" \ "Trebuchet MS Bold Italic" \ "Trebuchet MS Italic" \ "Trebuchet MS" \ "Verdana Bold" \ "Verdana Italic" \ "Verdana" \ "Verdana Bold Italic" \ "URW Bookman L Bold" \ "URW Bookman L Italic" \ "URW Bookman L Bold Italic" \ "Century Schoolbook L Bold" \ "Century Schoolbook L Italic" \ "Century Schoolbook L Bold Italic" \ "Century Schoolbook L Medium" \ "DejaVu Sans Ultra-Light" \ )