gnewtothis101 / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

Devanagari - similar looking glyphs misrecognized #1333

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Run tesseract with the Devanagari traineddata on the attached image.

What is the expected output? What do you see instead?
I expect the output to be consistent as well as accurate. Right now, on the same 
image, in the same font and size, tesseract gives two different outputs for the 
same character shape.

What version of the product are you using? On what operating system?
Latest version from git, on Windows 8 under MSYS.

Please provide any additional information below.

Attached are the tif file and the box file generated using box.train.
The png file has red rectangles marking the shapes in question.

The glyphs are री and रो:

री  U+0930 U+0940
रो  U+0930 U+094B
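
To double-check which code points make up the two glyphs, a small sketch like the following could be used. It relies only on Python's standard `unicodedata` module; the helper name `describe` is hypothetical and not part of tesseract.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

# The two glyphs from this report, written as escapes to avoid any
# ambiguity about which code points are meant.
GLYPHS = {
    "ri": "\u0930\u0940",  # RA + vowel sign II
    "ro": "\u0930\u094b",  # RA + vowel sign O
}

for label, glyph in GLYPHS.items():
    print(label, glyph)
    for code_point, name in describe(glyph):
        print("   ", code_point, name)
```

Both glyphs share the base letter U+0930 and differ only in the dependent vowel sign that follows it.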

Original issue reported on code.google.com by shreeshrii on 10 Oct 2014 at 8:21


GoogleCodeExporter commented 9 years ago
Another sample page, run with psm 6, in which the same word is recognized in several different ways.

The accurate recognition is 'नामावलिः'.
The recognized text includes that and many other variations, such as:
नामावतिःन् 
नामावलिः>
नामावळिः
नामावतिः
नामावठिः
नामावठिः '
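
To see exactly which characters are being confused in these variations, each one can be diffed against the expected word. This is a minimal sketch using Python's standard `difflib`; the function name `confusions` is hypothetical, and the strings are written as escapes so the code points are unambiguous.

```python
from difflib import SequenceMatcher

# Expected word: NA, AA, MA, AA, VA, LA, I, VISARGA
EXPECTED = "\u0928\u093e\u092e\u093e\u0935\u0932\u093f\u0903"

# Some of the recognized variations listed above.
VARIANTS = [
    "\u0928\u093e\u092e\u093e\u0935\u0924\u093f\u0903",  # LA read as TA
    "\u0928\u093e\u092e\u093e\u0935\u0933\u093f\u0903",  # LA read as LLA
    "\u0928\u093e\u092e\u093e\u0935\u0920\u093f\u0903",  # LA read as TTHA
]

def confusions(expected, recognized):
    """Return (expected_span, recognized_span) pairs where the strings differ."""
    matcher = SequenceMatcher(None, expected, recognized, autojunk=False)
    return [
        (expected[i1:i2], recognized[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

for variant in VARIANTS:
    print(variant, confusions(EXPECTED, variant))
```

In the three variants above, the only differing character is ल (U+0932), which is replaced by त, ळ, or ठ.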

Original comment by shreeshrii on 12 Oct 2014 at 3:18


GoogleCodeExporter commented 9 years ago
Does it still happen when you create a unicharambigs file containing the 
confusions seen in those erroneous outputs?

https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3

The last file (unicharambigs)
The final data file that Tesseract uses is called unicharambigs. It describes 
possible ambiguities between characters or sets of characters, and is manually 
generated. To understand the file format, look at the following example:

Example line     Explanation
2 ' ' 1 " 1      A double quote (") should be substituted whenever 2 consecutive single quotes (') are seen.
1 m 2 r n 0      The characters 'rn' may sometimes be recognized incorrectly as 'm'.
3 i i i 1 m 0    The character 'm' may sometimes be recognized incorrectly as the sequence 'iii'.
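
The layout of those example lines can be read as: a count, that many characters, a second count, that many characters, and a final flag. The sketch below parses a line on that reading. It assumes whitespace-separated fields exactly as in the examples quoted above, and `parse_ambig_line` is a hypothetical helper, not a tesseract API.

```python
def parse_ambig_line(line):
    """Split one unicharambigs-style line into (first_set, second_set, flag).

    Assumes whitespace-separated fields laid out as:
    <n1> <n1 characters> <n2> <n2 characters> <flag>
    """
    fields = line.split()
    n_first = int(fields[0])
    first = fields[1:1 + n_first]
    n_second = int(fields[1 + n_first])
    second_start = 2 + n_first
    second = fields[second_start:second_start + n_second]
    flag = int(fields[second_start + n_second])
    return first, second, flag

for example in ("2 ' ' 1 \" 1", "1 m 2 r n 0", "3 i i i 1 m 0"):
    print(parse_ambig_line(example))
```

Because the line is split on whitespace, a Devanagari character made of several code points stays together as one entry, provided it is written without internal spaces.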

Original comment by dalbirsi...@googlemail.com on 4 Feb 2015 at 11:28