jacklicn / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

Why is "11" recognized as "H" in Tesseract 3.01? #508

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Unzip tessdata.zip and rename magang.traineddata to eng.traineddata
2. tesseract.exe p58.bmp p58
3. tesseract.exe p17.bmp p17

What is the expected output? What do you see instead?
  Both pictures, p58.bmp and p17.bmp, are expected to
  be recognized as "11 05 11 MG I MG840E 11202935B".

This is the actual output:
1. Contents of output file p58.txt (good):
      11 05 11 MG I MG840E 11202935B   

2. Contents of output file p17.txt (abnormal):
      H 05 11I MG I MG840E 11202955B

The two pictures differ only slightly: the strokes in p58.bmp are solid, while
the strokes in p17.bmp are slightly broken, in a dashed style.
I wonder why "11" is read as "H" in p17.bmp and how to overcome it.
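
To make the differences concrete, the expected string and the p17.txt output can be compared token by token. This is only a minimal Python sketch; the function name token_mismatches is made up for illustration, and a positional comparison like this only works because both strings happen to have the same number of space-separated tokens.

```python
# Expected text and the p17.txt output, copied from the report above.
EXPECTED = "11 05 11 MG I MG840E 11202935B"
P17_OUTPUT = "H 05 11I MG I MG840E 11202955B"


def token_mismatches(expected, actual):
    """Return (index, expected_token, actual_token) for every differing token.

    Assumes both strings split into the same number of whitespace-separated
    tokens; otherwise a positional comparison is not meaningful.
    """
    exp_tokens = expected.split()
    act_tokens = actual.split()
    if len(exp_tokens) != len(act_tokens):
        raise ValueError("token counts differ; positional comparison does not apply")
    return [
        (i, e, a)
        for i, (e, a) in enumerate(zip(exp_tokens, act_tokens))
        if e != a
    ]


if __name__ == "__main__":
    for index, want, got in token_mismatches(EXPECTED, P17_OUTPUT):
        print(f"token {index}: expected {want!r}, got {got!r}")
```

For p17.txt this reports three differing tokens: the leading "11" read as "H", the second "11" read as "11I", and a "3" read as "5" in the last token.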

What version of the product are you using? On what operating system?
  The source code was taken directly from CS on 2011.06.30; it is probably version 3.01.
  The platform is Windows XP.

Please provide any additional information below.
    magang.traineddata is my own training data for a simplex font style.
    p58.bmp and p17.bmp are pictures of text engraved in a simplex font.
    All training-related files are included in tessdata.zip.

If the pictures are recognized with the training data provided with Tesseract OCR
instead, the results are as follows:
p58.bmp:
 11 O5 11 MG I I|G840E 112029355   
p17.bmp:
‘H 05 ‘I1 MG I MG84OE 112029355 
I don't care about these results; I only care about the result obtained with my own training data, why "11" is recognized as "H", and how to resolve it.

Thank you very much

Original issue reported on code.google.com by iqy...@163.com on 6 Jul 2011 at 5:23


GoogleCodeExporter commented 9 years ago
Try adding 11 to the frequent-word list; at least that worked for me. I also
recommend reading about unicharambigs, though that didn't work well for me.

Original comment by sirak2...@gmail.com on 20 Feb 2013 at 4:50
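
The unicharambigs idea mentioned in the comment above can be sketched as follows. This assumes the v1 unicharambigs layout described in the Tesseract 3.0x training documentation: a first line `v1`, then tab-separated fields giving the number of source characters, the space-separated source characters, the number of target characters, the space-separated target characters, and a type indicator (0 for an optional substitution, 1 for a mandatory one). The helper names ambig_line and unicharambigs_text, and the choice of an optional H to 1 1 entry, are illustrative assumptions and have not been tested against magang.traineddata.

```python
def ambig_line(source, target, mandatory=False):
    """Format one entry in the (assumed) v1 unicharambigs layout."""
    fields = [
        str(len(source)),           # number of characters in the source
        " ".join(source),           # source characters, space separated
        str(len(target)),           # number of characters in the target
        " ".join(target),           # target characters, space separated
        "1" if mandatory else "0",  # type indicator
    ]
    return "\t".join(fields)


def unicharambigs_text(entries):
    """Build the file text: a version line followed by one line per entry."""
    lines = ["v1"]
    lines.extend(ambig_line(src, dst, flag) for src, dst, flag in entries)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical entry: let "H" optionally be reconsidered as "1 1".
    print(unicharambigs_text([("H", "11", False)]), end="")
```

As far as I understand the training documentation, an optional entry only acts as a hint when the substitution turns a non-dictionary word into a dictionary word, so it would probably need to be combined with adding 11 to the frequent-word list. Every character used in an entry also has to be present in the unicharset, and the file has to be rebuilt into the traineddata with the training tools.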