patcharats / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

Incorrect (rubbish) output for "eurotext.tif" from manually generated data files (as a test) based on "phototest.tif" #73

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Generated the 8 data files from phototest.tif.
2. All 8 generated data files are prefixed with "ph." (ph.xxxx).
3. Renamed all the eng data files to "eng.xxxx 1" and then:
(1) Ran "tesseract phototest.tif photo -l ph" -- the output in photo.txt is
OK, without a single mistake.
(2) Ran "tesseract eurotext.tif ph-euro-test -l ph" -- the output in
euro-test.txt is not correct; it is rubbish.

What is the expected output? What do you see instead?
The output from the generated ph data files (based on phototest.tif) should be
correct, as it is with the default eng data files.
Instead, the output from the ph data files is rubbish for "eurotext.tif".

What version of the product are you using? On what operating system?
tesseract-ocr 2.01 on Windows XP

Please provide any additional information below.
Because there were problems with Kannada and handwritten text, I generated the
8 data files from phototest.tif as a test check.
All 8 generated data files are prefixed with "ph." (ph.xxxx).

Experiment I:
Renamed all the eng data files to "eng.xxxx 1" and then:

(1) Ran "tesseract phototest.tif photo -l ph" -- the output in photo.txt is
perfect, without a single mistake.
(2) Ran "tesseract eurotext.tif ph-euro-test -l ph" -- the output in
euro-test.txt is not correct; it is rubbish.

Experiment II:
Restored all the eng data files to "eng.xxxx" and then:
(1) Ran "tesseract phototest.tif photo -l eng" -- the output in photo.txt is
OK.
(2) Ran "tesseract eurotext.tif eng-euro-test2 -l eng" -- the output in
euro-test.txt is OK (readable).
(3) Ran "tesseract eurotext.tif ph-euro-test3.ph" -- the output in
euro-test.txt is NOT readable; it is rubbish.

In both cases, the words_list, user-words, and DangAmbigs files were left blank.
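
For anyone wanting to repeat the two experiments, here is a minimal Python sketch of the runs above. It only assembles the "tesseract <image> <output base> -l <lang>" command lines quoted in this report; the helper names (build_command, RUNS, run_all) are made up for illustration and are not part of Tesseract. It assumes the tesseract binary is on PATH and that the ph and eng data files are already installed. The third command of Experiment II is left out because its arguments are not clear as written.

```python
import shutil
import subprocess


def build_command(image, output_base, lang):
    """Return the argv list for: tesseract <image> <output_base> -l <lang>."""
    return ["tesseract", image, output_base, "-l", lang]


# (image, output base, language) for the runs quoted in the report.
# "ph" is the reporter's own language prefix, not a stock Tesseract language.
RUNS = [
    ("phototest.tif", "photo", "ph"),          # Experiment I, run 1
    ("eurotext.tif", "ph-euro-test", "ph"),    # Experiment I, run 2
    ("phototest.tif", "photo", "eng"),         # Experiment II, run 1
    ("eurotext.tif", "eng-euro-test2", "eng"), # Experiment II, run 2
]


def run_all(runs=RUNS):
    """Build every command; execute them only if tesseract is on PATH."""
    commands = [build_command(*run) for run in runs]
    if shutil.which("tesseract") is None:
        return commands  # binary not found, so nothing is executed
    for cmd in commands:
        # check=False: a failed run should not stop the remaining ones.
        subprocess.run(cmd, check=False)
    return commands
```

Calling run_all() from the directory that holds the two .tif files returns the command lists and, when the binary is found, writes one .txt file per output base next to them.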

It appears that the default eng data files were built from "phototest.tif", and
with them the output for "eurotext.tif" is correct. Yet with data files that I
generated manually, as a test, from the same image ("phototest.tif"), the output
for "eurotext.tif" is rubbish. Why? How were the default eng data files
built/generated?

Original issue reported on code.google.com by withbles...@gmail.com on 8 Oct 2007 at 1:31

GoogleCodeExporter commented 9 years ago
This is why OCR isn't as easy as it looks! The two images are in different
fonts, and you need to train on a reasonable spectrum of fonts to get decent
accuracy.

Original comment by theraysm...@gmail.com on 28 Dec 2008 at 7:38