50% of output text are rubbish

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Generate the kan data files from ybkan-16pt.bmp.
2. Run "tesseract.exe ybkan-16pt.bmp ybkan-16pt -l kan".
3. The output (ybkan-16pt.txt) does not reproduce all of yb-kan.bmp: about
50% is reproduced, with roughly 2% mistakes, and the remaining 50% is rubbish.

What is the expected output? What do you see instead?
100% of the text in the bmp file should be reproduced. Instead, 50% is rubbish.

What version of the product are you using? On what operating system?
tesseract 2.01   XP

Please provide any additional information below.
The issue is similar to the phototest.tif issue (No. 73) above.
With any bmp file other than ybkan-16pt.bmp, the output is rubbish.

Original issue reported on code.google.com by withbles...@gmail.com on 8 Oct 2007 at 3:23

GoogleCodeExporter commented 9 years ago

Continuing from the above, here are further observations.
Tesseract OCR result for the bmp file used for training (ybkan-16pt.bmp):
     Total characters used in the image:  approx. 760
     Displayed correctly:                 255, of which 44 contain mistakes
     Rubbish:                             505
Note: I increased the number of classes to 2560 and len to 12 in the source
code, used three copies of the same bmp file, and rebuilt. There was no
improvement over the earlier 512 classes; the output was the same with 512
classes and with 2560 classes.
What is obvious to me is that 255 characters are displayed correctly
(disregarding the 44 errors). A byte variable holds 0 - 255, so it appears that
some byte values in the relevant Tesseract source code may need to be changed
to integers. Just increasing the number of classes did not help.
A detailed re-examination of the code is required.
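
To illustrate the byte suspicion, here is a minimal Python sketch. It is not Tesseract code: `store_in_byte` and `count_preserved` are hypothetical helpers, and the assumption that a class index passes through an unsigned 8-bit variable is only the guess stated above, not something confirmed in the source.

```python
def store_in_byte(class_id: int) -> int:
    """Mimic assigning an int to an unsigned 8-bit variable (assumption):
    only the low 8 bits survive, so values above 255 wrap around."""
    return class_id & 0xFF


def count_preserved(num_classes: int) -> int:
    """Count how many class ids in [0, num_classes) come back unchanged
    after a round trip through the hypothetical 8-bit storage."""
    return sum(
        1 for class_id in range(num_classes)
        if store_in_byte(class_id) == class_id
    )


if __name__ == "__main__":
    total = 760  # approximate character count reported above
    preserved = count_preserved(total)
    print(f"preserved: {preserved}, corrupted: {total - preserved}")
    print(f"id 600 is read back as {store_in_byte(600)}")
```

Under this assumption only ids 0 to 255 survive, which gives a split close to the 255 / 505 observed. That is suggestive at best; it does not show where, or whether, such a variable exists in Tesseract.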
With Regards.

Original comment by withbles...@gmail.com on 9 Oct 2007 at 5:25

GoogleCodeExporter commented 9 years ago

This was fixed in an earlier version.

Original comment by theraysm...@gmail.com on 28 Dec 2008 at 7:42

Changed state: Fixed

50% of output text are rubbish #74