Kannada -boxing problems

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Run: tesseract brh.tif brh batch.nochop makebox
2. This generates brh.txt.
3. Use bb-tesseractV1.0b to draw boxes from the generated brh.txt.

What is the expected output? What do you see instead?
Only a few of the Kannada characters in the image were boxed based on brh.txt.
Expected: every Kannada character in the "brh.tif" file should appear in the
generated "brh.txt" file.

What version of the product are you using? On what operating system?
tesseract-2.0 on Windows XP

Please provide any additional information below.
It appears that tesseract failed to write all of the characters in "brh.tif"
to the output file "brh.txt" (in other words, only some of the characters in
"brh.tif" were written to "brh.txt").
I checked with the bb-tesseract software: boxes were drawn only for the
entries in "brh.txt", which shows that tesseract was unable to generate boxes
for the complete set of Kannada characters in "brh.tif".

Original issue reported on code.google.com by withbles...@gmail.com on 26 Aug 2007 at 5:16

Attachments:

bbTesseract.exe
[bbTesseract.exe 2](https://storage.googleapis.com/google-code-attachments/tesseract-ocr/issue-60/comment-0/bbTesseract.exe 2)
[brh log.txt](https://storage.googleapis.com/google-code-attachments/tesseract-ocr/issue-60/comment-0/brh log.txt)
[brh using bbt.PNG](https://storage.googleapis.com/google-code-attachments/tesseract-ocr/issue-60/comment-0/brh using bbt.PNG)
brh.tif
brh.txt
[kannada brh.txt](https://storage.googleapis.com/google-code-attachments/tesseract-ocr/issue-60/comment-0/kannada brh.txt)

GoogleCodeExporter commented 9 years ago

After testing with 3 or 4 different sample .tif files, the same problem still
exists. In other words, the command line "tesseract xyz.tif xyz batch.nochop
makebox" fails to generate boxes for the complete set of characters in the
image; it generates only 25% to 40% of the expected boxes.

I suspect there is a bug in the source code behind "batch.nochop makebox",
which requires detailed investigation.

Once Tesseract OCR succeeds in generating 100% of the boxes for the characters
in the image at this initial stage, one can easily generate the 8 data files
for the relevant language without any problems.
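
The 25% to 40% figure can be checked by counting the entries in the generated file. Below is a minimal sketch, assuming each line of the makebox output has the form "symbol left bottom right top". The sample lines and the expected glyph count of 8 are made up for illustration, and the helper names are hypothetical, not part of tesseract.

```python
from typing import Iterable, List, NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int


def parse_box_lines(lines: Iterable[str]) -> List[Box]:
    """Parse lines assumed to look like 'symbol left bottom right top'."""
    boxes = []
    for line in lines:
        parts = line.split()
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        try:
            left, bottom, right, top = (int(p) for p in parts[1:5])
        except ValueError:
            continue  # skip lines whose coordinates are not integers
        boxes.append(Box(parts[0], left, bottom, right, top))
    return boxes


def coverage(boxes: List[Box], expected_glyphs: int) -> float:
    """Fraction of the expected glyphs that received a box."""
    return len(boxes) / expected_glyphs


# Illustrative data only; in practice read the lines from brh.txt.
sample = ["a 10 20 30 45", "b 35 20 52 44", ""]
boxes = parse_box_lines(sample)
print(f"{len(boxes)} boxes, coverage {coverage(boxes, 8):.0%}")
```

The expected glyph count has to be counted by hand from the image, since that is exactly what tesseract fails to find here.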

Original comment by withbles...@gmail.com on 4 Sep 2007 at 7:09

GoogleCodeExporter commented 9 years ago

These characters are too small to use as training data, or for recognition. You
should be training with characters 20-30 pixels high, equivalent to about 10pt
at 300 dpi. That is 30-40pt at 75-100 dpi screen resolution. This problem is a
duplicate of issue 61.
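
The size equivalence above follows from one point being 1/72 inch. Here is a small sketch of the arithmetic; the function name is hypothetical. It computes the nominal em height, and the assumption is that actual character bodies are some font-dependent fraction of that, which is how a roughly 42 pixel em can correspond to characters 20-30 pixels high.

```python
def em_height_px(point_size: float, dpi: float) -> float:
    """Nominal em height in pixels; one point is 1/72 inch."""
    return point_size * dpi / 72.0


# The three size/resolution pairs mentioned above give the same em height.
for pt, dpi in [(10, 300), (30, 100), (40, 75)]:
    print(f"{pt}pt at {dpi} dpi: {em_height_px(pt, dpi):.1f} px em height")
```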

Original comment by theraysm...@gmail.com on 6 Sep 2007 at 12:51

Changed state: Duplicate

patcharats / tesseract-ocr

Kannada -boxing problems #60