jacklicn / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

Tesseract 3.01 has slightly different character recognition when switching from text to box output #552

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Run tesseract on the attached file twice: once with makebox and once without.

What is the expected output? What do you see instead?
The letters identified by the text and box runs do not quite match.  In 
"TSUS Item 950.10D" at the bottom of the page, the 950 portion comes out with 
an O (letter) instead of a 0 (digit) in the text output, but with the digit in 
the box output, with no other parameter differences.  I would expect a run with 
box output to produce the same characters as the regular text output, given 
otherwise identical parameters.  There are a couple of scripts on the web that 
try to combine both outputs to get bounding box info for words (such as for 
DjVu files), and this mismatch caused at least one of them to hang.
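
For anyone combining the two outputs, a minimal sketch of a comparison that reports mismatches instead of assuming the characters line up one-to-one. It assumes the box output has one symbol per line in the form "<symbol> <left> <bottom> <right> <top> <page>"; the function names are hypothetical and not part of tesseract.

```python
import difflib


def box_symbols(box_text):
    """Extract the recognized symbols from box-file content.

    Assumes each non-empty line has the form
    "<symbol> <left> <bottom> <right> <top> <page>".
    """
    symbols = []
    for line in box_text.splitlines():
        if not line.strip():
            continue
        # Split from the right so the symbol field is kept whole.
        fields = line.rstrip().rsplit(" ", 5)
        if len(fields) != 6:
            raise ValueError("unexpected box line: %r" % line)
        symbols.append(fields[0])
    return symbols


def text_symbols(plain_text):
    """Return the non-whitespace characters of the plain-text output."""
    return [ch for ch in plain_text if not ch.isspace()]


def find_mismatches(plain_text, box_text):
    """List (text_index, text_run, box_run) for every run that differs."""
    a = text_symbols(plain_text)
    b = box_symbols(box_text)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    mismatches = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            mismatches.append((i1, "".join(a[i1:i2]), "".join(b[j1:j2])))
    return mismatches
```

Called on a text run that reads "Item 95O.10D" and a box run whose symbols spell "Item950.10D" (the coordinates are made up for illustration), find_mismatches returns [(6, 'O', '0')], so a script can log the disagreement and continue rather than wait for the two streams to resynchronize.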

What version of the product are you using? On what operating system?
Tesseract 3.01, on Mac OS X 10.7.  Tesseract 3.00 on the same machine did not 
have this problem.  It's possible that one build (3.00) was using LibTiff 
directly and the other (3.01) was using Leptonica.

Please provide any additional information below.
The new hOCR output is probably a better way to get at the needed information, 
but it still seems odd to have different recognition when the output format is 
the only difference.

Original issue reported on code.google.com by clindb...@gmail.com on 28 Sep 2011 at 8:35


GoogleCodeExporter commented 9 years ago
I am afraid this is a problem of expectation without an understanding of the 
code. E.g. different parameters run different functions (see [1]), which use 
different settings (e.g. RIL_SYMBOL vs. RIL_PARA), and that could/should 
produce different output...

[1] http://code.google.com/p/tesseract-ocr/source/browse/trunk/api/baseapi.cpp?r=729#901

Original comment by zde...@gmail.com on 24 Jul 2012 at 9:20