In-complete OCR result - Githubissues

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. ./tesseract 1.tiff output

What is the expected output? What do you see instead?
Expected output: 11 State/Payer`s state no. MA 08-61738
Seen output: 11 State/Payc-1r`s stale no. MA

What version of the product are you using? On what operating system?
tesseract revision: 2.00
OS: fedora core 6

Please provide any additional information below.
The output does not show any OCR result for the numbers after MA in the
image (1.tiff). It is simply blank (not even an incorrect result).
If I cut just the numbers after the MA text into a separate image (2.tiff),
I get the correct result for the numbers.

I get the same results with version 1.03 as well.

I tried to debug this and find the reason. Here are my observations, but I
could not fix it.

TessBaseAPI::TesseractRect()-->RecognizeToString()
                                   |-->FindLines(&block_list)
                                   |
                                   |---->Recognize(&block_list, NULL)

When I printed the block list after FindLines(&block_list),
the WERDs that had no OCR result had all of their blobs rejected.
Here is the dump of one such WERD:
Blanks= 1
Bounding box=(471,24)->(571,55)
Flags = 0 = 00
   W_SEGMENTED = FALSE 
   W_ITALIC = FALSE 
   W_BOL = FALSE 
   W_EOL = FALSE 
   W_NORMALIZED = FALSE 
   W_POLYGON = FALSE 
   W_LINEARC = FALSE 
   W_DONT_CHOP = FALSE 
   W_REP_CHAR = FALSE 
   W_FUZZY_SP = FALSE 
   W_FUZZY_NON = FALSE 
Correct= 
Rejected cblob count = 5

This WERD had 5 blobs and all of them were rejected (I do not know why).

After this, when Recognize(&block_list, NULL) is called to recognize the
WERDs, it does not consider WERDs whose blobs were all rejected, and it gives
a blank string for them.

Original issue reported on code.google.com by lohith...@gmail.com on 19 Jul 2007 at 10:53

GoogleCodeExporter commented 9 years ago

This is a known problem.

The FindLines code assumes that each rectangle given to it is composed of a
single size of text and that, although the baseline may be curved, it does not
shift suddenly. While it will often succeed when these rules are broken, there
is a much higher probability that the text will be lost or just wrong.

The problem in the example you give is unique to forms processing (more or
less), and although it may be fixed in a future version, it will most likely
be a distant future version. In the meantime, you could try to cut out
rectangles of similar-sized characters...
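
As an illustration of the "rectangles of similar-sized characters" workaround, here is a minimal Python sketch. It assumes you already have character or blob bounding boxes as (left, top, right, bottom) pixel tuples from some earlier step. The function names, the box format, and the 25% height tolerance are all hypothetical choices made for this example; none of this is part of Tesseract.

```python
def group_boxes_by_height(boxes, tolerance=0.25):
    """Group boxes so that every box in a group is at most `tolerance`
    (as a fraction) taller than the shortest box in that group.

    `boxes` is a list of (left, top, right, bottom) tuples. Both the
    format and the default tolerance are assumptions for illustration.
    """
    groups = []
    # Visit boxes from shortest to tallest so each group starts with
    # its shortest member, which serves as the reference height.
    for box in sorted(boxes, key=lambda b: b[3] - b[1]):
        height = box[3] - box[1]
        if groups:
            ref = groups[-1][0]
            ref_height = ref[3] - ref[1]
            if height <= ref_height * (1 + tolerance):
                groups[-1].append(box)
                continue
        groups.append([box])
    return groups


def bounding_rect(boxes):
    """Smallest rectangle that contains every box in `boxes`."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )
```

Each group's bounding rectangle could then be cropped out and passed to tesseract as its own image, much like the 2.tiff experiment in the original report. Grouping by height alone ignores where the boxes sit on the page, so rectangles from different groups may overlap on a real form.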

Original comment by theraysm...@gmail.com on 19 Jul 2007 at 3:34

Changed state: Accepted

GoogleCodeExporter commented 9 years ago

I solved this the same way I solved the "digits" problem (issue #164).
Since you're parsing a form, you probably know where each element to
recognize is located on it.
My method is to:
- Rotate the form by an automatically detected angle (I use the black areas
around the scan to determine two corner points, then take the atan of the
slope between them, which gives me the angle).
- Crop the garbage generated by the rotation all around the picture (easy if
you know the angle and at least 3 corners of the document; I first shear it
so that the angle formed by the 3 points is 90°, then crop).
- Determine the "type" of your form, if you're processing many types. I do it
with colorimetry and placemark analysis.
- Extract each element, not with absolute coordinates but with coordinates
relative to the document size (basically, each x/y pair is a percentage of
the document's width/height).
- Voilà. You just extract the pieces and parse each one individually with
tesseract.
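
Two of the steps above (the angle from two corner points, and the percentage-based coordinates) can be sketched in Python as follows. This is only a sketch under stated assumptions: the function names and tuple layouts are hypothetical, corner detection is assumed to have happened already, and math.atan2 is used in place of a plain atan on the slope so that a vertical edge does not cause a division by zero.

```python
import math


def skew_angle_degrees(p1, p2):
    """Angle, in degrees, of the edge from p1 to p2 relative to the x axis.

    p1 and p2 are (x, y) corner points that should lie on a horizontal
    edge of the form once it is deskewed (an assumed input format).
    """
    dx = p2[0] - p1[0]
    dy = p2[1] - p1[1]
    return math.degrees(math.atan2(dy, dx))


def relative_to_pixels(rel_box, width, height):
    """Convert a box given in percent of the page size to pixel coordinates.

    rel_box is (x0, y0, x1, y1), each value between 0 and 100; width and
    height are the page dimensions in pixels.
    """
    x0, y0, x1, y1 = rel_box
    return (
        round(x0 * width / 100),
        round(y0 * height / 100),
        round(x1 * width / 100),
        round(y1 * height / 100),
    )
```

For example, relative_to_pixels((10, 20, 50, 25), 2000, 1000) gives (200, 200, 1000, 250). The same percentage box maps onto scans of any resolution, which is the point of using relative coordinates.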

Hope that helps,
Pierre.

Original comment by hicksc...@gmail.com on 4 Apr 2010 at 11:49

GoogleCodeExporter commented 9 years ago

Original comment by theraysm...@gmail.com on 20 May 2010 at 6:53

Changed state: Look-here-for-help

GoogleCodeExporter commented 9 years ago

A reference to this issue was posted in the FAQ.

Original comment by zde...@gmail.com on 2 Jan 2013 at 12:44

Changed state: No-longer-an-issue

patcharats / tesseract-ocr

In-complete OCR result #44