Closed GoogleCodeExporter closed 9 years ago
This is a known problem.
The FindLines code assumes that each rectangle given to it contains text of a
single size and that the baseline, although it may be curved, does not shift
suddenly. It will often succeed when these rules are broken, but there is a much
higher probability that the text will be lost or just wrong.
The problem in the example you give is more or less unique to forms processing,
and although it may be fixed in a future version, it will most likely be a
distant future version. In the meantime, you could try to cut out rectangles of
similar-sized characters...
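One rough way to approach the "rectangles of similar-sized characters" workaround, as a sketch only: it assumes you already have per-character bounding boxes from some earlier step (not shown here), and the helper names `group_by_height` and `bounding_rect` and the 25% tolerance are made up for illustration. It groups by height alone and ignores position, so on a real form each group would probably still need to be split spatially before cropping.

```python
from typing import List, Tuple

# Hypothetical box layout: (left, top, right, bottom) in pixels.
Box = Tuple[int, int, int, int]


def group_by_height(boxes: List[Box], tolerance: float = 0.25) -> List[List[Box]]:
    """Group boxes whose height is within `tolerance` of the shortest box in the group."""
    groups: List[List[Box]] = []
    # Sort by height so each group starts with its shortest box.
    for box in sorted(boxes, key=lambda b: b[3] - b[1]):
        height = box[3] - box[1]
        if groups:
            ref = groups[-1][0]
            ref_height = ref[3] - ref[1]
            if height <= ref_height * (1 + tolerance):
                groups[-1].append(box)
                continue
        groups.append([box])
    return groups


def bounding_rect(group: List[Box]) -> Box:
    """Smallest rectangle covering every box in the group."""
    return (
        min(b[0] for b in group),
        min(b[1] for b in group),
        max(b[2] for b in group),
        max(b[3] for b in group),
    )


# Two small characters and two large ones (made-up coordinates).
boxes = [(0, 0, 10, 10), (12, 0, 22, 11), (0, 30, 30, 70), (35, 30, 65, 72)]
groups = group_by_height(boxes)
rects = [bounding_rect(g) for g in groups]
```

Each rectangle in `rects` could then be cropped out and passed to the recognizer separately.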
Original comment by theraysm...@gmail.com
on 19 Jul 2007 at 3:34
I solved this the same way I solved the "digits" problem (issue #164).
Since you're parsing a form, you probably know where each element to recognize
is located on it.
My method is to:
- Rotate the form by an automatically detected angle (I use the black areas
around the scan to determine two corner points, then take the atan of the slope
between them, which gives me the angle).
- Crop the garbage generated by the rotation all around the picture (easy if
you know the angle and at least 3 corners of the document; I first shear it so
that the angle at the 3 points is 90° and then crop).
- Determine the "type" of your form, if you're processing many types. I do it
with colorimetry and placemark analysis.
- Extract each element, not with absolute coordinates but with coordinates
relative to the document size (basically each x/y pair is a percentage of the
document's width/height).
- Voilà. You just extract things and parse each one individually with tesseract.
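As a minimal sketch of the angle and relative-coordinate steps in the list above (corner detection, rotation, shearing and form-type detection are not shown): the function names `skew_angle_degrees` and `percent_box_to_pixels` and all the numbers are hypothetical, and `atan2` is used in place of a plain atan of the slope so that a vertical edge does not cause a division by zero.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def skew_angle_degrees(corner_a: Point, corner_b: Point) -> float:
    """Angle, in degrees, of the line through two corners of an edge
    that should be horizontal on a straight scan."""
    dx = corner_b[0] - corner_a[0]
    dy = corner_b[1] - corner_a[1]
    return math.degrees(math.atan2(dy, dx))


def percent_box_to_pixels(
    percent_box: Tuple[float, float, float, float], width: int, height: int
) -> Tuple[int, int, int, int]:
    """Convert (left, top, right, bottom), given as percentages of the
    document size, into pixel coordinates for a document of width x height."""
    left, top, right, bottom = percent_box
    return (
        round(left / 100 * width),
        round(top / 100 * height),
        round(right / 100 * width),
        round(bottom / 100 * height),
    )


# Made-up example: the top edge drops 100 px over 100 px of width.
angle = skew_angle_degrees((0, 0), (100, 100))

# Made-up field position: 10%..50% of the width, 20%..25% of the height.
field = percent_box_to_pixels((10, 20, 50, 25), width=1000, height=2000)
```

Because the field positions are stored as percentages, the same form description works for scans of different resolutions once the page has been deskewed and cropped.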
Hope that helps,
Pierre.
Original comment by hicksc...@gmail.com
on 4 Apr 2010 at 11:49
Original comment by theraysm...@gmail.com
on 20 May 2010 at 6:53
Reference to this issue was posted in FAQ
Original comment by zde...@gmail.com
on 2 Jan 2013 at 12:44
Original issue reported on code.google.com by
lohith...@gmail.com
on 19 Jul 2007 at 10:53