gnewtothis101 / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

Tess3.01 sometimes fails to handle curly quotes correctly. #708

Closed by GoogleCodeExporter 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Use Tess3.01
2. Use the provided .zip which has traineddata and image and box files.
3. Run OCR on the image. It fails by mis-parsing the curly quotes
at the top of the page, and it happens again further down.

What is the expected output? What do you see instead?
Tesseract should parse the curly quotes as part of the first line
of the input scan GortOir0005_1L.tif, instead of producing
the gibberish that we actually get.

What version of the product are you using? On what operating system?
Tess 3.01 on Windows 7.

Please provide any additional information below.
This is the problem I referred to before as the "High curly quotes"
font problem.  This Irish font has capital letters that extend high,
for instance the stem of the capital-b which looks just like
a bigger lower-case b.  Then it has to have room for a dot diacritical
over that consonant.  So the total lineheight is a little higher than 
most fonts and languages would have.  And the font designer decided to 
make the curly double quotes touching that highest point, so they are 
a little higher than even some other Irish fonts typically have. 

If I use gimp to edit that first screwed up line,
I can lower those left and right curly quotes about 4 to 10 pixels 
(the images are in 600dpi), and then Tess can parse the layout ok.

It would be great if you could fix this.
Of my 75 page book, only 3 pages had problems
with the curly quotes, the rest parsed the curlys ok.

Original issue reported on code.google.com by g...@folkplanet.com on 18 May 2012 at 5:12

GoogleCodeExporter commented 9 years ago

Nick seems to have found a setting that might help!

On Fri, Jun 01, 2012 at 10:16:52AM +0100, Nick White wrote:
> On Wed, May 23, 2012 at 05:39:00PM +0100, Nick White wrote:
> > On Tue, May 22, 2012 at 05:21:23AM -0700, Galt wrote:
> > > On May 21, 2:04 am, Nick White <nick.wh...@durham.ac.uk> wrote:
> > > > I've been suffering a very similar problem with some of the text I'm
> > > > training, which has several diacritics above and below glyphs. It
> > > > isn't infrequent to find quite a few lines of garbage which are some
> > > > of the diacritics taking a line, which then causes the following and
> > > > preceding lines to not include said diacritics.
>
> I wonder, is there any way of harnessing the Tesseract API or
> configuration options to affect line height and line detection? I
> can't seem to make the above problem go away.

I finally solved this problem for my case! I found the configuration
setting 'textord_min_linesize'. With this I can assure Tesseract
that lines the size of accents should never be considered, and the
problem goes away entirely. I set the value to 2.5, twice the
default, after trial-and-error.

Nick 

Original comment by g...@folkplanet.com on 23 Jul 2012 at 5:46
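
For anyone wanting to try the setting Nick describes, here is a minimal sketch of one way to apply it: write a small config file containing `textord_min_linesize 2.5` and pass it as the last argument on the tesseract command line. It assumes config files use one `name value` pair per line and that the command-line form is `tesseract image outputbase -l lang configfile`. The helper names, the output base `out`, the config file name `tall_quotes` and the language code `gle` are all hypothetical placeholders, not names from this thread, and 2.5 is only the value that worked for Nick's text.

```python
import os
import tempfile


def write_config(path, params):
    """Write a Tesseract-style config file: one 'name value' pair per line."""
    with open(path, "w", encoding="ascii") as f:
        for name, value in params.items():
            f.write(f"{name} {value}\n")


def build_command(image, output_base, lang, config_path):
    """Build the argument list; the config file goes last (assumed CLI form)."""
    return ["tesseract", image, output_base, "-l", lang, config_path]


if __name__ == "__main__":
    # Hypothetical file names; 'gle' stands in for whatever traineddata you use.
    config_path = os.path.join(tempfile.mkdtemp(), "tall_quotes")
    write_config(config_path, {"textord_min_linesize": 2.5})
    cmd = build_command("GortOir0005_1L.tif", "out", "gle", config_path)
    # Print the command instead of running it, so this works without Tesseract.
    print(" ".join(cmd))
```

The sketch only prints the command. To run it for real, pass the list to `subprocess.run` on a machine with Tesseract installed, and tune the value by trial and error as Nick did.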

GoogleCodeExporter commented 9 years ago
I will add this hint to the FAQ.

Original comment by zde...@gmail.com on 23 Jul 2012 at 10:07