raffaeldantas / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

Bug with text2image or pango #1336

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Hi there. While generating box files for training with text2image, I noticed
the following issue: when the first line contains uppercase accented
characters, the box cuts off half of the accent, probably because of a page
boundary issue. When a blank line is inserted before that line, the boxes
appear to render correctly. This can be reproduced by creating a text file
containing:

İÇİNDE

text2image --leading=8 --fonts_dir=fonts --box_padding=0 \
  --strip_unrenderable_words --char_spacing=0.0 --exposure=0 \
  --font='Calibri Bold' --outputbase=out --text=file --ligatures=1 \
  --degrade_image=0

and then viewing the result in a box editor (I'm using jTessBoxEditor).

When a blank line is inserted above this word in the file the boxes render 
correctly.
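
The clipping described above can also be checked without a box editor by reading the box coordinates directly. The following is a minimal sketch, not part of text2image: it assumes the standard box-file layout of one glyph per line as "<char> <left> <bottom> <right> <top> <page>", with the origin at the bottom-left corner of the page. The names parse_box_lines and boxes_touching_top, the margin parameter, and the sample coordinates are hypothetical and chosen only for illustration.

```python
from typing import Iterable, List, NamedTuple


class Box(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_lines(lines: Iterable[str]) -> List[Box]:
    """Parse box-file lines of the form '<char> <left> <bottom> <right> <top> <page>'."""
    boxes = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # Split from the right so that only the five numeric fields are separated.
        parts = line.rsplit(" ", 5)
        if len(parts) != 6:
            continue  # not a well-formed box line
        char, *nums = parts
        left, bottom, right, top, page = map(int, nums)
        boxes.append(Box(char, left, bottom, right, top, page))
    return boxes


def boxes_touching_top(boxes: List[Box], image_height: int, margin: int = 0) -> List[Box]:
    """Return boxes whose top edge is within `margin` pixels of the top of the image.

    Box coordinates are measured from the bottom-left corner, so the top of
    the page corresponds to y == image_height.
    """
    return [b for b in boxes if b.top >= image_height - margin]


if __name__ == "__main__":
    # Hypothetical coordinates for illustration only, not taken from a real run.
    sample = ["İ 100 3200 130 3300 0", "Ç 140 3180 190 3270 0"]
    for box in boxes_touching_top(parse_box_lines(sample), image_height=3300):
        print(box)
```

A box that reaches the top edge of the image in the run without a leading blank line, but not in the run with one, would be consistent with the accent being cut off at the page boundary. The image height has to be read from the rendered image itself.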

Mark

Original issue reported on code.google.com by zea...@gmail.com on 10 Oct 2014 at 6:57

GoogleCodeExporter commented 9 years ago
Hello, I am not sure I understood your report correctly. Please check the
attached test case.
I used the free font 'Roboto Bold' instead of 'Calibri Bold',
and the resulting out.box file seems to be correct.

Original comment by zde...@gmail.com on 22 Apr 2015 at 5:22

GoogleCodeExporter commented 9 years ago

Original comment by zde...@gmail.com on 2 May 2015 at 8:18