Feature request: An option to output position of text

patcharats / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr


Feature request: An option to output position of text #53

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago

It would be really useful to be able to see which area of the image
produced which text.

I am the author of gscan2pdf (http://gscan2pdf.sourceforge.net/), which can
use tesseract on an image before embedding the output in a PDF or DjVu. It
would be very useful to be able to embed each word more or less where the
original was in the scan.

Perhaps the output from tesseract could be an XML file:

<text><position>10 10 500 50</position>word</text>

Original issue reported on code.google.com by jeffrey....@gmail.com on 6 Aug 2007 at 7:03

GoogleCodeExporter commented 9 years ago

Maybe you can use the generated box file, as it includes each character and its coordinates (left, bottom, right, top). But you only get each letter, not the whole word.

Original comment by struther...@gmail.com on 6 Aug 2007 at 11:15
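
Editor's note: as a rough illustration of the box file mentioned in the comment above, here is a minimal Python sketch that parses one box-file line. It assumes each line holds a character followed by left, bottom, right and top, separated by whitespace, and that the origin is the bottom-left corner of the image. The names `CharBox`, `parse_box_line` and `to_top_left_origin` are hypothetical helpers, not part of tesseract.

```python
from typing import NamedTuple, Tuple


class CharBox(NamedTuple):
    """One character and its bounding box (assumed bottom-left origin)."""
    char: str
    left: int
    bottom: int
    right: int
    top: int


def parse_box_line(line: str) -> CharBox:
    """Parse a line such as 'w 10 50 30 70' into a CharBox.

    Any fields after the first five are ignored, since some versions
    may append extra columns (an assumption, not verified here).
    """
    parts = line.split()
    if len(parts) < 5:
        raise ValueError(f"expected at least 5 fields, got: {line!r}")
    left, bottom, right, top = (int(p) for p in parts[1:5])
    return CharBox(parts[0], left, bottom, right, top)


def to_top_left_origin(box: CharBox, image_height: int) -> Tuple[int, int, int, int]:
    """Convert to (x, y, width, height) with y measured from the top edge."""
    return (
        box.left,
        image_height - box.top,
        box.right - box.left,
        box.top - box.bottom,
    )
```

The conversion helper is there because many image and PDF toolkits measure y from the top of the page, so the box coordinates usually need flipping before the text can be placed.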

GoogleCodeExporter commented 9 years ago

That looks very promising. Is there any way of getting both the normal text output and the box file without running tesseract twice? I'm using:

tesseract image.tif text batch.nochop makebox

Original comment by jeffrey....@gmail.com on 6 Aug 2007 at 11:52

GoogleCodeExporter commented 9 years ago

I don't think this is currently included. I can't code C++, but maybe it is possible to add that function with a parameter on the command line.
I think boxes must be generated in both cases anyway, when creating the box file and when creating the text output.

Original comment by struther...@gmail.com on 6 Aug 2007 at 1:13

GoogleCodeExporter commented 9 years ago

OK. So my feature request should really read:

Please add an option to get both the normal text output and the box file without running tesseract twice.

This will allow me to use tesseract's word breaks (from the normal text output) without having to guess my own from the box file, and also to correctly position the text output, a word at a time, in the PDF or DjVu file at approximately the right font size.

Original comment by jeffrey....@gmail.com on 6 Aug 2007 at 1:27
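
Editor's note: the idea in the comment above, taking word breaks from the text output and positions from the box file, could be sketched in Python roughly as follows. This is only an illustration under strong assumptions: that the box file lists exactly one box per non-whitespace character, in the same order as the text output, and that the XML position is written as left, bottom, right, top. The functions `word_boxes` and `to_xml` are hypothetical, and the XML shape follows the example proposed in the original report.

```python
from typing import List, Tuple
from xml.sax.saxutils import escape

Box = Tuple[int, int, int, int]  # left, bottom, right, top


def word_boxes(text: str, char_boxes: List[Tuple[str, Box]]) -> List[Tuple[str, Box]]:
    """Merge per-character boxes into one bounding box per word.

    Words come from splitting the text output on whitespace; each word
    consumes as many character boxes as it has characters.
    """
    result = []
    i = 0
    for word in text.split():
        chunk = char_boxes[i:i + len(word)]
        # Fail loudly if the two outputs do not line up as assumed.
        if "".join(c for c, _ in chunk) != word:
            raise ValueError(f"box file and text disagree at word {word!r}")
        boxes = [b for _, b in chunk]
        merged = (
            min(b[0] for b in boxes),  # leftmost edge
            min(b[1] for b in boxes),  # lowest bottom
            max(b[2] for b in boxes),  # rightmost edge
            max(b[3] for b in boxes),  # highest top
        )
        result.append((word, merged))
        i += len(word)
    return result


def to_xml(words: List[Tuple[str, Box]]) -> str:
    """Render words in the <text><position>...</position>word</text> shape."""
    return "\n".join(
        f"<text><position>{l} {b} {r} {t}</position>{escape(w)}</text>"
        for w, (l, b, r, t) in words
    )
```

In practice the two outputs may not align character for character, which is why the sketch raises an error instead of guessing; the box height of each word (top minus bottom) gives a rough basis for the font size.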

GoogleCodeExporter commented 9 years ago

See also Issue 59. Somebody will get to this soon. It is quite easy. On the other hand, someone from OCRopus may get round to hOCR output soon too.

Original comment by theraysm...@gmail.com on 6 Sep 2007 at 1:07

Changed state: Accepted

GoogleCodeExporter commented 9 years ago

Isn't it simply a matter of doing it all in a batch/shell file? Or are you running hundreds of commands per day?

Original comment by beaumon...@gmail.com on 12 Sep 2007 at 3:01

GoogleCodeExporter commented 9 years ago

gscan2pdf is running tesseract on the fly. It seems silly to run tesseract twice when it should be relatively straightforward to modify tesseract to produce the required output.

Original comment by jeffrey....@gmail.com on 12 Sep 2007 at 6:18

GoogleCodeExporter commented 9 years ago

Original comment by theraysm...@gmail.com on 30 Dec 2008 at 9:37

Changed state: Duplicate