patch to enable hOCR output

GoogleCodeExporter commented 9 years ago

Hi,

I propose a patch which implements support for the hOCR output format (with
page, line and word bounding boxes).

Additionally this patch causes tesseract to recognize the 'tiff' extension,
even if compiled without leptonica.

Original issue reported on code.google.com by amkryu...@gmail.com on 22 Nov 2009 at 4:31

Attachments:

tesseract-hocr.diff
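
For readers unfamiliar with hOCR, the following is a minimal sketch of how page, line and word bounding boxes are conventionally expressed: each element carries a `title` attribute of the form `bbox x0 y0 x1 y1`. The helper name `hocr_element`, the element ids, and the sample coordinates are illustrative assumptions; the thread does not show the markup the patch itself emits.

```python
from html import escape


def hocr_element(tag, css_class, element_id, bbox, inner_html):
    """Wrap inner_html in an hOCR-style element carrying a bbox in its title.

    bbox is (x0, y0, x1, y1) in pixels. The ids and classes used below follow
    the usual hOCR naming and are assumptions, not the patch's exact output.
    """
    x0, y0, x1, y1 = bbox
    return (
        f"<{tag} class='{css_class}' id='{element_id}' "
        f"title='bbox {x0} {y0} {x1} {y1}'>{inner_html}</{tag}>"
    )


# Nest a word inside a line inside a page (sample values only).
word = hocr_element("span", "ocrx_word", "word_1", (10, 20, 110, 50), escape("Hello"))
line = hocr_element("span", "ocr_line", "line_1", (10, 20, 300, 50), word)
page = hocr_element("div", "ocr_page", "page_1", (0, 0, 640, 480), line)
print(page)
```

Text content is escaped before being embedded, since recognized text may contain characters such as `&` or `<`.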

GoogleCodeExporter commented 9 years ago

Thanks for this. hOCR seems to be a good "standard" format which Cuneiform and
some commercial packages support. So I'd rather write code to parse and work
with it than with any one-off custom output formats...

But... I can't get it to compile after applying your patch.

I grabbed a 3.00 SVN copy of the tesseract code and got it to build earlier.
Then I downloaded your patch and applied it, followed by a "make clean" and
another "make", which this time does not complete cleanly.

Thoughts? I'm on Fedora 12, x86_64.

Thanks...

..snip..
make[3]: Nothing to be done for `all-am'.
make[3]: Leaving directory `/tmp/tesseract-ocr-read-only/tessdata'
make[2]: Leaving directory `/tmp/tesseract-ocr-read-only/tessdata'
Making all in testing
make[2]: Entering directory `/tmp/tesseract-ocr-read-only/testing'
make[2]: Nothing to be done for `all'.
make[2]: Leaving directory `/tmp/tesseract-ocr-read-only/testing'
Making all in java
make[2]: Entering directory `/tmp/tesseract-ocr-read-only/java'
make[2]: Nothing to be done for `all'.
make[2]: Leaving directory `/tmp/tesseract-ocr-read-only/java'
Making all in api
make[2]: Entering directory `/tmp/tesseract-ocr-read-only/api'
make[3]: Entering directory `/tmp/tesseract-ocr-read-only/api'
g++ -DHAVE_CONFIG_H -I. -I.. -I../ccutil -I../ccstruct -I../image -I../viewer -I../ccops -I../dict -I../classify -I../ccmain -I../wordrec -I../cutil -I../textord -I/usr/local/include/liblept -g -O2 -MT baseapi.o -MD -MP -MF .deps/baseapi.Tpo -c -o baseapi.o baseapi.cpp
baseapi.cpp: In function ‘int tesseract::IsParagraphBreak(TBOX, TBOX, int, int)’:
baseapi.cpp:712: error: expected ‘;’ before ‘)’ token
make[3]: *** [baseapi.o] Error 1
make[3]: Leaving directory `/tmp/tesseract-ocr-read-only/api'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/tmp/tesseract-ocr-read-only/api'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/tmp/tesseract-ocr-read-only'
make: *** [all] Error 2

Original comment by wdin...@gmail.com on 29 Nov 2009 at 3:30

GoogleCodeExporter commented 9 years ago

Oops, you are right. Line 712 in baseapi.cpp was completely irrelevant, and I
wonder why it was there. Anyway, here's the corrected version of the patch.

Original comment by amkryu...@gmail.com on 29 Nov 2009 at 9:09

Attachments:

tesseract-hocr-fixed.patch

GoogleCodeExporter commented 9 years ago

Thanks for the fast reply... Now though... Hmm... How does one activate this
feature? Following the example from the FAQ of setting a variable, I did this:

I created /usr/share/tesseract/tessdata/configs/hocr
with contents:
tessedit_create_hocr T

and called it like this:
tesseract image.tif outputbase nobatch hocr

to no avail though... 
read_variables_file: Can't open hocr

So... Any pointers?

Thanks...

Original comment by wdin...@gmail.com on 30 Nov 2009 at 2:14

GoogleCodeExporter commented 9 years ago

I think this should work (and actually does work for me). However, since
tesseract can't find the file, I assume you must have placed it in the wrong
location. Are you sure your tessdata directory is /usr/share/tesseract/tessdata/
(and not just /usr/share/tessdata or /usr/local/share/tessdata/)?

Original comment by amkryu...@gmail.com on 30 Nov 2009 at 5:20
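
A quick way to check which of these locations exists and whether it already holds the config file is sketched below. The three directories are the ones named in this thread; the function names `find_tessdata_dirs` and `has_hocr_config` are made up for illustration, and the directory your tesseract build actually reads may differ from all three.

```python
from pathlib import Path

# Candidate tessdata directories mentioned in this thread (assumption: your
# installation may use a different prefix entirely).
CANDIDATE_TESSDATA_DIRS = [
    "/usr/share/tesseract/tessdata",
    "/usr/share/tessdata",
    "/usr/local/share/tessdata",
]


def find_tessdata_dirs(candidates=CANDIDATE_TESSDATA_DIRS):
    """Return the candidate directories that exist on this machine."""
    return [Path(d) for d in candidates if Path(d).is_dir()]


def has_hocr_config(tessdata_dir):
    """Report whether <tessdata_dir>/configs/hocr exists as a regular file."""
    return (Path(tessdata_dir) / "configs" / "hocr").is_file()


if __name__ == "__main__":
    for directory in find_tessdata_dirs():
        status = "has" if has_hocr_config(directory) else "is missing"
        print(f"{directory} {status} configs/hocr")
```

If a directory is reported as missing the file, creating `configs/hocr` there with the single line `tessedit_create_hocr T` matches the setup described above.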

GoogleCodeExporter commented 9 years ago

In my test with hocr2pdf I wound up with decent horizontal placement, but
inverted vertical placement. Output from Cuneiform produced a correct-looking
PDF with hocr2pdf, which makes me believe that this is a bug in this patch. Is
there a program that this output is known to work well with?

Original comment by ere...@gmail.com on 15 Feb 2010 at 6:55

GoogleCodeExporter commented 9 years ago

Ah, you are right. The problem is that in hOCR we should count coordinates from
the top left corner, while tesseract puts the coordinate origin at the bottom
of the page. So please test this version of the patch.

Original comment by amkryu...@gmail.com on 15 Feb 2010 at 1:40

Attachments:

tesseract-hocr-fixed-bbox.patch
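
The coordinate fix discussed above amounts to flipping the vertical axis against the page height. The following is a minimal sketch of that arithmetic, not the code from the patch: the function name `flip_bbox` and the sample numbers are assumptions, and exact pixel conventions (for example off-by-one handling at the edges) may differ in the real implementation.

```python
def flip_bbox(left, bottom, right, top, page_height):
    """Convert a box measured from the bottom of the page to hOCR order.

    Input uses a bottom origin (y grows upward), so top > bottom.
    Output is (x0, y0, x1, y1) with a top-left origin (y grows downward).
    Horizontal coordinates are unchanged; only the vertical axis is flipped.
    """
    return (left, page_height - top, right, page_height - bottom)


# A word near the bottom of a 480-pixel-high page ends up near the bottom
# in top-origin coordinates as well, i.e. with large y values.
print(flip_bbox(10, 20, 110, 50, 480))
```

Without the flip, a consumer such as hocr2pdf would place text that belongs near the bottom of the page near the top, which matches the inverted vertical placement reported earlier.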

GoogleCodeExporter commented 9 years ago

Results are good! Here's my test PDF file. It was created from a page scanned
at 300 dpi, using the SVN version of tesseract patched with your bbox patch,
together with hocr2pdf.

Original comment by ere...@gmail.com on 16 Feb 2010 at 4:46

Attachments:

TestCode.pdf-ocr.pdf

GoogleCodeExporter commented 9 years ago

Applied. Had to remove STL, as it is incompatible with Android.
Thanks.

Original comment by theraysm...@gmail.com on 19 May 2010 at 6:36

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

I am using the latest version of tesseract on Ubuntu and running it like this:

tesseract image.tif outputbase nobatch hocr

but get:

cordoval@cordoval-laptop:~/Downloads$ tesseract luis1.jpg luis.txt hocr
read_variables_file: Can't open hocr
Tesseract Open Source OCR Engine with Leptonica
cordoval@cordoval-laptop:~/Downloads$ less luis.txt.txt

Original comment by cordo...@gmail.com on 26 Nov 2010 at 8:26

GoogleCodeExporter commented 9 years ago

read_variables_file: Can't open hocr -> you do not have an hocr config file.

Original comment by zde...@gmail.com on 27 Nov 2010 at 8:04

GoogleCodeExporter commented 9 years ago

How do I apply a patch? I only downloaded the file
tesseract-hocr-fixed-bbox.patch and I don't know what to do with it... Could
you help me please? Regards

Original comment by diox...@gmail.com on 15 Feb 2011 at 11:42

GoogleCodeExporter commented 9 years ago

With the 'patch' utility. Try searching Google.

Original comment by zde...@gmail.com on 16 Feb 2011 at 7:48

GoogleCodeExporter commented 9 years ago

I installed the book-scanning software "Homer" on Windows and got the message
"read_variables_file: Can't open hocr" in the Tesseract log file. Solution:
check the path variables in the system settings for duplicate Tesseract
installations.

Original comment by conra...@gmail.com on 1 Mar 2013 at 7:47

GoogleCodeExporter commented 9 years ago

Hi

I need to add the Arabic Sakkal Majalla font to tessdata.
How can I do that? Can anyone help me, please?

Original comment by KhokhaAb...@gmail.com on 7 Feb 2014 at 10:17

kcobra / tesseract-ocr

patch to enable hOCR output #263