Closed GoogleCodeExporter closed 9 years ago
Thanks for this. HOCR seems to be a good "standard" format which Cuneiform and some commercial packages support. So I'd rather write code to parse and work with it than any one-off custom output formats...
But... I can't get it to compile after applying your patch.
I grabbed a 3.00 SVN copy of the tesseract code and got it to build earlier. Then I downloaded your patch and applied it, followed by a "make clean" and another "make"...
Which this time does not complete cleanly.
Thoughts? Fedora12 X86_64
Thanks...
..snip..
make[3]: Nothing to be done for `all-am'.
make[3]: Leaving directory `/tmp/tesseract-ocr-read-only/tessdata'
make[2]: Leaving directory `/tmp/tesseract-ocr-read-only/tessdata'
Making all in testing
make[2]: Entering directory `/tmp/tesseract-ocr-read-only/testing'
make[2]: Nothing to be done for `all'.
make[2]: Leaving directory `/tmp/tesseract-ocr-read-only/testing'
Making all in java
make[2]: Entering directory `/tmp/tesseract-ocr-read-only/java'
make[2]: Nothing to be done for `all'.
make[2]: Leaving directory `/tmp/tesseract-ocr-read-only/java'
Making all in api
make[2]: Entering directory `/tmp/tesseract-ocr-read-only/api'
make[3]: Entering directory `/tmp/tesseract-ocr-read-only/api'
g++ -DHAVE_CONFIG_H -I. -I.. -I../ccutil -I../ccstruct -I../image -I../viewer -I../ccops -I../dict -I../classify -I../ccmain -I../wordrec -I../cutil -I../textord -I/usr/local/include/liblept -g -O2 -MT baseapi.o -MD -MP -MF .deps/baseapi.Tpo -c -o baseapi.o baseapi.cpp
baseapi.cpp: In function ‘int tesseract::IsParagraphBreak(TBOX, TBOX, int, int)’:
baseapi.cpp:712: error: expected ‘;’ before ‘)’ token
make[3]: *** [baseapi.o] Error 1
make[3]: Leaving directory `/tmp/tesseract-ocr-read-only/api'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/tmp/tesseract-ocr-read-only/api'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/tmp/tesseract-ocr-read-only'
make: *** [all] Error 2
Original comment by wdin...@gmail.com
on 29 Nov 2009 at 3:30
Oops, you are right. The line 712 in baseapi.cpp was completely irrelevant and I
wonder why it was there. Anyway, here's the corrected version of the patch.
Original comment by amkryu...@gmail.com
on 29 Nov 2009 at 9:09
Attachments:
Thanks for the fast reply... Now though... Hmm... How does one activate this
feature?
Following the example from the FAQ of setting a variable I did this:
I created /usr/share/tesseract/tessdata/configs/hocr
with contents:
tessedit_create_hocr T
and called it like this:
tesseract image.tif outputbase nobatch hocr
to no avail though...
read_variables_file: Can't open hocr
So... Any pointers?
Thanks...
Original comment by wdin...@gmail.com
on 30 Nov 2009 at 2:14
I think this should work (and actually does work for me). However, since tesseract can't find the file, I assume you placed it in the wrong location. Are you sure your tessdata directory is /usr/share/tesseract/tessdata/ (and not just /usr/share/tessdata or /usr/local/share/tessdata/)?
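A minimal sketch of the setup discussed here, assuming the config file only needs the single line quoted earlier in the thread. The function name write_hocr_config and the TESSDATA_DIR variable are hypothetical helpers for this illustration, and the default path is only one of the candidate locations mentioned above; point it at whichever tessdata directory your build actually uses.

```shell
# Sketch: create the "hocr" config file inside <tessdata>/configs.
# TESSDATA_DIR is an assumption for this example, not something tesseract
# is claimed to read here; adjust it to your installation.
TESSDATA_DIR="${TESSDATA_DIR:-/usr/local/share/tessdata}"

write_hocr_config() {
    # $1: tessdata directory; writes $1/configs/hocr
    mkdir -p "$1/configs" &&
    printf 'tessedit_create_hocr T\n' > "$1/configs/hocr"
}

# Usage (system directories may require root):
#   write_hocr_config "$TESSDATA_DIR"
#   tesseract image.tif outputbase nobatch hocr
```

If tesseract still reports "Can't open hocr" afterwards, the file most likely went into a tessdata directory other than the one the binary is using.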
Original comment by amkryu...@gmail.com
on 30 Nov 2009 at 5:20
In my test with hocr2pdf I wound up with decent horizontal placement, but inverted vertical placement. Output from Cuneiform produced a correct-looking PDF with hocr2pdf, which makes me believe that this is a bug in this patch. Is there a program that this output is known to work well with?
Original comment by ere...@gmail.com
on 15 Feb 2010 at 6:55
Ah, you are right. The problem is that in hOCR we should count coordinates from the top left corner, while tesseract puts the coordinate origin at the bottom left of the page. So please test this version of the patch.
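The vertical flip described above can be sketched with shell arithmetic. This is only an illustration of the coordinate conversion, not the code from the patch; the function name flip_bbox, the argument order, and the sample numbers are assumptions made for this example.

```shell
# Convert a box given as "left bottom right top" (origin at the bottom left,
# y growing upward) into "x0 y0 x1 y1" with the origin at the top left
# (y growing downward), which is the order hOCR bbox values use.
# $1: page height in pixels; $2 $3 $4 $5: left bottom right top
flip_bbox() {
    echo "$2 $(( $1 - $5 )) $4 $(( $1 - $3 ))"
}

# Hypothetical example: page height 1000, box left=100 bottom=200 right=300 top=250
flip_bbox 1000 100 200 300 250   # prints: 100 750 300 800
```

The horizontal values pass through unchanged, which matches the symptom reported earlier: horizontal placement was fine and only the vertical placement was inverted.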
Original comment by amkryu...@gmail.com
on 15 Feb 2010 at 1:40
Attachments:
Results are good! Here's my test PDF file. It was created with the SVN version of tesseract patched with your bbox patch, plus hocr2pdf, from a page scanned at 300 dpi.
Original comment by ere...@gmail.com
on 16 Feb 2010 at 4:46
Attachments:
Applied. Had to remove STL, as it is incompatible with Android.
Thanks.
Original comment by theraysm...@gmail.com
on 19 May 2010 at 6:36
I am using the latest version of tesseract on Ubuntu and running it like this:
tesseract image.tif outputbase nobatch hocr
but get:
cordoval@cordoval-laptop:~/Downloads$ tesseract luis1.jpg luis.txt hocr
read_variables_file: Can't open hocr
Tesseract Open Source OCR Engine with Leptonica
cordoval@cordoval-laptop:~/Downloads$ less luis.txt.txt
Original comment by cordo...@gmail.com
on 26 Nov 2010 at 8:26
read_variables_file: Can't open hocr -> you do not have an hocr config file.
Original comment by zde...@gmail.com
on 27 Nov 2010 at 8:04
How do I apply a patch? I only downloaded the file tesseract-hocr-fixed-bbox.patch and I don't know what to do with it... Could you help me please? Regards
Original comment by diox...@gmail.com
on 15 Feb 2011 at 11:42
With the program/utility 'patch'. Try searching Google.
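A minimal sketch of applying a patch file with the 'patch' utility. The helper name apply_patch is hypothetical, and the strip level (-p0 here) is an assumption: it depends on how the paths inside the patch file were written, so a dry run first is a sensible check.

```shell
# Apply a patch file from the root of a source tree.
# $1: source tree root
# $2: absolute path to the patch file
# $3: strip level for -p (defaults to 0; an assumption, adjust as needed)
apply_patch() {
    ( cd "$1" && patch -p"${3:-0}" < "$2" )
}

# Usage (paths are placeholders):
#   cd /path/to/tesseract-source
#   patch -p0 --dry-run < /path/to/tesseract-hocr-fixed-bbox.patch   # check only
#   apply_patch /path/to/tesseract-source /path/to/tesseract-hocr-fixed-bbox.patch
#   make
```

If the dry run reports that it cannot find the files to patch, trying -p1 instead of -p0 is the usual next step.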
Original comment by zde...@gmail.com
on 16 Feb 2011 at 7:48
I installed the book-scanning software "Homer" on Windows and got the "read_variables_file: Can't open hocr" message in the Tesseract log file. Solution: check the path variables in the system settings for duplicate tesseract installations.
Original comment by conra...@gmail.com
on 1 Mar 2013 at 7:47
Hi,
I need to add the Arabic Sakkal Majalla font to tessdata.
How can I do that? Can anyone help me please?
Original comment by KhokhaAb...@gmail.com
on 7 Feb 2014 at 10:17
Original issue reported on code.google.com by
amkryu...@gmail.com
on 22 Nov 2009 at 4:31
Attachments: