Tess 3.01 hocr output not compatible with pdfbeads 1.0.8

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Use Tess hocr output to make html files
2. Use pdfbeads with those html files, get crash

What is the expected output? What do you see instead?
It is expected to work.

What version of the product are you using? On what operating system?
Tess 3.01 and pdfbeads 1.0.8 on Windows 7.

Please provide any additional information below.
<quote>
https://github.com/steelThread/mimeograph/commit/b29af3338e8f15b22392...
pdfbeads currently doesn't work with hOCR output generated by tesseract v3.01.
the owner of pdfbeads doesn't want to enhance pdfbeads to work with the existing
tesseract hOCR output because tesseract's hOCR output is not properly following
the hOCR.
while i totally understand that position, tesseract release about once a year
and mimeo needs to work now
</quote>

Can you guys kiss and make up? 

Until then I have been forced to make a little hack to pdfbeads so that it
reads the position from ocr_word and the word text from ocrx_word, which
lets it read the Tess 3.01 hocr input. It seems that pdfbeads is
expecting both attributes to be in ocrx_word (the way it was in
Tess 3.0?).
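
For anyone who wants to try something similar, here is a minimal Ruby sketch of the idea. It is not the actual patch. It assumes the Tess 3.01 layout described above, where an outer ocr_word span carries the bbox in its title attribute and an inner ocrx_word span carries the text. The sample markup, the regular expression, and the extract_words helper are all hypothetical and are not part of pdfbeads.

```ruby
# Hypothetical sample of the nested layout: the bounding box sits on the
# outer ocr_word span, the recognized text inside the inner ocrx_word span.
SAMPLE_HOCR = <<~HTML
  <span class='ocr_line' id='line_1' title="bbox 36 92 580 122">
  <span class='ocr_word' id='word_1' title="bbox 36 92 96 116"><span class='ocrx_word' id='xword_1' title="x_wconf -2">Some</span></span>
  <span class='ocr_word' id='word_2' title="bbox 109 92 180 116"><span class='ocrx_word' id='xword_2' title="x_wconf -1">Story</span></span>
  </span>
HTML

# Matches an ocr_word span with a bbox, immediately followed by its
# inner ocrx_word span; captures the four coordinates and the text.
WORD_RE = /<span\s+class=['"]ocr_word['"][^>]*title=['"][^'"]*bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)[^'"]*['"][^>]*>\s*<span\s+class=['"]ocrx_word['"][^>]*>(.*?)<\/span>/m

# Returns one hash per word, pairing the outer bbox with the inner text.
def extract_words(hocr)
  hocr.scan(WORD_RE).map do |x0, y0, x1, y1, text|
    { bbox: [x0, y0, x1, y1].map(&:to_i), text: text.strip }
  end
end

extract_words(SAMPLE_HOCR).each do |word|
  puts "#{word[:text]} at #{word[:bbox].inspect}"
end
```

A regular expression keeps the sketch dependency-free; a real patch would more likely walk the parsed HTML tree.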

Original issue reported on code.google.com by g...@folkplanet.com on 22 May 2012 at 4:11

GoogleCodeExporter commented 9 years ago

I cannot reproduce it on Linux:

$ tesseract phototest.tif phototest hocr
Tesseract Open Source OCR Engine v3.01 with Leptonica
Page 0

$ pdfbeads photo* >pdfbeads_phototest.pdf
Prepared data for processing phototest.tif
/usr/lib64/ruby/gems/1.8/gems/pdfbeads-1.0.9/lib/pdfbeads/pdfpage.rb:445: warning: Insecure world writable dir /mnt/home/test/tess in PATH, mode 040777
JBIG2 compression has been requested, but the encoder is not available.
  I'll use CCITT Group 4 fax compression instead.
Processed phototest.tif

See attachment. 

Can you please post the error message from pdfbeads?

Original comment by zde...@gmail.com on 22 May 2012 at 5:50

Attachments:

GoogleCodeExporter commented 9 years ago

I just tried upgrading pdfbeads to 1.0.9.

Changelog says:
pdfbeads 1.0.9
    * Don't attempt to use 'ocrx_word' elements which contain no bounding box
      data (this should fix the problem with the hOCR output produced by some
      tesseract versions).

diff says:
diff pdfbuilder.rb (1.0.8 vs 1.0.9)
467a489
>         next if bbox == [0,0,0,0]

The net effect avoids the crashing.

But it does not use the word-level position information in the pdf,
which means that the guesstimation is spread out over the
entire line rather than only within a word, so the highlight
and searching cut/paste still work -- but it looks much funkier.

When I re-apply my patch to 1.0.9, I get word-level accuracy
for the beginning of each word back again.
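
To illustrate why a line-level estimate looks worse than real word boxes, here is a hypothetical Ruby sketch that splits a line's bounding box among its words in proportion to character counts. This is only an assumption about what such a guesstimate could look like, not the algorithm pdfbeads actually uses; guess_word_boxes and its parameters are made up for the example.

```ruby
# Hypothetical illustration, not pdfbeads code: estimate each word's
# horizontal extent by dividing the line's bounding box in proportion
# to character counts, counting one separating space between words.
def guess_word_boxes(line_bbox, words)
  return [] if words.empty?

  x0, y0, x1, y1 = line_bbox
  total_chars = words.sum(&:length) + (words.length - 1)
  return [] if total_chars <= 0

  char_width = (x1 - x0).to_f / total_chars
  cursor = 0
  words.map do |word|
    left = x0 + (cursor * char_width).round
    right = x0 + ((cursor + word.length) * char_width).round
    cursor += word.length + 1 # skip the separating space
    { text: word, bbox: [left, y0, right, y1] }
  end
end

p guess_word_boxes([0, 0, 100, 10], %w[ab cd])
```

An estimate like this assumes evenly spaced characters, so any line with irregular gaps between words will get boxes that drift away from the printed text.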

I looked at the source code and for cuneiform hocr,
which provides char-level positions, pdfbeads just gathers 
the chars in the line back into words if it can.
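
A rough Ruby sketch of that kind of regrouping, assuming each character arrives with its own text and bbox and that whitespace characters mark word boundaries. The HocrChar struct and the helper names are hypothetical, and the real pdfbeads logic may differ.

```ruby
# Hypothetical character record: one recognized character and its
# bounding box as [x0, y0, x1, y1].
HocrChar = Struct.new(:text, :bbox)

def blank_char?(char)
  char.text.strip.empty?
end

# Smallest box that contains all the given boxes.
def union_bbox(boxes)
  [boxes.map { |b| b[0] }.min, boxes.map { |b| b[1] }.min,
   boxes.map { |b| b[2] }.max, boxes.map { |b| b[3] }.max]
end

# Groups consecutive non-blank characters into words and gives each
# word the union of its characters' bounding boxes.
def gather_words(chars)
  chars
    .chunk_while { |a, b| !blank_char?(a) && !blank_char?(b) }
    .reject { |group| blank_char?(group.first) }
    .map { |group| { text: group.map(&:text).join, bbox: union_bbox(group.map(&:bbox)) } }
end

line = [
  HocrChar.new("a", [0, 0, 5, 10]),
  HocrChar.new("b", [6, 1, 10, 9]),
  HocrChar.new(" ", [10, 0, 20, 10]),
  HocrChar.new("c", [20, 0, 25, 10])
]
p gather_words(line)
```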

When outputting the pdf it takes each "word" (which is typically
a word but can be an entire line or just one character depending on 
various circumstances) and outputs the "word" position and text in postscript.
It has to do some fiddling to convert the chars from utf8 to utf16.
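
As a small illustration of the utf8-to-utf16 step, this Ruby sketch converts a UTF-8 string to UTF-16BE and prints it as hex digits. The helper name utf16be_hex is hypothetical, and how pdfbeads actually writes the converted text into the pdf is not shown here.

```ruby
# Converts a UTF-8 string to UTF-16BE and returns the bytes as
# uppercase hex digits (four digits per BMP character).
def utf16be_hex(text)
  text.encode("UTF-16BE").unpack("H*").first.upcase
end

puts utf16be_hex("wo")
puts utf16be_hex("\u00e9")
```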

Adobe viewer is quite flexible in its searching and will allow me
to search for fragments of words and across word-boundaries,
and even fragments of two words across a boundary.
e.g.
word1
word1 word2
ord1
wo
ord1 wo

Since it is so flexible, maybe it would work to have character-level 
positioning.
Obviously such a hidden text layer would take a bit more disk space 
if it is storing the horizontal position of each letter. 

Until Tess and pdfbeads get together on the hocr output format,
this is not going to improve.  Tess 3.01 at least is stuck with
word-position-averaged-over-the-entire-line.  

For whatever reason, my table-of-contents page in my scanned book pdf
comes out just fine with my patch, but makes horrible highlight blocks
that are wrong without it (too high and wide).  
The contents page has lines like this:
 Some Story    .       .       .       24

Original comment by g...@folkplanet.com on 22 May 2012 at 6:52

GoogleCodeExporter commented 9 years ago

Hello, zde...@gmail,

Thank you for responding so quickly.

The answer is explained in my last post.
You are using 1.0.9, which has a quick hack
in it to keep from choking on 'ocrx_word' elements
which contain no bounding box. (Because as of Tess 3.01
you are storing that info in a containing
ocr_word span element.)

As for the error, what was happening is that the
variable ratio on elements with no bbox was coming out 0,
and in various places the code divides by
ratio to scale the positions. The divide by 0 produced
an error when .to_i tried to operate on NaN (not a number).
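
The failure mode is easy to reproduce in plain Ruby. The sketch below is not pdfbeads code; scale and safe_scale are hypothetical helpers that only show how a float division by a zero ratio yields NaN and how .to_i then raises FloatDomainError.

```ruby
# Scales a position by a ratio, the way the description above suggests.
# With float operands, 0.0 / 0.0 is NaN and NaN.to_i raises.
def scale(position, ratio)
  (position / ratio).to_i
end

# Guarded variant: skips the division when the ratio is zero.
def safe_scale(position, ratio)
  return nil if ratio.zero?

  (position / ratio).to_i
end

begin
  scale(0.0, 0.0)
rescue FloatDomainError => e
  puts "scale failed: #{e.class}"
end

p safe_scale(10.0, 0.0)
p safe_scale(10.0, 2.0)
```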

But even though you are no longer crashing with 1.0.9,
and it is better than nothing, it still is not producing
word-level accurate positioning the way it used to in 3.00.

My simple patch at least makes Tess3.01 work with pdfbeads,
and works for both 1.0.8 and 1.0.9.
I do not know if my patch is compatible with 
other versions of Tess, cuneiform, etc,
which is why at this point it's just a handy tweak
for those that need it.

Original comment by g...@folkplanet.com on 22 May 2012 at 7:04

GoogleCodeExporter commented 9 years ago

I see that there was earlier discussion about this by Carlos in April:

http://groups.google.com/group/tesseract-ocr/browse_thread/thread/6d30401001068920

Original comment by g...@folkplanet.com on 22 May 2012 at 7:17

GoogleCodeExporter commented 9 years ago

Can you create a simple test case (as I did) that will demonstrate the problem?

Original comment by zde...@gmail.com on 22 May 2012 at 7:24

GoogleCodeExporter commented 9 years ago

You must use 1.0.8 to see the crash.

If you use 1.0.9 and look at it with your pdf viewer,
you will see the inaccurate word-boundaries.

Original comment by g...@folkplanet.com on 22 May 2012 at 7:32

GoogleCodeExporter commented 9 years ago

I see mention of ocrx_word only in that hocr document Carlos mentioned.
That document was last updated in 2010.
I do not see ocr_word defined anywhere.
Is it known to be part of the standard?
Does it provide some tremendous benefit?

Original comment by g...@folkplanet.com on 22 May 2012 at 7:35

GoogleCodeExporter commented 9 years ago

I hope people won't be switching to Cuneiform just because
it seems to work better with pdfbeads:

http://scruss.com/blog/category/computers-suck/

Original comment by g...@folkplanet.com on 23 May 2012 at 9:43

GoogleCodeExporter commented 9 years ago

Well, ocrx_word is part of the hocr spec, so it should not be a problem for pdfbeads.

The other issue is ocr_word (not part of hocr) - I need to find out if this is a
mistake or if there is something behind it...

Original comment by zde...@gmail.com on 23 May 2012 at 9:51

GoogleCodeExporter commented 9 years ago

I found that after deleting some stray text from an hocr output .html file,
it left an empty word, and even an empty line.

I think that it may be hard for users to figure out
which sections to remove for this kind of post-editing.

So I tweaked the pdfbeads code to simply tolerate and ignore
words or entire lines whose content was equal to the empty string "".
Very easy, works well. 
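
A sketch of that kind of filtering in Ruby, assuming the hocr has already been parsed into lines that each hold a list of words. The hash layout and the drop_empty_entries name are hypothetical and do not reflect the actual pdfbeads data structures.

```ruby
# Removes words whose text is the empty string, then removes any line
# that is left with no words at all.
def drop_empty_entries(lines)
  lines
    .map { |line| line.merge(words: line[:words].reject { |word| word[:text].to_s.empty? }) }
    .reject { |line| line[:words].empty? }
end

page = [
  { bbox: [36, 92, 580, 122], words: [{ text: "Some" }, { text: "" }] },
  { bbox: [36, 130, 580, 160], words: [{ text: "" }] }
]
p drop_empty_entries(page)
```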

I found out about it because it was making a pdf that
adobe reader complained about.  After my fix, no complaints.

Original comment by g...@folkplanet.com on 26 May 2012 at 6:58

GoogleCodeExporter commented 9 years ago

Here's my pdf if anyone wants to see it.

http://folkplanet.com/seanchlo/gortoir/GortOir.pdf

Original comment by g...@folkplanet.com on 26 May 2012 at 7:00

GoogleCodeExporter commented 9 years ago

Closing issue because:
a) pdfbeads 1.0.9 works with tesseract-ocr hocr output
b) current svn (r729) fixed several issues regarding hocr conformity

Original comment by zde...@gmail.com on 29 May 2012 at 9:18

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

Here is my pdfbuilder.rb diff.

This contains my fixes to use Tess3.01-specific hocr output
with crisp word-start boundaries,
as well as tolerate empty word or line in hocr output.

Original comment by g...@folkplanet.com on 30 May 2012 at 4:22

Attachments:

pdfbuilder.diff

gnewtothis101 / tesseract-ocr

Tess 3.01 hocr output not compatible with pdfbeads 1.0.8 #711