I cannot reproduce it on Linux:
$ tesseract phototest.tif phototest hocr
Tesseract Open Source OCR Engine v3.01 with Leptonica
Page 0
$ pdfbeads photo* >pdfbeads_phototest.pdf
Prepared data for processing phototest.tif
/usr/lib64/ruby/gems/1.8/gems/pdfbeads-1.0.9/lib/pdfbeads/pdfpage.rb:445:
warning: Insecure world writable dir /mnt/home/test/tess in PATH, mode 040777
JBIG2 compression has been requested, but the encoder is not available.
I'll use CCITT Group 4 fax compression instead.
Processed phototest.tif
See attachment.
Can you please post the error message from pdfbeads?
Original comment by zde...@gmail.com
on 22 May 2012 at 5:50
Attachments:
I just tried upgrading pdfbeads to 1.0.9.
Changelog says:
pdfbeads 1.0.9
* Don't attempt to use 'ocrx_word' elements which contain no bounding box
data (this should fix the problem with the hOCR output produced by some
tesseract versions).
diff says:
diff pdfbuilder.rb (1.0.8 vs 1.0.9)
467a489
> next if bbox == [0,0,0,0]
The net effect is that it no longer crashes.
But it does not put the word-level position information into the pdf,
which means that the guesstimated positions are spread out over the
entire line rather than only within a word. Highlighting, searching,
and cut/paste still work, but the result looks much funkier.
When I re-apply my patch to 1.0.9, I get word-level accuracy
for the beginning of each word back again.
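To show what "spread out over the entire line" means, here is a minimal sketch. It is not pdfbeads' actual code: the method name and the rule of dividing the line width in proportion to character count are my assumptions, for illustration only.

```ruby
# Hypothetical helper (not from pdfbeads): guess a box for each word when
# only the line's bounding box [x0, y0, x1, y1] is known. The line width is
# split in proportion to character count, with one character per space.
def guess_word_boxes(line_bbox, words)
  x0, y0, x1, y1 = line_bbox
  total_chars = words.join(" ").length
  per_char = (x1 - x0).to_f / total_chars
  pos = 0
  words.map do |w|
    left  = x0 + (pos * per_char).round
    right = x0 + ((pos + w.length) * per_char).round
    pos += w.length + 1          # skip the word and the following space
    [left, y0, right, y1]
  end
end
```

Every word gets the full height of the line, and the horizontal guess drifts whenever glyph widths or gaps are uneven. A real per-word bbox avoids both problems.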
I looked at the source code and for cuneiform hocr,
which provides char-level positions, pdfbeads just gathers
the chars in the line back into words if it can.
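Roughly, that gathering step could look like the sketch below. This is my own illustration under assumptions, not the pdfbeads source: the input format (an array of [char, bbox] pairs) and the method name are made up.

```ruby
# Hypothetical sketch: merge per-character boxes into per-word boxes.
# Input: array of [char, [x0, y0, x1, y1]] pairs for one line.
def group_chars_into_words(chars)
  words = []
  current = nil
  chars.each do |ch, bbox|
    if ch =~ /\s/
      current = nil              # whitespace ends the current word
      next
    end
    if current.nil?
      current = { text: ch.dup, bbox: bbox.dup }
      words << current
    else
      current[:text] << ch
      b = current[:bbox]
      # Grow the word box so it covers the new character too.
      current[:bbox] = [[b[0], bbox[0]].min, [b[1], bbox[1]].min,
                        [b[2], bbox[2]].max, [b[3], bbox[3]].max]
    end
  end
  words
end
```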
When outputting the pdf it takes each "word" (which is typically
a word but can be an entire line or just one character depending on
various circumstances) and outputs the "word" position and text in postscript.
It has to do some fiddling to convert the chars from utf8 to utf16.
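For reference, a generic UTF-8 to UTF-16 conversion in current Ruby can be as short as the sketch below. The method name is mine, and I am not claiming this is how pdfbeads does it or how it writes the bytes into the pdf.

```ruby
# Hypothetical helper: convert a UTF-8 string to UTF-16BE and return the
# bytes as a lowercase hex string.
def utf16be_hex(text)
  text.encode("UTF-16BE").unpack("H*").first
end
```

Each character in the basic plane becomes two bytes, so "A" turns into 0041.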
Adobe viewer is quite flexible in its searching and will allow me
to search for fragments of words and across word-boundaries,
and even fragments of two words across a boundary.
e.g.
word1
word1 word2
ord1
wo
ord1 wo
Since it is so flexible, maybe it would work to have character-level
positioning.
Obviously such a hidden text layer would take a bit more disk space
if it is storing the horizontal position of each letter.
Until Tess and pdfbeads get together on the hocr output format,
this is not going to improve. Tess 3.01 at least is stuck with
word-position-averaged-over-the-entire-line.
For whatever reason, my table-of-contents page in my scanned book pdf
comes out just fine with my patch, but makes horrible highlight blocks
that are wrong without it (too high and wide).
The contents page has lines like this:
Some Story . . . 24
Original comment by g...@folkplanet.com
on 22 May 2012 at 6:52
Hello, zde...@gmail,
Thank you for responding so quickly.
The answer is explained in my last post.
You are using 1.0.9, which has a quick hack
in it to keep from choking on 'ocrx_word' elements
that contain no bounding box. (As of Tess 3.01,
that info is stored in a containing
ocr_word span element instead.)
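The idea of taking the bbox from the containing span can be sketched as follows. This is not the attached patch: the method names are hypothetical, and the title strings in the usage note are invented examples of the hOCR "bbox x0 y0 x1 y1" property.

```ruby
# Parse "bbox x0 y0 x1 y1" out of an hOCR title attribute; nil if absent.
def parse_bbox(title)
  m = /bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/.match(title.to_s)
  m && m.captures.map(&:to_i)
end

# Hypothetical fallback: use the word element's own bbox if it has one,
# otherwise the bbox of the span that contains it.
def word_bbox(own_title, parent_title)
  parse_bbox(own_title) || parse_bbox(parent_title)
end
```

With an inner title of "x_wconf -2" and an outer title of "bbox 36 92 96 116", word_bbox returns the outer box instead of nothing.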
As for the error, what was happening is that the
variable ratio came out as 0 on elements with no bbox.
Various places in the code divide by ratio to scale
the positions, and the divide by 0 produced an error
when .to_i tried to operate on NaN (not a number).
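The failure mode is easy to reproduce in plain Ruby. The sketch below uses made-up method names and is not the pdfbeads code; it only shows that float division by zero is silent and that the error appears later, at .to_i.

```ruby
# Float division by zero does not raise; it yields Infinity or NaN.
# The exception (FloatDomainError) only shows up at .to_i.
def scale(value, ratio)
  (value / ratio).to_i
end

# Hypothetical guard: skip scaling when the ratio is zero.
def safe_scale(value, ratio)
  return nil if ratio.zero?
  scale(value, ratio)
end
```

Calling scale(0, 0.0) raises FloatDomainError, while safe_scale(0, 0.0) returns nil so the caller can skip the element.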
But even though you are no longer crashing with 1.0.9,
and it is better than nothing, it still is not producing
word-level accurate positioning the way it used to in 3.00.
My simple patch at least makes Tess3.01 work with pdfbeads,
and works for both 1.0.8 and 1.0.9.
I do not know if my patch is compatible with
other versions of Tess, cuneiform, etc,
which is why at this point it's just a handy tweak
for those that need it.
Original comment by g...@folkplanet.com
on 22 May 2012 at 7:04
I see that there was earlier discussion about this by Carlos in April:
http://groups.google.com/group/tesseract-ocr/browse_thread/thread/6d30401001068920
Original comment by g...@folkplanet.com
on 22 May 2012 at 7:17
Can you create a simple test case (as I did) that demonstrates the problem?
Original comment by zde...@gmail.com
on 22 May 2012 at 7:24
You must use 1.0.8 to see the crash.
If you use 1.0.9 and look at it with your pdf viewer,
you will see the inaccurate word-boundaries.
Original comment by g...@folkplanet.com
on 22 May 2012 at 7:32
I see ocrx_word mentioned only in that hocr document Carlos mentioned.
That document was last updated in 2010.
I do not see ocr_word defined anywhere.
Is it known to be part of the standard?
Does it provide some tremendous benefit?
Original comment by g...@folkplanet.com
on 22 May 2012 at 7:35
I hope people won't be switching to Cuneiform just because
it seems to work better with pdfbeads:
http://scruss.com/blog/category/computers-suck/
Original comment by g...@folkplanet.com
on 23 May 2012 at 9:43
Well, ocrx_word is part of the hocr spec, so it should not be a problem for pdfbeads.
The other issue is ocr_word (not part of hocr) - I need to find out if this is a
mistake or if there is something behind it...
Original comment by zde...@gmail.com
on 23 May 2012 at 9:51
I found that after deleting some stray text from an hocr output .html file,
it left an empty word, and even an empty line.
I think that it may be hard for users to figure out
which sections to remove for this kind of post-editing.
So I tweaked the pdfbeads code to simply tolerate and ignore
words or entire lines whose content was equal to the empty string "".
Very easy, works well.
I found out about it because it was making a pdf that
Adobe Reader complained about. After my fix, no complaints.
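The idea, as a sketch rather than my actual diff (the data layout and method name here are invented for illustration):

```ruby
# Hypothetical layout: each line is { bbox: [...], words: [{ text:, bbox: }, ...] }.
# Drop words whose text is the empty string, then drop lines left with no words.
def drop_empty(lines)
  kept = lines.map do |line|
    words = line[:words].reject { |w| w[:text].to_s.empty? }
    line.merge(words: words)
  end
  kept.reject { |line| line[:words].empty? }
end
```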
Original comment by g...@folkplanet.com
on 26 May 2012 at 6:58
Here's my pdf if anyone wants to see it.
http://folkplanet.com/seanchlo/gortoir/GortOir.pdf
Original comment by g...@folkplanet.com
on 26 May 2012 at 7:00
Closing issue because:
a) pdfbeads 1.0.9 works with tesseract-ocr hocr output
b) current svn (r729) fixed several issues regarding hocr conformity
Original comment by zde...@gmail.com
on 29 May 2012 at 9:18
Here is my pdfbuilder.rb diff.
This contains my fixes to use Tess3.01-specific hocr output
with crisp word-start boundaries,
as well as tolerate empty word or line in hocr output.
Original comment by g...@folkplanet.com
on 30 May 2012 at 4:22
Attachments:
Original issue reported on code.google.com by
g...@folkplanet.com
on 22 May 2012 at 4:11