internetarchive / archive-pdf-tools

Fast PDF generation and compression. Deals with millions of pages daily.
https://archive-pdf-tools.readthedocs.io/en/latest/
GNU Affero General Public License v3.0

Recode does not merge hocr into pdf #69

Open jcuenod opened 1 year ago

jcuenod commented 1 year ago

For the sake of testing, I'm just trying to get this working with one page:

recode_pdf --from-imagestack 'doc/page_5_fixed.png' \
    --hocr-file doc/page_5_fixed.hocr \
    -o output.pdf

The output pdf does not have the text layer from the hocr file. The hocr is generated by gcv -> hocr tools.

Any idea what I might be doing wrong?

MerlijnWajer commented 1 year ago

Thanks for the report. Do you have an example hOCR file as generated by the gcv -> hocr tools?

jcuenod commented 1 year ago

Sure, here are the files I was working with above. GitHub doesn't accept the hocr file extension, so it's a txt now.

page_5_fixed page_5_fixed.txt

MerlijnWajer commented 1 year ago

@jcuenod - I just saw the reply in my inbox, really sorry I didn't see this earlier! I'll investigate tomorrow.

MerlijnWajer commented 1 year ago

I figured out the problem: the line elements sit directly in an ocr_carea, without being wrapped in an additional ocrx_block or ocr_par, so the logical elements the parser looks for are missing.
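To check whether another hOCR file has the same shape, a quick count of elements per class is usually enough. The snippet below is a minimal sketch using only the Python standard library; the sample markup and the helper name `count_by_class` are made up for illustration and are not part of archive-hocr-tools.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal hOCR page: the line sits directly inside ocr_carea,
# with no ocr_par or ocrx_block wrapper around it.
SAMPLE = """<div class="ocr_page" title="bbox 0 0 100 100">
  <div class="ocr_carea" title="bbox 0 0 100 50">
    <span class="ocr_line" title="bbox 0 0 100 20">
      <span class="ocrx_word" title="bbox 0 0 40 20">Hello</span>
      <span class="ocrx_word" title="bbox 50 0 100 20">world</span>
    </span>
  </div>
</div>"""


def count_by_class(page, cls):
    """Count descendants of `page` whose class attribute equals `cls` exactly."""
    return len(page.findall(f'.//*[@class="{cls}"]'))


page = ET.fromstring(SAMPLE)
counts = {
    cls: count_by_class(page, cls)
    for cls in ("ocr_par", "ocrx_block", "ocr_carea", "ocr_line")
}
```

If `ocr_par` and `ocrx_block` both come back as zero while `ocr_line` does not, a loop that only visits those two classes never reaches the lines.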

With this simple change to archive-hocr-tools (https://github.com/internetarchive/archive-hocr-tools) the text finding works, and I think at that point the PDF generation will work too.


diff --git a/hocr/parse.py b/hocr/parse.py
index 0b45a2c..e0d6e6b 100644
--- a/hocr/parse.py
+++ b/hocr/parse.py
@@ -314,7 +314,8 @@ def hocr_page_to_word_data_fast(hocr_page):

     has_ocrx_cinfo = 0

-    for par in hocr_page.findall('.//*[@class="ocr_par"]') + hocr_page.findall('.//*[@class="ocrx_block"]'):
+    for par in hocr_page.findall('.//*[@class="ocr_par"]') + hocr_page.findall('.//*[@class="ocrx_block"]') + hocr_page.findall('.//*[@class="ocr_carea"]'):

I need to take a moment to figure out if this is the right change and ensure things don't break elsewhere. There is some discussion here on the tag too https://github.com/kba/hocr-spec/issues/28

MerlijnWajer commented 9 months ago

I have been kind of busy, but the patch above will cause problems for other documents (although it might work for you), so the fix is a little more complicated. I will try to get a proper fix in place for this. It seems like more users are hitting this.

Since I have switched away from lxml I've been running into some limitations of the XPath support in the Python standard library, so this might take a bit more trickery to get right. The good news is that I've at least added some tests in the past months, so we could add the hOCR version of your document to the tests when this is fixed, assuming that's OK with you.
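For illustration only, here is one possible way to handle both layouts with the standard library. It is a sketch and not the fix that went into archive-hocr-tools. It assumes, without confirmation from this thread, that the trouble with simply adding ocr_carea to the loop is that some documents nest ocr_par inside ocr_carea, so the same lines would be visited twice. ElementTree elements have no parent pointer, so the sketch builds a parent map and picks the nearest enclosing container for each line. The names `naive_containers` and `line_containers` and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET

CONTAINER_CLASSES = ("ocr_par", "ocrx_block", "ocr_carea")

# Hypothetical samples: NESTED wraps the line in ocr_par inside ocr_carea,
# FLAT puts the line directly inside ocr_carea.
NESTED = ('<div class="ocr_page"><div class="ocr_carea"><p class="ocr_par">'
          '<span class="ocr_line"><span class="ocrx_word">Hi</span></span>'
          '</p></div></div>')
FLAT = ('<div class="ocr_page"><div class="ocr_carea">'
        '<span class="ocr_line"><span class="ocrx_word">Hi</span></span>'
        '</div></div>')


def naive_containers(page):
    """Concatenate one findall per class; nested containers are both returned."""
    found = []
    for cls in CONTAINER_CLASSES:
        found += page.findall(f'.//*[@class="{cls}"]')
    return found


def line_containers(page):
    """Return the nearest enclosing container of each ocr_line, without repeats."""
    # ElementTree has no getparent(), so map every child to its parent first.
    parent_of = {child: parent for parent in page.iter() for child in parent}
    containers, seen = [], set()
    for line in page.findall('.//*[@class="ocr_line"]'):
        node = parent_of.get(line)
        # Walk upwards until a container class is found or the root is passed.
        while node is not None and node.get("class") not in CONTAINER_CLASSES:
            node = parent_of.get(node)
        if node is not None and id(node) not in seen:
            seen.add(id(node))
            containers.append(node)
    return containers


nested_page = ET.fromstring(NESTED)
flat_page = ET.fromstring(FLAT)
```

On the nested sample the naive concatenation returns both the ocr_par and the surrounding ocr_carea, while `line_containers` returns only the ocr_par. On the flat sample it falls back to the ocr_carea.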

jcuenod commented 9 months ago

That's fine with me. Thanks for your work on this!