Open jcuenod opened 1 year ago
Thanks for the report. Do you have an example hOCR file as generated by the gcv->hocr tools?
Sure, here are the files I was working with above. GitHub doesn't accept the hocr file extension, so it's a txt now.
@jcuenod - I just saw the reply in my inbox, really sorry I didn't see this earlier! I'll investigate tomorrow.
I figured out the problem: the line elements are in an ocr_carea, without being wrapped in an additional ocrx_block or ocr_par, so the logical elements are missing.
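A quick way to check a file for this structure is to count elements per hOCR class. This is only a sketch using the Python standard library; the sample markup and the `count_classes` helper are made up for illustration and are not taken from the attached file or from archive-hocr-tools.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal page: lines sit directly inside ocr_carea,
# with no ocr_par or ocrx_block wrapper around them.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<div class="ocr_page" title="bbox 0 0 100 100">
 <div class="ocr_carea" title="bbox 0 0 100 50">
  <span class="ocr_line" title="bbox 0 0 100 20">
   <span class="ocrx_word" title="bbox 0 0 40 20">Hello</span>
   <span class="ocrx_word" title="bbox 50 0 100 20">world</span>
  </span>
 </div>
</div>
</body></html>"""


def count_classes(root, names):
    """Count descendants of root whose class attribute equals each name."""
    return {name: len(root.findall('.//*[@class="%s"]' % name)) for name in names}


root = ET.fromstring(SAMPLE)
counts = count_classes(root, ["ocr_carea", "ocr_par", "ocrx_block", "ocr_line"])
for name, n in counts.items():
    print(name, n)
```

If ocr_par and ocrx_block both come back as zero while ocr_carea and ocr_line do not, the file has the layout described above.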
With this simple change to archive-hocr-tools (https://github.com/internetarchive/archive-hocr-tools) the text finding works, and I think at that point the PDF generation will work too.
diff --git a/hocr/parse.py b/hocr/parse.py
index 0b45a2c..e0d6e6b 100644
--- a/hocr/parse.py
+++ b/hocr/parse.py
@@ -314,7 +314,8 @@ def hocr_page_to_word_data_fast(hocr_page):
has_ocrx_cinfo = 0
- for par in hocr_page.findall('.//*[@class="ocr_par"]') + hocr_page.findall('.//*[@class="ocrx_block"]'):
+ for par in hocr_page.findall('.//*[@class="ocr_par"]') + hocr_page.findall('.//*[@class="ocrx_block"]') + hocr_page.findall('.//*[@class="ocr_carea"]'):
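To illustrate what the widened loop changes, here is a small standalone sketch. It assumes words are marked with the ocrx_word class; `find_containers` and `words_in` are hypothetical helpers, not functions from hocr/parse.py, and the sample page is invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical page where lines are direct children of ocr_carea.
HOCR = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<div class="ocr_page" title="bbox 0 0 200 100">
 <div class="ocr_carea" title="bbox 0 0 200 40">
  <span class="ocr_line" title="bbox 0 0 200 20">
   <span class="ocrx_word" title="bbox 0 0 80 20">Hello</span>
   <span class="ocrx_word" title="bbox 90 0 200 20">world</span>
  </span>
 </div>
</div>
</body></html>"""


def find_containers(page, classes):
    """Concatenate one findall() result per class, as the diff does."""
    found = []
    for cls in classes:
        found += page.findall('.//*[@class="%s"]' % cls)
    return found


def words_in(container):
    """Return the text of every ocrx_word element below a container."""
    return [w.text for w in container.findall('.//*[@class="ocrx_word"]')]


page = ET.fromstring(HOCR)
before = find_containers(page, ["ocr_par", "ocrx_block"])
after = find_containers(page, ["ocr_par", "ocrx_block", "ocr_carea"])
words = [w for c in after for w in words_in(c)]
print(len(before), len(after), words)
```

With only ocr_par and ocrx_block, no container is found on this page, so no words are visited; adding ocr_carea makes the words reachable.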
I need to take a moment to figure out if this is the right change and ensure things don't break elsewhere. There is also some discussion of the tag here: https://github.com/kba/hocr-spec/issues/28
I have been kind of busy, but the patch above will cause problems for other documents (although it might work for you), so the fix is a little more complicated. I will try to get a proper fix in place for this. It seems like more users are hitting this.
Since I have switched away from lxml, I've been running into some limitations of the XPath support in the Python standard library, so this might take a bit more trickery to get right. The good news is that I've at least added some tests in the past months, so we could add the hOCR version of your document to the tests when this is fixed, assuming that's OK with you.
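As an example of the kind of limitation involved (possibly not the specific one meant here): xml.etree.ElementTree only supports a small XPath subset, so a predicate combining classes with `or` is rejected. The sketch below shows that, plus one possible workaround that walks the tree and keeps only the innermost container, so that an ocr_par nested inside an ocr_carea is not visited twice. The `innermost_containers` helper and the sample markup are hypothetical, not the fix used in archive-hocr-tools.

```python
import xml.etree.ElementTree as ET

CONTAINER_CLASSES = {"ocr_par", "ocrx_block", "ocr_carea"}

# Hypothetical page mixing both layouts: one area with a paragraph inside,
# one area that holds its line directly.
NESTED = """<div class="ocr_page">
 <div class="ocr_carea">
  <p class="ocr_par">
   <span class="ocr_line"><span class="ocrx_word">one</span></span>
  </p>
 </div>
 <div class="ocr_carea">
  <span class="ocr_line"><span class="ocrx_word">two</span></span>
 </div>
</div>"""


def innermost_containers(elem):
    """Return container elements that hold no other container, in document order."""
    inner = [c for child in elem for c in innermost_containers(child)]
    if inner:
        return inner
    if elem.get("class") in CONTAINER_CLASSES:
        return [elem]
    return []


page = ET.fromstring(NESTED)

# ElementTree's XPath subset has no "or" operator inside predicates.
try:
    page.findall('.//*[@class="ocr_par" or @class="ocr_carea"]')
    or_supported = True
except SyntaxError:
    or_supported = False

containers = innermost_containers(page)
classes = [c.get("class") for c in containers]
words = [w.text for c in containers for w in c.findall('.//*[@class="ocrx_word"]')]
print(or_supported, classes, words)
```

Each word is reached exactly once here, whereas concatenating one findall() per class on this page would reach "one" through both its ocr_carea and its ocr_par.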
That's fine with me. Thanks for your work on this!
For the sake of testing, I'm just trying to get this working with one page:
The output PDF does not have the text layer from the hOCR file. The hOCR is generated by the gcv -> hocr tools.
Any idea what I might be doing wrong?