Closed (jwachuta closed this 1 year ago)
Thanks for the report. The bug is unfortunately in lxml, or even in libxml2; see this bug report: https://bugs.launchpad.net/lxml/+bug/1970741
I can blacklist lxml 4.7.0 and higher in the code, so that others at least know where the problem is. How does that sound?
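A version guard of the kind proposed above could look roughly like the sketch below. It is not the check that archive-hocr-tools ships; the names `FIRST_AFFECTED`, `is_affected_lxml` and `warn_if_affected` are hypothetical, and the 4.7.0 threshold is taken only from this thread. It relies on `lxml.etree.LXML_VERSION`, which lxml exposes as a version tuple.

```python
import warnings

# First lxml release reported as affected in this thread (assumption:
# every later release is treated as affected too).
FIRST_AFFECTED = (4, 7, 0)


def is_affected_lxml(version):
    """Return True if the given lxml version tuple is 4.7.0 or higher."""
    return tuple(version[:3]) >= FIRST_AFFECTED


def warn_if_affected():
    """Warn when the installed lxml falls in the affected range."""
    try:
        from lxml import etree
    except ImportError:
        return False  # lxml is not installed, nothing to check
    if is_affected_lxml(etree.LXML_VERSION):
        warnings.warn(
            "lxml 4.7.0 or higher detected; output may be corrupted, "
            "see https://bugs.launchpad.net/lxml/+bug/1970741"
        )
        return True
    return False
```

Calling `warn_if_affected()` once at start-up would be enough to point users at the upstream bug report; raising an exception instead of warning would be the stricter variant.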
You can work around this problem by installing lxml 4.6.5.
Thank you. I can confirm that setting up a virtual environment with lxml 4.6.5 fixes the problem.
By the way, I am working on removing all usage of lxml from the tool in a branch to see if archive-hocr-tools can just use the default Python xml.etree instead. It might be a bit slower, but at least we won't have to wait for a year to get a known corruption bug fixed...
Newer Python versions don't work with lxml 4.6.5 any more, so this is no longer a valid workaround.
I've moved away from lxml in master. :)
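For readers wondering what reading hOCR with only the standard library involves, here is a minimal sketch. It is not the code in master; `SAMPLE` and `page_ids` are made up for illustration, and the sample is a stripped-down hOCR page without the XML declaration or DOCTYPE that tesseract normally emits.

```python
import xml.etree.ElementTree as ET

# hOCR is XHTML, so element names carry this namespace.
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical, heavily reduced hOCR page for illustration only.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<div class="ocr_page" id="page_1" title="bbox 0 0 100 200">
<span class="ocrx_word" title="bbox 1 2 3 4">Hello</span>
</div>
</body>
</html>"""


def page_ids(hocr_text):
    """Return the id of every ocr_page div found in the document."""
    root = ET.fromstring(hocr_text)
    divs = root.iter("{%s}div" % XHTML_NS)
    return [d.get("id") for d in divs if d.get("class") == "ocr_page"]


print(page_ids(SAMPLE))
```

The main practical difference from lxml in a sketch like this is that namespaces have to be spelled out in every tag name, and there is no XPath beyond the limited subset that `ElementTree` supports.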
When trying out
hocr-combine-stream -g "*.hocr" > combined.html
to merge several hocr files produced by tesseract, the resulting output contains multiple </body> and </html> tags, at the close of each of the input pages. recode_pdf is unable to work with the combined hocr file, exiting with the error "lxml.etree.XMLSyntaxError: Extra content at the end of the document."
I'm using this for the first time, so I'm not sure if I'm doing something wrong or if this is a bug in one of the tools. I've tried with several different image sets but all had the same result.
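The symptom described above can be checked without recode_pdf. The sketch below is only an illustration, using the standard-library parser rather than lxml and two made-up helper names (`count_closing_html_tags`, `is_well_formed`): it counts the closing `</html>` tags and tests whether the text parses as a single XML document.

```python
import xml.etree.ElementTree as ET


def count_closing_html_tags(text):
    """Count closing </html> tags; a valid document has exactly one."""
    return text.count("</html>")


def is_well_formed(text):
    """Return True if the text parses as one XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Toy stand-ins for one hOCR page and for a naive concatenation of two.
single = "<html><body><p>page 1</p></body></html>"
combined = single + single

print(count_closing_html_tags(combined), is_well_formed(combined))
```

Reading `combined.html` into a string and passing it to both helpers should show whether the file has more than one document root, which is what the "Extra content at the end of the document" error points to.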
Using Arch Linux with software versions:
archive-hocr-tools 1.1.19
archive-pdf-tools 1.4.15
python-lxml 4.8.0
tesseract 5.1.0
python 3.10.4
Steps to reproduce:
The attached example-hocr.zip file contains the .hocr files and combined.html produced from the preceding example.