hocr-combine-stream output contains multiple </body></html> tags, producing invalid xml

jwachuta commented 2 years ago

When I run hocr-combine-stream -g "*.hocr" > combined.html to merge several hOCR files produced by tesseract, the output contains multiple </body> and </html> tags, one pair at the close of each input page. recode_pdf cannot work with the combined hOCR file and exits with the error "lxml.etree.XMLSyntaxError: Extra content at the end of the document".

I'm using this for the first time, so I'm not sure if I'm doing something wrong or if this is a bug in one of the tools. I've tried with several different image sets but all had the same result.

Using Arch Linux with software versions:

archive-hocr-tools 1.1.19
archive-pdf-tools 1.4.15
python-lxml 4.8.0
tesseract 5.1.0
python 3.10.4

Steps to reproduce:

$ wget https://ia800303.us.archive.org/28/items/rubaiyatfitzgera00omar/rubaiyatfitzgera00omar_jp2.zip
$ unzip rubaiyatfitzgera00omar_jp2.zip
$ cd rubaiyatfitzgera00omar_jp2
$ fd -e jp2 -x bash -c "magick {} TIFF:- | tesseract --dpi 300 - {.} hocr"
$ hocr-combine-stream -g "*.hocr" > combined.html
$ recode_pdf --from-imagestack "*.jp2" --hocr-file combined.html --dpi 300 --bg-downsample 3 --mask-compression jbig2 -o test.pdf
Traceback (most recent call last):
  File "/home/jw/.local/bin/recode_pdf", line 290, in <module>
    res = recode(args.from_pdf, args.from_imagestack, args.dpi, args.hocr_file,
  File "/home/jw/.local/lib/python3.10/site-packages/internetarchivepdf/recode.py", line 634, in recode
    create_tess_textonly_pdf(hocr_file, tess_tmp_path, in_pdf=in_pdf,
  File "/home/jw/.local/lib/python3.10/site-packages/internetarchivepdf/recode.py", line 110, in create_tess_textonly_pdf
    for idx, hocr_page in enumerate(hocr_iter):
  File "/home/jw/.local/lib/python3.10/site-packages/hocr/parse.py", line 47, in hocr_page_iterator
    for act, elem in doc:
  File "src/lxml/iterparse.pxi", line 210, in lxml.etree.iterparse.__next__
  File "src/lxml/iterparse.pxi", line 195, in lxml.etree.iterparse.__next__
  File "src/lxml/iterparse.pxi", line 230, in lxml.etree.iterparse._read_more_events
  File "src/lxml/parser.pxi", line 1376, in lxml.etree._FeedParser.feed
  File "src/lxml/parser.pxi", line 606, in lxml.etree._ParserContext._handleParseResult
  File "src/lxml/parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 725, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 654, in lxml.etree._raiseParseError
  File "/home/jw/rubaiyatfitzgera00omar_jp2/combined.html", line 25
lxml.etree.XMLSyntaxError: Extra content at the end of the document, line 25, column 2

The attached example-hocr.zip file contains the .hocr files and combined.html produced from the preceding example.
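
For anyone wanting to confirm the symptom on their own combined file, here is a minimal sketch using only the Python standard library. It counts the closing </html> tags and tries to parse the text as XML. The function name check_hocr and the two sample strings are made up for illustration; they are not taken from the attached files.

```python
import xml.etree.ElementTree as ET
from typing import Tuple


def check_hocr(text: str) -> Tuple[int, bool]:
    """Return (number of </html> closing tags, whether text parses as XML)."""
    closing_tags = text.count("</html>")
    try:
        ET.fromstring(text)
        well_formed = True
    except ET.ParseError:
        # Raised, for example, when content follows the root element.
        well_formed = False
    return closing_tags, well_formed


# Hypothetical samples: one well-formed document, and one with a second
# </body></html> pair after the root element has already been closed.
good = "<html><body><div class='ocr_page'/></body></html>"
bad = good + "\n<div class='ocr_page'/></body></html>"

print(check_hocr(good))
print(check_hocr(bad))
```

A count above one, together with a failed parse, would match the behaviour described in this report.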

MerlijnWajer commented 2 years ago

Thanks for the report. Unfortunately the bug is in lxml, or possibly even in libxml2; see this bug report: https://bugs.launchpad.net/lxml/+bug/1970741

I can blacklist lxml 4.7.0 and higher in the code, so that others at least know where the problem is. How does that sound?
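
A version guard of the kind proposed here could look roughly like the sketch below. It is an illustration only, not code from archive-hocr-tools: the helper name lxml_is_affected and the warning text are assumptions, and the 4.7.0 threshold is simply the one mentioned in this comment. lxml.etree.LXML_VERSION is the version tuple that lxml itself exposes.

```python
import sys
from typing import Sequence

# Assumption taken from this thread: lxml 4.7.0 and higher show the problem.
FIRST_AFFECTED = (4, 7, 0)


def lxml_is_affected(version: Sequence[int]) -> bool:
    """Compare the (major, minor, patch) part of a version tuple."""
    return tuple(version[:3]) >= FIRST_AFFECTED


try:
    from lxml import etree
    installed = etree.LXML_VERSION  # a tuple such as (4, 8, 0, 0)
except ImportError:
    installed = None  # lxml is not installed, nothing to check

if installed is not None and lxml_is_affected(installed):
    print(
        "warning: lxml %s may produce invalid combined hOCR output"
        % ".".join(str(part) for part in installed),
        file=sys.stderr,
    )
```

Whether to warn or to refuse to run is a design choice; a warning keeps the tool usable for inputs that do not trigger the bug.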

You can work around this problem by installing lxml 4.6.5.

jwachuta commented 2 years ago

Thank you. I can confirm that setting up a virtual environment with lxml 4.6.5 fixes the problem.

MerlijnWajer commented 1 year ago

By the way, I am working on removing all usage of lxml from the tool in a branch, to see if archive-hocr-tools can just use the default Python xml.etree instead. It might be a bit slower, but at least we won't have to wait a year to get a known corruption bug fixed... Newer Python versions don't work with lxml 4.6.5 any more, so that is no longer a valid workaround.
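
To give an idea of what page iteration with the standard library might look like, here is a small sketch based on xml.etree.ElementTree.iterparse. It is not the code from the branch: the function name iter_hocr_pages and the sample document are invented for illustration, and it assumes pages are div elements with class "ocr_page", as in tesseract's hOCR output.

```python
import io
import xml.etree.ElementTree as ET


def iter_hocr_pages(source):
    """Yield each <div class="ocr_page"> element as soon as it is complete."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Tags carry the XHTML namespace, e.g. "{http://www.w3.org/1999/xhtml}div".
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "div" and elem.get("class") == "ocr_page":
            yield elem
            # Free the page's children once the caller is done with it.
            elem.clear()


# Hypothetical two-page document, much smaller than real hOCR output.
sample = (
    b'<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    b'<div class="ocr_page" id="page_1" title="bbox 0 0 10 10"/>'
    b'<div class="ocr_page" id="page_2" title="bbox 0 0 20 20"/>'
    b"</body></html>"
)

page_ids = [page.get("id") for page in iter_hocr_pages(io.BytesIO(sample))]
print(page_ids)
```

Clearing each element after it has been yielded keeps memory use roughly constant for large books, at the cost of the caller not being able to keep references to earlier pages.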

MerlijnWajer commented 1 year ago

https://github.com/internetarchive/archive-hocr-tools/commits/native-python-xml

MerlijnWajer commented 1 year ago

I've moved away from lxml in master. :)

internetarchive / archive-hocr-tools

hocr-combine-stream output contains multiple </body></html> tags, producing invalid xml #5