New hOCR renderer fails to escape or clean text properly

slbayer commented 2 years ago

Bug report

The new hOCR renderer does not escape characters that need escaping. This PDF contains the string "A&P", which should be rendered in HTML as A&amp;P. When I do this:

$ pdf2txt.py --output_type hocr AandP.pdf

I get

<html xmlns='http://www.w3.org/1999/xhtml' xml:lang='en' lang='en' charset='utf-8'>
<head>
<title></title>
<meta http-equiv='Content-Type' content='text/html;charset=utf-8' />
<meta name='ocr-system' content='pdfminer.six HOCR Converter' />
  <meta name='ocr-capabilities' content='ocr_page ocr_block ocr_line ocrx_word'/>
</head>
<body>
<div class='ocr_page' id='1' title='bbox 0 0 612 792'>
<div class='ocr_block' id='0' title='bbox 56 57 81 69'>
<span class='ocr_line' title='bbox 56 57 81 69'><span style='font:"BAAAAA+LiberationSerif"; font-size:12; ' class='ocrx_word' title='bbox 56 57 81 69; x_font BAAAAA+LiberationSerif; x_fsize 12'>A&P</span></span>
</div>
</div>
<!-- comment in the following line to debug -->
<!--script src='https://unpkg.com/hocrjs'></script--></body></html>

which improperly contains the unescaped string A&P.
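
A quick way to confirm that output like this is not well-formed is to run the word span through an XML parser. The sketch below uses Python's standard-library parser rather than lxml, on a simplified copy of the span; `is_well_formed` is a helper name introduced here for illustration, not part of pdfminer.six.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as XML, False otherwise."""
    try:
        ET.fromstring(fragment)
    except ET.ParseError:
        return False
    return True


# Simplified word span as emitted by the renderer (bare ampersand).
broken = "<span class='ocrx_word'>A&P</span>"
# The same span with the ampersand written as an entity.
fixed = "<span class='ocrx_word'>A&amp;P</span>"

print(is_well_formed(broken))
print(is_well_formed(fixed))
```

Only the second span should parse, since a bare ampersand is not allowed in XML character data.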

slbayer commented 2 years ago

Further testing reveals that if the string in the document had been <A&P>, the angle brackets would not have been escaped properly either.
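
For reference, this is roughly what the escaped forms of both strings look like. The sketch uses the standard library's `html.escape`; that pdfminer's own `enc` helper produces identical output is an assumption here, not something verified against the release.

```python
from html import escape

# The two strings discussed above, before and after escaping.
for raw in ("A&P", "<A&P>"):
    # quote=True also escapes single and double quotes, which matters
    # when the text ends up inside an attribute value.
    print(raw, "->", escape(raw, quote=True))
```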

slbayer commented 2 years ago

This needs to be fixed in two places. In release 20221105, in converter.py, line 934 should be

enc(self.working_text.strip()),

instead of

self.working_text.strip(),

and line 913 should be

self.write(enc(text))

instead of

self.write(text)

slbayer commented 2 years ago

Actually, I've now discovered something very closely related: if the stripcontrol attribute of the HOCRConverter is False, the output can contain NUL characters (\x00), and at least the lxml XML parser fails on them. Control stripping is also applied in one place but not in another. My current recommendation is to replace lines 910 - 938 with

    def _clean_text(self, text: str) -> str:
        if self.stripcontrol:
            text = self.CONTROL.sub("", text)
        else:
            text = text.replace("\x00", "")
        return pdfminer.utils.enc(text)

    def write_text(self, text: str) -> None:
        self.write(self._clean_text(text))

    def write_word(self) -> None:
        if len(self.working_text) > 0:
            bold_and_italic_styles = ""
            if "Italic" in self.working_font:
                bold_and_italic_styles = "font-style: italic; "
            if "Bold" in self.working_font:
                bold_and_italic_styles += "font-weight: bold; "
            self.write(
                "<span style='font:\"%s\"; font-size:%d; %s' "
                "class='ocrx_word' title='%s; x_font %s; "
                "x_fsize %d'>%s</span>"
                % (
                    (
                        self.working_font,
                        self.working_size,
                        bold_and_italic_styles,
                        self.bbox_repr(self.working_bbox),
                        self.working_font,
                        self.working_size,
                        self._clean_text(self.working_text.strip()),
                    )
                )
            )
        self.within_chars = False
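
The cleaning step can be tried outside the converter with a standalone sketch like the one below. It mirrors the logic of `_clean_text` above, but several things are assumptions: the `CONTROL` pattern is a guess at the converter's own regex, `html.escape` stands in for `enc`, and `clean_text` with its `strip_control` parameter is a hypothetical free function rather than the method itself.

```python
import re
import xml.etree.ElementTree as ET
from html import escape

# Assumed control-character pattern; the converter's CONTROL regex may differ.
CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def clean_text(text: str, strip_control: bool = True) -> str:
    """Drop characters XML cannot carry, then escape markup characters."""
    if strip_control:
        # Remove all control characters matched by the pattern.
        text = CONTROL.sub("", text)
    else:
        # Even without full stripping, NUL must go or XML parsers reject it.
        text = text.replace("\x00", "")
    return escape(text, quote=True)


raw = "<A&P>\x00"
for strip_control in (True, False):
    cleaned = clean_text(raw, strip_control)
    # Round trip: the cleaned text parses, and the parser restores the original characters.
    element = ET.fromstring(f"<span>{cleaned}</span>")
    print(strip_control, cleaned, element.text)
```

In both modes the cleaned string should parse and round-trip to the original visible text, which is the property the hOCR output currently lacks.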

hrvoj3e commented 3 months ago

I can confirm that there is a problem with angle brackets. The resulting hOCR HTML is invalid.

@slbayer did you find any more cases? Is the proposed fix valid for the current version? Do you plan to make a PR?

pdfminer / pdfminer.six

New hOCR renderer fails to escape or clean text properly #836