Open slbayer opened 2 years ago
Further testing reveals that if the string in the document had been <A&P>, the angle brackets would not have been escaped properly either.
This needs to be fixed in two places. In release 20221105, in converter.py, line 934 should be

    enc(self.working_text.strip()),

instead of

    self.working_text.strip(),

and line 913 should be

    self.write(enc(text))

instead of

    self.write(text)
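
The effect of wrapping the text in an escaping call can be sketched with the standard library alone. This is only an illustration: `html.escape` is used here as a stand-in for `pdfminer.utils.enc`, and the helper name `escape_for_hocr` is hypothetical, not part of pdfminer.

```python
from html import escape


def escape_for_hocr(text: str) -> str:
    """Escape &, <, > and quotes so the text is safe inside HTML markup.

    Hypothetical helper; stands in for pdfminer.utils.enc in this sketch.
    """
    return escape(text)


# The two strings discussed in this issue.
print(escape_for_hocr("A&P"))    # A&amp;P
print(escape_for_hocr("<A&P>"))  # &lt;A&amp;P&gt;
```

Without the escaping step, both strings would be written to the hOCR output verbatim, which is what makes the markup invalid.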
Actually, I've now discovered something very closely related: if the stripcontrol attribute of the HOCRConverter is False, at least the lxml XML parser will fail on zero bytes (\x00). And the control stripping is supported in one place, but not another. My current recommendation is to replace lines 910 - 938 with

    def _clean_text(self, text: str) -> str:
        if self.stripcontrol:
            text = self.CONTROL.sub("", text)
        else:
            text = text.replace("\x00", "")
        return pdfminer.utils.enc(text)

    def write_text(self, text: str) -> None:
        self.write(self._clean_text(text))

    def write_word(self) -> None:
        if len(self.working_text) > 0:
            bold_and_italic_styles = ""
            if "Italic" in self.working_font:
                bold_and_italic_styles = "font-style: italic; "
            if "Bold" in self.working_font:
                bold_and_italic_styles += "font-weight: bold; "
            self.write(
                "<span style='font:\"%s\"; font-size:%d; %s' "
                "class='ocrx_word' title='%s; x_font %s; "
                "x_fsize %d'>%s</span>"
                % (
                    (
                        self.working_font,
                        self.working_size,
                        bold_and_italic_styles,
                        self.bbox_repr(self.working_bbox),
                        self.working_font,
                        self.working_size,
                        self._clean_text(self.working_text.strip()),
                    )
                )
            )
        self.within_chars = False

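
The cleaning logic can be tried outside the converter with a minimal, standard-library-only sketch. Several things here are assumptions rather than pdfminer code: the `CONTROL` pattern is a guess at what the converter's control-character regex looks like, `html.escape` stands in for `pdfminer.utils.enc`, `xml.etree.ElementTree` stands in for lxml, and the free function `clean_text` is a hypothetical equivalent of the proposed `_clean_text` method.

```python
import re
import xml.etree.ElementTree as ET
from html import escape

# Assumed control-character pattern; the real HOCRConverter.CONTROL may differ.
CONTROL = re.compile(r"[\x00-\x08\x0b-\x0c\x0e-\x1f]")


def clean_text(text: str, stripcontrol: bool) -> str:
    """Remove control characters (or at least NUL), then escape markup."""
    if stripcontrol:
        text = CONTROL.sub("", text)
    else:
        # NUL is not a legal XML character, so drop it even when
        # general control stripping is switched off.
        text = text.replace("\x00", "")
    return escape(text)


# A word containing a NUL byte plus characters that need escaping.
word = clean_text("<A\x00&P>", stripcontrol=False)
span = "<span class='ocrx_word'>%s</span>" % word

# The span now parses, and the parser recovers the original visible text.
assert ET.fromstring(span).text == "<A&P>"
```

Routing both write_text and write_word through one helper means the control stripping and the escaping cannot drift apart again.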
I can confirm that there is a problem with angle brackets. The resulting hOCR HTML is invalid.
@slbayer did you have any more new cases? Is the proposed fix valid for the current version? Do you plan to make a PR?
Bug report
The new hOCR renderer does not escape characters that need escaping. This PDF contains the string "A&P", which should be rendered in HTML as A&amp;P. When I do this:

    $ pdf2txt.py --output_type hocr AandP.pdf

I get output which improperly contains the unescaped string A&P.