googleapis / python-documentai-toolbox

Document AI Toolbox is an SDK for Python that provides utility functions for managing, manipulating, and extracting information from the document response. It creates a "wrapped" document object from JSON files in Cloud Storage, local JSON files, or output directly from the Document AI API.
https://cloud.google.com/document-ai/docs/toolbox
Apache License 2.0

fix: Escape html special characters in `hocr_document_template.xml.j2` #279

Closed holtskinner closed 6 months ago

holtskinner commented 6 months ago

Special characters need to be escaped so that the output of the hOCR conversion can be used in other tools. The Jinja documentation also recommends escaping these characters (see "HTML Escaping" at https://jinja.palletsprojects.com/en/3.0.x/templates/).
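To illustrate why this matters, here is a minimal sketch that uses only the Python standard library. It does not reproduce the template change itself: `build_word_span` and the sample text are hypothetical, and `html.escape` stands in for whatever escaping the template applies. It only shows that unescaped `&` and `<` make the markup unparseable as XML, while escaped text round-trips.

```python
import html
import xml.etree.ElementTree as ET


def build_word_span(text: str, escape: bool) -> str:
    """Hypothetical helper: wrap OCR text in an hOCR-like <span>."""
    body = html.escape(text) if escape else text
    return f"<span class='ocrx_word'>{body}</span>"


# Made-up OCR text containing XML special characters.
raw_text = "AT&T <Q>"

# Without escaping, the bare "&" and "<" are not well-formed XML.
try:
    ET.fromstring(build_word_span(raw_text, escape=False))
    unescaped_parses = True
except ET.ParseError:
    unescaped_parses = False

# With escaping, the markup parses and the original text is recovered.
element = ET.fromstring(build_word_span(raw_text, escape=True))

print(unescaped_parses)
print(element.text == raw_text)
```

Downstream tools that read hOCR as XML would hit the same kind of `ParseError` on unescaped input, which is the failure this PR addresses.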

Reported in Customer Issue b/329048716

Fixes #213 🦕

Replacement for #239

holtskinner commented 6 months ago

Verified that the test input failed before HTML escaping was added:

_________________ test_export_hocr_str_with_escape_characters __________________

    def test_export_hocr_str_with_escape_characters():
        wrapped_document = document.Document.from_document_path(
            document_path="tests/unit/resources/toolbox_invoice_test-0-hocr-escape.json"
        )

        actual_hocr = wrapped_document.export_hocr_str(title="toolbox_invoice_test-0")
        assert actual_hocr

>       element = ElementTree.fromstring(actual_hocr)

tests/unit/test_document.py:827: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

text = '<?xml version="1.0" encoding="UTF-8"?>\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3...rx_word\' id=\'word_1_30_0_0_4\' title=\'bbox 585 1781 620 1818\'>t Q</span></span></p></span></div>\n</body>\n</html>'
parser = <xml.etree.ElementTree.XMLParser object at 0x7f67e1a1c160>

    def XML(text, parser=None):
        """Parse XML document from string constant.

        This function can be used to embed "XML Literals" in Python code.

        *text* is a string containing XML data, *parser* is an
        optional parser instance, defaulting to the standard XMLParser.

        Returns an Element instance.

        """
        if not parser:
            parser = XMLParser(target=TreeBuilder())
>       parser.feed(text)
E       xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 14, column 279

/opt/hostedtoolcache/Python/3.11.8/x64/lib/python3.11/xml/etree/ElementTree.py:1345: ParseError
- generated xml file: /home/runner/work/python-documentai-toolbox/python-documentai-toolbox/unit_3.11_sponge_log.xml -
=========================== short test summary info ============================
FAILED tests/unit/test_document.py::test_export_hocr_str_with_escape_characters - xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 14, column 279
1 failed, 152 passed in 57.65s
nox > Command py.test --quiet --junitxml=unit_3.11_sponge_log.xml --cov=google --cov=tests/unit --cov-append --cov-config=.coveragerc --cov-report= --cov-fail-under=0 tests/unit failed with exit code 1