[BUG] Single-quoted text cannot be parsed

chbndrhnns commented 4 years ago

Describe the bug If a docstring contains single-quoted text, it cannot be parsed

```
Traceback (most recent call last):
  File "/Users/powerjo/dev/juice/juiceutils/.venv/lib/python3.8/site-packages/tornado/ioloop.py", line 907, in _run
    return self.callback()
  File "/Users/powerjo/dev/juice/juiceutils/.venv/lib/python3.8/site-packages/livereload/handlers.py", line 69, in poll_tasks
    filepath, delay = cls.watcher.examine()
  File "/Users/powerjo/dev/juice/juiceutils/.venv/lib/python3.8/site-packages/livereload/watcher.py", line 105, in examine
    func()
  File "/Users/powerjo/dev/juice/juiceutils/.venv/lib/python3.8/site-packages/mkdocs/commands/serve.py", line 136, in builder
    build(config, live_server=live_server, dirty=dirty)
  File "/Users/powerjo/dev/juice/juiceutils/.venv/lib/python3.8/site-packages/mkdocs/commands/build.py", line 274, in build
    _populate_page(file.page, config, files, dirty)
  File "/Users/powerjo/dev/juice/juiceutils/.venv/lib/python3.8/site-packages/mkdocs/commands/build.py", line 174, in _populate_page
    page.render(config, files)
  File "/Users/powerjo/dev/juice/juiceutils/.venv/lib/python3.8/site-packages/mkdocs/structure/pages.py", line 183, in render
    self.content = md.convert(self.markdown)
  File "/Users/powerjo/dev/juice/juiceutils/.venv/lib/python3.8/site-packages/markdown/core.py", line 265, in convert
    root = self.parser.parseDocument(self.lines).getroot()
  File "/Users/powerjo/dev/juice/juiceutils/.venv/lib/python3.8/site-packages/markdown/blockparser.py", line 90, in parseDocument
    self.parseChunk(self.root, '\n'.join(lines))
  File "/Users/powerjo/dev/juice/juiceutils/.venv/lib/python3.8/site-packages/markdown/blockparser.py", line 105, in parseChunk
    self.parseBlocks(parent, text.split('\n\n'))
  File "/Users/powerjo/dev/juice/juiceutils/.venv/lib/python3.8/site-packages/markdown/blockparser.py", line 123, in parseBlocks
    if processor.run(parent, blocks) is not False:
  File "/Users/powerjo/dev/juice/juiceutils/.venv/lib/python3.8/site-packages/mkdocstrings/extension.py", line 168, in run
    as_xml = XML(rendered)
  File "/Users/powerjo/.pyenv/versions/3.8.2/lib/python3.8/xml/etree/ElementTree.py", line 1320, in XML
    parser.feed(text)
  File "<string>", line None
xml.etree.ElementTree.ParseError: undefined entity: line 13, column 4
```

To Reproduce Trying to parse this class fails with xml.etree.ElementTree.ParseError: undefined entity: line 13, column 4:

class A:
    """
    'text'
    """

Expected behavior The example can be parsed.

chbndrhnns commented 4 years ago

A single single-quote is enough to break the parser. This example also does not work for me:

class A:
    """VRF's"""

pawamoy commented 4 years ago

I cannot reproduce, can you share the pytkdocs version please?

chbndrhnns commented 4 years ago

I am using pytkdocs 0.6

Somehow the single quote gets converted to the &lsquo; entity (‘), and then the XML parsing fails:

from xml.etree.ElementTree import XML
text = """<div class="doc doc-contents first">
       <p>&lsquo;</p>

     </div>"""
XML(text)

This fails with:

Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/Users/powerjo/.pyenv/versions/3.8.2/lib/python3.8/xml/etree/ElementTree.py", line 1320, in XML
    parser.feed(text)
  File "<string>", line None
xml.etree.ElementTree.ParseError: undefined entity: line 19, column 9
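
For anyone reproducing this locally, the failure can be caught and inspected instead of letting it propagate. This is a minimal sketch using only the standard library; the fragment is a shortened, hypothetical stand-in for the rendered HTML, so the reported position will differ from the numbers in the traceback above.

```python
from xml.etree.ElementTree import XML, ParseError

# Hypothetical stand-in for the HTML that gets passed to XML().
fragment = "<div>\n  <p>&lsquo;</p>\n</div>"

try:
    XML(fragment)
    error = None
except ParseError as exc:
    error = exc

# XML only predefines &lt;, &gt;, &amp;, &quot; and &apos;,
# so an HTML-only entity such as &lsquo; is rejected.
if error is not None:
    line, column = error.position
    print(f"parse failed at line {line}, column {column}: {error}")
```

`ParseError.position` holds the line and column reported by the parser, which helps locate the offending entity in a larger rendered page.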

pawamoy commented 4 years ago

What Python version are you using?

Could you try to run the test suite for pytkdocs on your macos laptop?

git clone https://github.com/pawamoy/pytkdocs
cd pytkdocs
make setup test

chbndrhnns commented 4 years ago

The tests all pass. I am using Python 3.8. For the test suite, however, poetry installs 3.7.

When I look at the exception in the original post, I see that the error does not occur in pytkdocs but in an extension module of mkdocstrings:

  File "/Users/powerjo/dev/juice/juiceutils/.venv/lib/python3.8/site-packages/mkdocstrings/extension.py", line 168, in run
    as_xml = XML(rendered)
  File "/Users/powerjo/.pyenv/versions/3.8.2/lib/python3.8/xml/etree/ElementTree.py", line 1320, in XML
    parser.feed(text)
  File "<string>", line None
xml.etree.ElementTree.ParseError: undefined entity: line 13, column 4

chbndrhnns commented 4 years ago

I found the issue:

markdown_extensions:
  - smarty

Smarty converts the quotes which then cannot be parsed by the XML library.

The entities would need to be unescaped first:

from xml.etree.ElementTree import XML
import html

text = "<div>&lsquo;</div>"
unescaped = html.unescape(text)

XML(unescaped) # passes
XML(text) # fails

pawamoy commented 4 years ago

> When I look at the exception in the original post, I see that the error does not occur in pytkdocs but in an extension module of mkdocstrings:

Oh yes, you're right, sorry about that, I was tired...

> I found the issue:

Great! Thank you for debugging this :slightly_smiling_face:

So, I don't think it's possible to unescape the contents, as an unescaped < or > would then break the XML parsing as well. But I wonder if wrapping the contents in <html>...</html> would make the parser "understand" &lsquo; and similar entities. I'll try that and report back :slightly_smiling_face:
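
The concern about unescaping can be checked with the standard library alone. This is a minimal sketch: the `parses` helper and both fragments are hypothetical, chosen only to show that `html.unescape` fixes the smart-quote case but breaks content containing an escaped `<`.

```python
import html
from xml.etree.ElementTree import XML, ParseError


def parses(text: str) -> bool:
    """Return True if ElementTree accepts the text as well-formed XML."""
    try:
        XML(text)
    except ParseError:
        return False
    return True


# Hypothetical rendered fragments, for illustration only.
smart_quotes = "<div>&lsquo;text&rsquo;</div>"
comparison = "<div>a &lt; b</div>"

results = {
    "smart quotes, as rendered": parses(smart_quotes),
    "smart quotes, unescaped": parses(html.unescape(smart_quotes)),
    "comparison, as rendered": parses(comparison),
    "comparison, unescaped": parses(html.unescape(comparison)),
}
for label, ok in results.items():
    print(f"{label}: {'parses' if ok else 'fails'}")
```

Unescaping swaps one failure for another: the smart-quote fragment starts parsing, while the comparison fragment stops parsing because its `&lt;` becomes a bare `<`.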

pawamoy commented 4 years ago

Wow, that was a wild ride.

The XMLParser class has an html parameter, with which you could define entities such as lsquo, but this parameter is now deprecated. The class sets self.entity = {}.

In Python 2 you could therefore do parser = XMLParser(); parser.entity["lsquo"] = "...", but it doesn't work anymore in Python 3 because it uses C extensions, so you cannot change the object, and trying to access the parser's attributes ends in AttributeError. You cannot inspect the object in debugging sessions either.

I finally found a solution on this SO post. You have to prepend the to-be-parsed text with the entities definition so the parser doesn't crash on them.

ENTITIES = """
    <!DOCTYPE html [
        <!ENTITY nbsp '&amp;nbsp;'>
        <!ENTITY lsquo '&amp;lsquo;'>
        <!ENTITY rsquo '&amp;rsquo;'>
        <!ENTITY ldquo '&amp;ldquo;'>
        <!ENTITY rdquo '&amp;rdquo;'>
        <!ENTITY laquo '&amp;laquo;'>
        <!ENTITY raquo '&amp;raquo;'>
        <!ENTITY hellip '&amp;hellip;'>
        <!ENTITY ndash '&amp;ndash;'>
        <!ENTITY mdash '&amp;mdash;'>
    ]>
"""

parsed = XML(ENTITIES + text)
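
A self-contained variant of the snippet above, as a sketch rather than the code that ended up in mkdocstrings: the entity list is shortened to two entries, and `parse_fragment` and the sample fragment are hypothetical names used for illustration.

```python
from xml.etree.ElementTree import XML, Element

# Shortened entity list; a real one would need to cover every entity
# the Markdown renderer can emit.
ENTITIES = """<!DOCTYPE html [
    <!ENTITY lsquo '&amp;lsquo;'>
    <!ENTITY rsquo '&amp;rsquo;'>
]>
"""


def parse_fragment(fragment: str) -> Element:
    """Parse an HTML fragment after declaring the entities it may use."""
    return XML(ENTITIES + fragment)


root = parse_fragment("<div><p>&lsquo;text&rsquo;</p></div>")
paragraph = root.find("p")
print(paragraph.text)
```

Each entity is declared to expand to its own escaped name, so the reference survives parsing as literal text and nothing is decoded into an actual quote character. An entity missing from the list would presumably still raise a ParseError, so the list has to match what the enabled Markdown extensions produce.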

Damn SmartyPants :angry: :heart: :anger: :hot_face: !

I'll try to release the fix soon.

chbndrhnns commented 4 years ago

@pawamoy Were you able to include the fix for this issue in a release yet?

pawamoy commented 4 years ago

@chbndrhnns I will do it now, thanks for the reminder :slightly_smiling_face:

pawamoy commented 4 years ago

This is fixed in 0.12.1, please reopen if needed.

mkdocstrings / mkdocstrings

[BUG] Single-quoted text cannot be parsed #129