miyuchina / mistletoe

A fast, extensible and spec-compliant Markdown parser in pure Python.
MIT License
831 stars 119 forks source link

Incorrect (?) escaping of `;` (semicolon) in HTML-rendered link targets #102

Closed araspik closed 3 years ago

araspik commented 4 years ago

To reproduce:

import mistletoe
mistletoe.markdown("""
[link](https://git.haproxy.org/?p=haproxy.git;a=blob_plain;f=doc/proxy-protocol.txt;hb=HEAD)
""")

Expected output (prettified):

<p>
    <a href="https://git.haproxy.org/?p=haproxy.git;a=blob_plain;f=doc/proxy-protocol.txt;hb=HEAD">
        https://git.haproxy.org/?p=haproxy.git;a=blob_plain;f=doc/proxy-protocol.txt;hb=HEAD
    </a>
</p>

Actual output (also prettified):

<p>
    <a href="https://git.haproxy.org/?p=haproxy.git%3Ba=blob_plain%3Bf=doc/proxy-protocol.txt%3Bhb=HEAD">
        https://git.haproxy.org/?p=haproxy.git;a=blob_plain;f=doc/proxy-protocol.txt;hb=HEAD
    </a>
</p>

Apparently, mistletoe is escaping the ; (semicolon) into %3B in URLs even when it's a valid URL character.

The CommonMark spec says the following (found below example 498):

URL-escaping should be left alone inside the destination, as all URL-escaped characters are also valid URL characters. Entity and numerical character references in the destination will be parsed into the corresponding Unicode code points, as usual. These may be optionally URL-escaped when written as HTML, but this spec does not enforce any particular policy for rendering URLs in HTML or other formats. Renderers may make different decisions about how to escape or normalize URLs in the output.

The issue seems to come from mistletoe/html_renderer.py:HTMLRenderer.escape_url:L216. The function is documented as helping to prevent code injection, but I don't know how ; could be used in such a manner. I'm not familiar with html.escape and related functions anyways, so am unable to suggest a fix.

A possible fix is to add ; to the list of 'safe' characters, making the line:

        return html.escape(quote(html.unescape(raw), safe='/#:()*?=%@+,&;'))

although I don't know of any possible security issues of this.

pbodnar commented 3 years ago

Good catch. The point is that quote() by default escapes also nearly all "special" characters. I would possibly fix this issue as you suggest. This should be also done for the other existing renderers, not just HTML.

pbodnar commented 3 years ago

OK, so it is fixed now in the master. All 3 renderers (HTML, Jira, XWiki) use the same set of safe characters now (each can theoretically use a different set in the future):

/#:()*?=%@+,&;

All 3 also unescape XML character references firstly now, which seems to be generally required by the CommonMark spec linked above:

Entity and numerical character references in the destination will be parsed into the corresponding Unicode code points, as usual. These may be optionally URL-escaped when written as HTML ...

So I hope the fix is correct.

The only renderer left to fix is for LaTeX, but I don't work with this one, so I will rather file a new issue for that.