Fix escaping ampersands in hrefs

Crozzers commented 1 year ago

This PR aims to fix #384 by stopping ampersands from being escaped in URLs.

The solution:

The solution uses _html_escape_url to remove anything potentially nasty from the URL. After this, the URL is hashed to prevent any further mechanisms from messing with it. The hashed URL is then substituted back in by _unescape_special_chars.

The _unescape_special_chars has also been modified to handle nested hashes, which became necessary due to this patch as URLs would have their underscores hashed by _escape_special_chars before being entirely hashed again, creating a nested hash. _unescape_special_chars will now simply keep replacing hashes in the text until the text stops changing, at which point it returns.

Potential flaws:

Hashing URLs protect them from being tampered with by further safety mechanisms. _html_escape_url is being trusted to return a safe URL.
- However, the only additional safety mechanism that seems to get triggered after links are processed is encoding ampersands and angles, the latter of which is handled by _html_escape_url
_unescape_special_chars will be slightly slower due to this change.
- Perhaps it should only re-check the escape tables if there is a hash present in the output? This would be easy since they all follow the md5-[A-Za-z0-9]{32} format

Not fixed:

This PR does NOT fix escaping ampersands in codeblocks due to this snippet from _encode_code indicating that this is intentional behaviour: https://github.com/trentm/python-markdown2/blob/b396902f39b005cc8eefb4577f70dcb1055cb9c5/lib/markdown2.py#L2134-L2137

This PR also does NOT fix escaping ampersands in maths/latex blocks as markdown2 doesn't currently seem to support maths/latex blocks so there is no mechanism for dealing with their contents

Crozzers commented 1 year ago

_unescape_special_chars will be slightly slower due to this change.

Perhaps it should only re-check the escape tables if there is a hash present in the output? This would be easy since they all follow the md5-[A-Za-z0-9]{32} format

I did some limited testing on this and it turns out that _unescape_special_chars is not noticeably slower after this PR. When I tested the regex idea in the second bullet point, it was twice as slow so perhaps simpler is better

nicholasserra commented 1 year ago

I think that while True is hanging indefinitely. A quick look and it doesn't seem like it's really required.

Crozzers commented 1 year ago

My bad, used != instead of ==.

And the loop is required because links with underscores will have the underscores hashed before the entire link is hashed, creating a nested hash. For example:

Eric wrote up a (long) intro to writing UDL definitions a while back on
his blog: <http://blogs.activestate.com/ericp/2007/01/kid_adding_a_ne.html>

After _escape_special_chars will become:

... <http://blogs.activestate.com/ericp/2007/01/kidmd5-5bc59197ef4366ee97e270f066784a86addingmd5-5bc59197ef4366ee97e270f066784a86amd5-5bc59197ef4366ee97e270f066784a86ne.html>

After _protect_url will become:

... <a href="md5-24db7b0b22e1efe48c9562adb437c6fd">http://blogs.activestate.com/ericp/2007/01/kidmd5-5bc59197ef4366ee97e270f066784a86addingmd5-5bc59197ef4366ee97e270f066784a86amd5-5bc59197ef4366ee97e270f066784a86ne.html</a>

And after only one pass of _unescape_special_chars it still contains the hashed underscores:

'<p> ... <a href="http://blogs.activestate.com/ericp/2007/01/kidmd5-5bc59197ef4366ee97e270f066784a86addingmd5-5bc59197ef4366ee97e270f066784a86amd5-5bc59197ef4366ee97e270f066784a86ne.html">http://blogs.activestate.com/ericp/2007/01/kid_adding_a_ne.html</a></p>'

Which is why I used the loop

nicholasserra commented 1 year ago

On my list to review this week.

nicholasserra commented 1 year ago

Nice thank you! Sorry for delay.

trentm / python-markdown2