Closed Crozzers closed 1 year ago
_unescape_special_chars
will be slightly slower due to this change.
- Perhaps it should only re-check the escape tables if there is a hash present in the output? This would be easy since they all follow the
md5-[A-Za-z0-9]{32}
format
I did some limited testing on this and it turns out that _unescape_special_chars
is not noticeably slower after this PR.
When I tested the regex idea in the second bullet point, it was twice as slow so perhaps simpler is better
I think that while True is hanging indefinitely. A quick look and it doesn't seem like it's really required.
My bad, used !=
instead of ==
.
And the loop is required because links with underscores will have the underscores hashed before the entire link is hashed, creating a nested hash. For example:
Eric wrote up a (long) intro to writing UDL definitions a while back on
his blog: <http://blogs.activestate.com/ericp/2007/01/kid_adding_a_ne.html>
After _escape_special_chars
will become:
... <http://blogs.activestate.com/ericp/2007/01/kidmd5-5bc59197ef4366ee97e270f066784a86addingmd5-5bc59197ef4366ee97e270f066784a86amd5-5bc59197ef4366ee97e270f066784a86ne.html>
After _protect_url
will become:
... <a href="md5-24db7b0b22e1efe48c9562adb437c6fd">http://blogs.activestate.com/ericp/2007/01/kidmd5-5bc59197ef4366ee97e270f066784a86addingmd5-5bc59197ef4366ee97e270f066784a86amd5-5bc59197ef4366ee97e270f066784a86ne.html</a>
And after only one pass of _unescape_special_chars
it still contains the hashed underscores:
'<p> ... <a href="http://blogs.activestate.com/ericp/2007/01/kidmd5-5bc59197ef4366ee97e270f066784a86addingmd5-5bc59197ef4366ee97e270f066784a86amd5-5bc59197ef4366ee97e270f066784a86ne.html">http://blogs.activestate.com/ericp/2007/01/kid_adding_a_ne.html</a></p>'
Which is why I used the loop
On my list to review this week.
Nice thank you! Sorry for delay.
This PR aims to fix #384 by stopping ampersands from being escaped in URLs.
The solution:
The solution uses
_html_escape_url
to remove anything potentially nasty from the URL. After this, the URL is hashed to prevent any further mechanisms from messing with it. The hashed URL is then substituted back in by_unescape_special_chars
.The
_unescape_special_chars
has also been modified to handle nested hashes, which became necessary due to this patch as URLs would have their underscores hashed by_escape_special_chars
before being entirely hashed again, creating a nested hash._unescape_special_chars
will now simply keep replacing hashes in the text until the text stops changing, at which point it returns.Potential flaws:
_html_escape_url
is being trusted to return a safe URL._html_escape_url
_unescape_special_chars
will be slightly slower due to this change.md5-[A-Za-z0-9]{32}
formatNot fixed:
This PR does NOT fix escaping ampersands in codeblocks due to this snippet from
_encode_code
indicating that this is intentional behaviour: https://github.com/trentm/python-markdown2/blob/b396902f39b005cc8eefb4577f70dcb1055cb9c5/lib/markdown2.py#L2134-L2137This PR also does NOT fix escaping ampersands in maths/latex blocks as markdown2 doesn't currently seem to support maths/latex blocks so there is no mechanism for dealing with their contents