miyuchina / mistletoe

A fast, extensible and spec-compliant Markdown parser in pure Python.
MIT License
818 stars 118 forks source link

Escape special LaTeX characters in URLs #190

Closed choeppler closed 1 year ago

choeppler commented 1 year ago

If \url{} is used within macros like \multicolumn and if the link to be rendered contains # or % as typical confluence links like https://site.com/confluence/display/TEST/TEST-042%3A+Foo#target are bound to, pdflatex (from TeX Live 2019) fails with errors. This patch fixes that by escaping these two characters.

NB: if \url{} with such links is used outside of multicolumn-macros, they work regardless of whether these characters are escaped, and the result remains the same.

pbodnar commented 1 year ago

If \url{} is used within macros like \multicolumn and if the link to be rendered contains # or % as typical confluence links like https://site.com/confluence/display/TEST/TEST-042%3A+Foo#target are bound to, pdflatex (from TeX Live 2019) fails with errors. This patch fixes that by escaping these two characters.

@choeppler, thank you for looking at this issue.

Regarding what characters to escape, when looking at this example LaTeX output:

\href{https://site.com/confluence#something}{link}

... shouldn't we escape \, { and } as well (while replacing \ for \\ firstly)? I know it is not quite common to have these characters in a URL, yet it is practically possible.

pbodnar commented 1 year ago

Regarding the commit message, it looks like you've got some redundant characters on the first line. It is also good to add issue number. I.e.:

-fix: escape special LaTeX characters in URLs``
+fix: escape special LaTeX characters in URLs (#114)
pbodnar commented 1 year ago

@choeppler, you still there? It would be great to have some feedback to my feedback. :) I think I can find the time to finish this change on my self otherwise...

choeppler commented 1 year ago

@pbodnar , thank you for your review and sorry for the delay. I've been unexpectedly offline for some weeks but will try to address your review comments within the next few days. Ok?

pbodnar commented 1 year ago

@choeppler, sure, I can wait. Good to hear you're back. :)

choeppler commented 1 year ago

Regarding what characters to escape, when looking at this example LaTeX output:

\href{https://site.com/confluence#something}{link}

... shouldn't we escape \, { and } as well (while replacing \ for \\ firstly)? I know it is not quite common to have these characters in a URL, yet it is practically possible.

Well, looking at the first couple of answers to https://stackoverflow.com/questions/1547899/which-characters-make-a-url-invalid, and also the RFCs these are based on, the three characters \, { and } should not occur in valid URLs. So I think we shouldn't escape those here. However, I just tested the complete valid character set with my LaTeX testcase and found '&' has to be escaped on the LaTeX-side, as well. I'll update the patch accordingly.

choeppler commented 1 year ago

Regarding the commit message, it looks like you've got some redundant characters on the first line. It is also good to add issue number. I.e.:

-fix: escape special LaTeX characters in URLs``
+fix: escape special LaTeX characters in URLs (#114)

@pbodnar, I've added the changes addressing your review comments as additional commits to make the review process easier. Once we're done, I can prepare the merge by squashing and force-pushing this PR's commits. Plan to update the commit message as advised then.

pbodnar commented 1 year ago

Regarding what characters to escape, when looking at this example LaTeX output:

\href{https://site.com/confluence#something}{link}

... shouldn't we escape \, { and } as well (while replacing \ for \\ firstly)? I know it is not quite common to have these characters in a URL, yet it is practically possible.

Well, looking at the first couple of answers to https://stackoverflow.com/questions/1547899/which-characters-make-a-url-invalid, and also the RFCs these are based on, the three characters \, { and } should not occur in valid URLs. So I think we shouldn't escape those here. However, I just tested the complete valid character set with my LaTeX testcase and found '&' has to be escaped on the LaTeX-side, as well. I'll update the patch accordingly.

Well, yes, that's the theory though. In practice, the characters \, { and } are still used sometimes (see e.g. this answer on the page you link to). An example accepted by Google: http://www.google.com/?q={a:1,b:2}.

Next, there is the correct finding of yours that all the possible characters with special meaning in LaTeX (as discussed within #112) are already escaped within _render_rawtext() (which also includes escaping of the label part of links):

    def render_raw_text(self, token, escape=True):
        return (token.content.replace('$', '\\$').replace('#', '\\#')
                             .replace('{', '\\{').replace('}', '\\}')
                             .replace('&', '\\&').replace('_', '\\_')
                             .replace('%', '\\%')
               ) if escape else token.content

So, to keep it simple, yet robust, and not to repeat ourselves in the code, what about simply using this existing method to also escape the URL part of the link? Or could it break something? I've tried to test links e.g. via https://latexeditor.lagrida.com/, but it doesn't seem to unescape e.g. \&, but I'm not sure if I can trust that tool. I couldn't find anything else working without registration fast.

It would be great if there was some LaTeX documentation that would direct us on this topic, but I can remember I couldn't find anything really useful some years ago. Note that I don't use LaTeX myself...

choeppler commented 1 year ago

Well, looking at the first couple of answers to https://stackoverflow.com/questions/1547899/which-characters-make-a-url-invalid, and also the RFCs these are based on, the three characters \, { and } should not occur in valid URLs. So I think we shouldn't escape those here. However, I just tested the complete valid character set with my LaTeX testcase and found '&' has to be escaped on the LaTeX-side, as well. I'll update the patch accordingly.

Well, yes, that's the theory though. In practice, the characters \, { and } are still used sometimes (see e.g. this answer on the page you link to). An example accepted by Google: http://www.google.com/?q={a:1,b:2}.

I see. (Funny: when copied from my browser's address field, the url is properly escaped: http://www.google.com/?q=%7Ba:1,b:2%7D...)

Next, there is the correct finding of yours that all the possible characters with special meaning in LaTeX (as discussed within #112) are already escaped within _render_rawtext() (which also includes escaping of the label part of links):

    def render_raw_text(self, token, escape=True):
        return (token.content.replace('$', '\\$').replace('#', '\\#')
                             .replace('{', '\\{').replace('}', '\\}')
                             .replace('&', '\\&').replace('_', '\\_')
                             .replace('%', '\\%')
               ) if escape else token.content

So, to keep it simple, yet robust, and not to repeat ourselves in the code, what about simply using this existing method to also escape the URL part of the link? Or could it break something? I've tried to test links e.g. via https://latexeditor.lagrida.com/, but it doesn't seem to unescape e.g. \&, but I'm not sure if I can trust that tool. I couldn't find anything else working without registration fast.

It would be great if there was some LaTeX documentation that would direct us on this topic, but I can remember I couldn't find anything really useful some years ago. Note that I don't use LaTeX myself...

I haven't found any useful documentation either but I've been using pdflatex for a long time and just checked, how that implementation behaves if we escape all characters from the set escaped by render_raw_text in the URL part:

To sum up: the following hyperref-links are both displayed as and link to http://a/A$#{}&_% (NB: escaping the label part works as expected):

However, most of this is not so much of a problem if the URL-part is first escaped by HTMLRenderer.escape_url as proposed in 97bfce1, because then the only special characters that remain are %, #, _, and &, (the last two of which do not have to be escaped in the URL part but it doesn't seem to hurt either): <http://a/A$#{}&_%> --> \url{http://a/A\%24\#\%7B\%7D&amp;_\%}

(My only concern with this solution is that if the links in the markdown-source are already url-escaped, then the & becomes double-escaped: <http://a/A%24#%7B%7D&amp;_%> --> \url{http://a/A\%24\#\%7B\%7D&amp;amp;_\%}. But that could be noted down as a separate issue, I guess...)

pbodnar commented 1 year ago

@choeppler, nice analysis, thanks for that. The code also looks +- OK.

Just this one:

(My only concern with this solution is that if the links in the markdown-source are already url-escaped, then the & becomes double-escaped: <http://a/A%24#%7B%7D&amp;_%> --> \url{http://a/A\%24\#\%7B\%7D&amp;amp;_\%}. But that could be noted down as a separate issue, I guess...)

Yeah, that's the effect of the call to html.escape() inside HTMLRenderer.escape_url(). I would propose then not to call that function from LaTeX renderer code and instead simply duplicate this part: quote(raw, safe='/#:()*?=%@+,&;'). I think we could make a shared function out of this expression (it is also used in Jira and XWiki renderers), but rather in a separate commit / MR.

coveralls commented 1 year ago

Coverage Status

coverage: 94.056% (+0.01%) from 94.044% when pulling 92e246bb2a363237a40264ef8e2feb7dcfd86a5b on boschresearch:fix-escape-latex-urls into d7f06d05d35bd849ca54865fdc6133132e6cab5f on miyuchina:master.

pbodnar commented 1 year ago

@choeppler, thank you!