Closed choeppler closed 1 year ago
If
\url{}
is used within macros like\multicolumn
and if the link to be rendered contains#
or%
as typical confluence links like https://site.com/confluence/display/TEST/TEST-042%3A+Foo#target are bound to, pdflatex (from TeX Live 2019) fails with errors. This patch fixes that by escaping these two characters.
@choeppler, thank you for looking at this issue.
Regarding what characters to escape, when looking at this example LaTeX output:
\href{https://site.com/confluence#something}{link}
... shouldn't we escape \
, {
and }
as well (while replacing \
for \\
firstly)? I know it is not quite common to have these characters in a URL, yet it is practically possible.
Regarding the commit message, it looks like you've got some redundant characters on the first line. It is also good to add issue number. I.e.:
-fix: escape special LaTeX characters in URLs``
+fix: escape special LaTeX characters in URLs (#114)
@choeppler, you still there? It would be great to have some feedback to my feedback. :) I think I can find the time to finish this change on my self otherwise...
@pbodnar , thank you for your review and sorry for the delay. I've been unexpectedly offline for some weeks but will try to address your review comments within the next few days. Ok?
@choeppler, sure, I can wait. Good to hear you're back. :)
Regarding what characters to escape, when looking at this example LaTeX output:
\href{https://site.com/confluence#something}{link}
... shouldn't we escape
\
,{
and}
as well (while replacing\
for\\
firstly)? I know it is not quite common to have these characters in a URL, yet it is practically possible.
Well, looking at the first couple of answers to https://stackoverflow.com/questions/1547899/which-characters-make-a-url-invalid, and also the RFCs these are based on, the three characters \
, {
and }
should not occur in valid URLs. So I think we shouldn't escape those here. However, I just tested the complete valid character set with my LaTeX testcase and found '&' has to be escaped on the LaTeX-side, as well. I'll update the patch accordingly.
Regarding the commit message, it looks like you've got some redundant characters on the first line. It is also good to add issue number. I.e.:
-fix: escape special LaTeX characters in URLs`` +fix: escape special LaTeX characters in URLs (#114)
@pbodnar, I've added the changes addressing your review comments as additional commits to make the review process easier. Once we're done, I can prepare the merge by squashing and force-pushing this PR's commits. Plan to update the commit message as advised then.
Regarding what characters to escape, when looking at this example LaTeX output:
\href{https://site.com/confluence#something}{link}
... shouldn't we escape
\
,{
and}
as well (while replacing\
for\\
firstly)? I know it is not quite common to have these characters in a URL, yet it is practically possible.Well, looking at the first couple of answers to https://stackoverflow.com/questions/1547899/which-characters-make-a-url-invalid, and also the RFCs these are based on, the three characters
\
,{
and}
should not occur in valid URLs. So I think we shouldn't escape those here. However, I just tested the complete valid character set with my LaTeX testcase and found '&' has to be escaped on the LaTeX-side, as well. I'll update the patch accordingly.
Well, yes, that's the theory though. In practice, the characters \
, {
and }
are still used sometimes (see e.g. this answer on the page you link to). An example accepted by Google: http://www.google.com/?q={a:1,b:2}.
Next, there is the correct finding of yours that all the possible characters with special meaning in LaTeX (as discussed within #112) are already escaped within _render_rawtext() (which also includes escaping of the label part of links):
def render_raw_text(self, token, escape=True):
return (token.content.replace('$', '\\$').replace('#', '\\#')
.replace('{', '\\{').replace('}', '\\}')
.replace('&', '\\&').replace('_', '\\_')
.replace('%', '\\%')
) if escape else token.content
So, to keep it simple, yet robust, and not to repeat ourselves in the code, what about simply using this existing method to also escape the URL part of the link? Or could it break something? I've tried to test links e.g. via https://latexeditor.lagrida.com/, but it doesn't seem to unescape e.g. \&
, but I'm not sure if I can trust that tool. I couldn't find anything else working without registration fast.
It would be great if there was some LaTeX documentation that would direct us on this topic, but I can remember I couldn't find anything really useful some years ago. Note that I don't use LaTeX myself...
Well, looking at the first couple of answers to https://stackoverflow.com/questions/1547899/which-characters-make-a-url-invalid, and also the RFCs these are based on, the three characters
\
,{
and}
should not occur in valid URLs. So I think we shouldn't escape those here. However, I just tested the complete valid character set with my LaTeX testcase and found '&' has to be escaped on the LaTeX-side, as well. I'll update the patch accordingly.Well, yes, that's the theory though. In practice, the characters
\
,{
and}
are still used sometimes (see e.g. this answer on the page you link to). An example accepted by Google: http://www.google.com/?q={a:1,b:2}.
I see. (Funny: when copied from my browser's address field, the url is properly escaped: http://www.google.com/?q=%7Ba:1,b:2%7D
...)
Next, there is the correct finding of yours that all the possible characters with special meaning in LaTeX (as discussed within #112) are already escaped within _render_rawtext() (which also includes escaping of the label part of links):
def render_raw_text(self, token, escape=True): return (token.content.replace('$', '\\$').replace('#', '\\#') .replace('{', '\\{').replace('}', '\\}') .replace('&', '\\&').replace('_', '\\_') .replace('%', '\\%') ) if escape else token.content
So, to keep it simple, yet robust, and not to repeat ourselves in the code, what about simply using this existing method to also escape the URL part of the link? Or could it break something? I've tried to test links e.g. via https://latexeditor.lagrida.com/, but it doesn't seem to unescape e.g.
\&
, but I'm not sure if I can trust that tool. I couldn't find anything else working without registration fast.It would be great if there was some LaTeX documentation that would direct us on this topic, but I can remember I couldn't find anything really useful some years ago. Note that I don't use LaTeX myself...
I haven't found any useful documentation either but I've been using pdflatex for a long time and just checked, how that implementation behaves if we escape all characters from the set escaped by render_raw_text
in the URL part:
\{
and \}
if used in the URL part are not unescaped. If {
and }
are balanced, they can be used as is in the URL part, though.\$
leads to a TEX stack overflow. $
can be used in the URL part.To sum up: the following hyperref-links are both displayed as and link to http://a/A$#{}&_%
(NB: escaping the label part works as expected):
\href{http://a/A$\#{}\&\_\%}{http://a/A\$\#\{\}\&\_\%}
(max number of escapes in URL part that are properly dealt with by pdflatex)\href{http://a/A$\#{}&_\%}{http://a/A\$\#\{\}\&\_\%}
(min number of necessary escapes in URL part for pdflatex to work without errors)However, most of this is not so much of a problem if the URL-part is first escaped by HTMLRenderer.escape_url
as proposed in 97bfce1, because then the only special characters that remain are %
, #
, _
, and &
, (the last two of which do not have to be escaped in the URL part but it doesn't seem to hurt either):
<http://a/A$#{}&_%>
--> \url{http://a/A\%24\#\%7B\%7D&_\%}
(My only concern with this solution is that if the links in the markdown-source are already url-escaped, then the &
becomes double-escaped:
<http://a/A%24#%7B%7D&_%>
--> \url{http://a/A\%24\#\%7B\%7D&amp;_\%}
. But that could be noted down as a separate issue, I guess...)
@choeppler, nice analysis, thanks for that. The code also looks +- OK.
Just this one:
(My only concern with this solution is that if the links in the markdown-source are already url-escaped, then the
&
becomes double-escaped:<http://a/A%24#%7B%7D&_%>
-->\url{http://a/A\%24\#\%7B\%7D&amp;_\%}
. But that could be noted down as a separate issue, I guess...)
Yeah, that's the effect of the call to html.escape()
inside HTMLRenderer.escape_url()
. I would propose then not to call that function from LaTeX renderer code and instead simply duplicate this part: quote(raw, safe='/#:()*?=%@+,&;')
. I think we could make a shared function out of this expression (it is also used in Jira and XWiki renderers), but rather in a separate commit / MR.
@choeppler, thank you!
If
\url{}
is used within macros like\multicolumn
and if the link to be rendered contains#
or%
as typical confluence links like https://site.com/confluence/display/TEST/TEST-042%3A+Foo#target are bound to, pdflatex (from TeX Live 2019) fails with errors. This patch fixes that by escaping these two characters.NB: if
\url{}
with such links is used outside of multicolumn-macros, they work regardless of whether these characters are escaped, and the result remains the same.