jgm / pandoc

Universal markup converter
https://pandoc.org
Other
34.67k stars 3.38k forks source link

Pandoc percent-encodes some URI characters #6525

Open cjerdonek opened 4 years ago

cjerdonek commented 4 years ago

Pandoc percent-encodes some characters like "[" and "]" in URIs. For example, this

becomes:

https://pandoc.org/try/?text=%3Ca+href%3D%22https%3A%2F%2Fexample.com%2F_%5Bbracketed%5D_%22%3Elink%3C%2Fa%3E&from=html&to=gfm&standalone=0

(I believe this is because of e.g. #2377.)

A problem though is that some characters like brackets are reserved characters, which means that if the character has a special meaning in a particular context, then percent-encoding the character will cause the URI to be different (because the non-percent-encoded occurrences of the character will no longer be distinguishable from the percent-encoded occurrences).

In general, I think a better approach to work towards would be to leave URI's alone as the user gave them (even for characters like "<" and ">" that aren't reserved). That way URI's will roundtrip. And if a particular character is problematic for some destination format, then Pandoc can escape the character when writing to that format, but using the escaping method of the destination format (e.g. backslash-escaping parentheses in the case of Markdown) rather than altering the URI.

mb21 commented 4 years ago

But if it's a reserved character, that means you should escape it, right? Thing is, the HTML in question is invalid. Putting it in https://validator.w3.org, gives:

Bad value https://example.com/_[bracketed]_ for attribute href on element a: Illegal character in path segment: [ is not allowed.

and I don't think pandoc gives any guarantees on what happens on invalid HTML intput....


btw. current behaviour happens in the HTML reader:

echo '<a href="https://example.com/_[bracketed]_">link</a>' | pandoc -f html -t native

[Plain [Link ("",[],[]) [Str "link"] ("https://example.com/_%5Bbracketed%5D_","")]]
cjerdonek commented 4 years ago

But if it's a reserved character, that means you should escape it, right?

Not in general, no. If you look at the Wikipedia page I linked to, for example, you'll see that reserved characters are characters that may have a special meaning in certain contexts, and those characters should only be percent-encoded when not being used for that special purpose. So it depends.

To give an example, "@" is a reserved character, but mailto:fred@example.com is an example of a URI where the "@" is in the path portion but not percent-encoded. The reason it's not percent-encoded is that it's being used for the special purpose of separating the username from the domain.

I didn't mean to base my suggestion on technical standards though. What I'm primarily saying is that I think for users it would be less surprising and easier to understand and work with if Pandoc left alone the URI's that the user gave, as opposed to trying to fix them in some way.

If Pandoc leaves them alone, users will have more control over how URIs will look in the output e.g. in Markdown. This way users can choose to make the URIs be more human-readable instead of having lots of % symbols.

Pandoc already chooses not to percent-encode non-ascii characters (which is a good thing), even though the latest RFC (RFC 3986) says non-ascii characters also aren't allowed in URI's and must be percent-encoded. So another way to look at my suggestion is for Pandoc to treat characters like "[", "]", "<", etc. like it chose to treat non-ascii characters.

mb21 commented 4 years ago

well, pandoc should generate valid output, so somewhere we should escape things properly... here, the question is what to do with invalid input...

btw. I think RFC 3986 was superseded by RFC 3987...

cjerdonek commented 4 years ago

I think RFC 3986 was superseded by RFC 3987...

I don't think so. For example, if you read the intro it says, "This document defines a new protocol element, the Internationalized Resource Identifier (IRI), as a complement to the Uniform Resource Identifier (URI)." Also, it doesn't say it "Updates" or "Obsoletes" anything at the top, like RFC 3986 does.

The issue is also evident when converting from Markdown to Markdown, so it's not just about HTML I don't think. Example:

https://pandoc.org/try/?text=%5Blink%5D(https%3A%2F%2Fexample.com%2F_%5B_)&from=markdown&to=markdown&standalone=0

Also, one more thing: the issue (#2377) that lead to these characters being escaped used RFC 2396's section on excluded characters as its justification, but that RFC was obsoleted by RFC 3986. Either way, I still think the very complicated and ever-changing task of validating URI's should be out of scope for Pandoc, in which case Pandoc can just pass them along as is..

cjerdonek commented 4 years ago

Just as another data point, the HTML Tidy tool has had a --fix-uri yes/no command-line option since at least 2016 and perhaps a lot longer before. If yes, it percent-encodes characters in URI's (including any non-ascii characters). And if no, it leaves all URI's as is.

Here are a couple issues there related to the behavior of this option (the primary question there was whether non-ascii characters needed to be quoted): https://github.com/htacg/tidy-html5/issues/352 https://github.com/htacg/tidy-html5/issues/378

One difference of course is that HTML Tidy writes only to html, and not to formats like Markdown meant to be read by people.

jgm commented 4 years ago

@cjerdonek I'm pretty sympathetic to your idea.

jgm commented 4 years ago

It's delicate though. We'd have to look at implications like #2377, and if you search the code for escapeURI, you'll see that it is used in many writers to check for links that should be plain URIs (autolinks in markdown). So lots of things might break if we made this change.

cjerdonek commented 4 years ago

Thanks, @jgm. I appreciate it.

I wonder if there's an incremental approach that could make things easier, for example by incrementally adding support on a per-output format basis. Or if it's an extension, support for the extension could be added for certain more popular output formats first.

If characters in an unescaped URI need to be escaped separately using an output-dependent method (e.g. backslash-escaping parentheses in the case of Markdown), then each output format would need to be looked at individually anyways. Percent-encoding the URI is just sort of a hammer approach that happens to solve the issue for a bunch of characters and output formats (but at the expense of losing the original URI). (However, I'll note that for Markdown at least, it still doesn't address parentheses, since those characters aren't getting quoted by escapeURI.)

I haven't looked at the internals myself, but maybe it could also help (at least as an intermediate step) to add to Pandoc's internal representation of a link the original, unescaped value, in addition to the escaped value (which is what's currently stored).

cjerdonek commented 4 years ago

However, I'll note that for Markdown at least, it still doesn't address parentheses, since those characters aren't getting quoted by escapeURI.

I made a separate issue for this point (#6536), since the fix is different.