jgm / pandoc

Universal markup converter
https://pandoc.org

URI encoding can produce malformed URIs #7999

Open jheer opened 2 years ago

jheer commented 2 years ago

When converting Markdown links, some URI content is encoded; however, it appears that the percent character (%) is excluded. This can result in malformed URIs.

To reproduce: https://pandoc.org/try/?text=%5Blink%5D(https%3A%2F%2Fexample.com%2F%3Fparam%3D%7B%252%7D)&from=markdown&to=markdown&standalone=0

Now take the resulting URI and try the following in JavaScript:

decodeURI('https://example.com/?param=%7B%2%7D')

The result is URIError: malformed URI sequence. One solution would be to URI-encode a bare percent character as %25. Another (separate) possibility would be to make URI encoding optional, as discussed in #6525.
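To make the failure and the proposed fix concrete, here is a minimal sketch. The helper name `escapeStrayPercents` is hypothetical and not part of pandoc; it assumes that a "%" followed by two hex digits is an intentional escape and that any other "%" is a literal character.

```javascript
// Hypothetical helper (not part of pandoc): escape any "%" that does not
// begin a valid percent-encoded triplet ("%" followed by two hex digits).
function escapeStrayPercents(uri) {
  return uri.replace(/%(?![0-9A-Fa-f]{2})/g, '%25');
}

// The URI produced in the reproduction above.
const broken = 'https://example.com/?param=%7B%2%7D';

// Decoding the URI as-is throws, as reported.
let error = null;
try {
  decodeURI(broken);
} catch (e) {
  error = e; // URIError
}

// After escaping the stray "%", the URI decodes cleanly.
const repaired = escapeStrayPercents(broken);
const decoded = decodeURI(repaired);

console.log(repaired); // https://example.com/?param=%7B%252%7D
console.log(decoded);  // https://example.com/?param={%2}
```

This heuristic only repairs sequences that are syntactically invalid. It cannot tell whether a well-formed sequence such as %25 in the input was meant literally or as an escape.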

jgm commented 2 years ago

The assumption is that if you use a % sign in a URI, you're doing percent-encoding, so we don't touch it.
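A rough model of that policy, for illustration only: escape a set of unsafe characters but never touch "%". The function name `encodeLeavingPercent` and the character set are assumptions made for this sketch and are not taken from pandoc's source; the set is chosen only so that the model reproduces the `%7B%2%7D` result reported in this thread.

```javascript
// Assumed set of characters to escape; chosen for illustration only and
// not taken from pandoc's source. Note that "%" is deliberately absent.
const ASSUMED_UNSAFE = /[\s<>|"{}\[\]^`]/g;

// Hypothetical model of "encode unsafe characters, never touch %".
function encodeLeavingPercent(dest) {
  return dest.replace(ASSUMED_UNSAFE, (ch) => encodeURIComponent(ch));
}

// Existing escapes survive unchanged, which is the intent of the policy.
console.log(encodeLeavingPercent('a%20b')); // a%20b

// A literal "%" next to characters that do get escaped yields a
// sequence that is not valid percent-encoding.
console.log(encodeLeavingPercent('{%2}'));  // %7B%2%7D
```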

jheer commented 2 years ago

Thank you for clarifying. Unfortunately, the current approach can produce incorrect URIs. FWIW, the ability to opt out of URI encoding would be a sufficient solution for my particular use case.

In any case, thank you for pandoc! It is an incredible resource.

jgm commented 2 years ago

Opting out of URI encoding would still allow incorrect URIs to be produced, just like now, right?

jheer commented 2 years ago

Certainly, document authors might insert arbitrary (including broken) URIs. All manner of characters might show up in an unencoded URI query string, which I assume is part of the motivation for automatically applying a URI encoding in pandoc. Unfortunately, under the current strategy, if such query strings include a percent character the result can get garbled.

My example above demonstrates pandoc producing a malformed URI. The current pandoc behavior can also produce well-formed but semantically different URIs. For example, imagine a web page that accepts a math formula within a query string: if an (unencoded) input URI looks like "http://example.com/formula?x%25", the author means "x mod 25", but the URI that comes out of pandoc decodes to "x%". Applying URI encoding piecemeal can produce inconsistent results.
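The ambiguity in the formula example can be sketched with the standard JavaScript URI functions. The variable names are illustrative, and the query value is treated on its own rather than as part of a full URI.

```javascript
// What the author typed in the unencoded query string, meaning "x mod 25".
const raw = 'x%25';

// If "%25" is assumed to be an existing escape and passed through,
// a consumer that decodes the query sees a different formula.
const asIfEncoded = decodeURIComponent(raw);

// If the raw text is fully encoded first, decoding restores what was typed.
const fullyEncoded = encodeURIComponent(raw);
const roundTripped = decodeURIComponent(fullyEncoded);

console.log(asIfEncoded);  // x%
console.log(fullyEncoded); // x%2525
console.log(roundTripped); // x%25
```

Neither reading is recoverable from the string alone: an encoder has to be told whether its input is already encoded.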

Of course, other approaches to automatic encoding might bring their own problems. My own use case may be less common. I'm trying to use pandoc as a parser, then perform my own downstream processing of the resulting AST. I'm happy to handle URIs on my own, and so would be content to configure pandoc to simply step out of the way. Thanks!

jgm commented 2 years ago

There are actually two pieces to this, and I've struggled to find the right solution:

  1. How should we parse links (and similarly images) in Markdown? That is, what string do we store in the AST for the link destination? (Note: not everyone is targeting HTML, and for an image, this might simply be a filename.)
  2. How should we render the link destination in HTML and other formats?

Currently the HTML writer doesn't do any percent-encoding. Pandoc's markdown reader has always encoded the URI (#1), but in commonmark I used a different approach, storing the raw string. You can see the difference here:

% pandoc -t native
[link]({%2})
[ Para
    [ Link ( "" , [] , [] ) [ Str "link" ] ( "%7B%2%7D" , "" ) ]
]
% pandoc -t native -f commonmark
[link]({%2})
[ Para
    [ Link ( "" , [] , [] ) [ Str "link" ] ( "{%2}" , "" ) ]
]

In fact, the commonmark reader (and variants of it like gfm) will behave the way you want here.

% pandoc -f commonmark -t html
[link]({%2})
<p><a href="{%2}">link</a></p>

I'm tempted to bring the markdown reader into conformity with this, but this might break things that work now.
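For the use case described earlier in the thread (using pandoc as a parser and handling URIs downstream), a minimal sketch of reading the raw destinations back out of pandoc's JSON AST might look like the following. It assumes the AST was produced with `pandoc -f commonmark -t json`. The function name `collectTargets` and the hand-written sample are illustrative, and the "pandoc-api-version" value is a placeholder.

```javascript
// Recursively collect the destinations of Link and Image nodes from a
// pandoc JSON AST.
function collectTargets(node, out = []) {
  if (Array.isArray(node)) {
    for (const child of node) collectTargets(child, out);
  } else if (node !== null && typeof node === 'object') {
    if ((node.t === 'Link' || node.t === 'Image') && Array.isArray(node.c)) {
      // c = [attr, inlines, [destination, title]]
      out.push(node.c[2][0]);
    }
    // Keep walking so that nested nodes (e.g. an image inside a link) are found.
    for (const value of Object.values(node)) collectTargets(value, out);
  }
  return out;
}

// Hand-written sample mirroring the commonmark native output shown above.
const sampleAst = {
  'pandoc-api-version': [1, 23], // placeholder value
  meta: {},
  blocks: [
    { t: 'Para', c: [
      { t: 'Link', c: [['', [], []], [{ t: 'Str', c: 'link' }], ['{%2}', '']] },
    ] },
  ],
};

const targets = collectTargets(sampleAst);
console.log(targets); // [ '{%2}' ]
```

With the raw string in hand, the downstream tool can apply whatever encoding policy it needs.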