turndown not escaping links properly

mixmark-io / turndown

🛏 An HTML to Markdown converter written in JavaScript

https://mixmark-io.github.io/turndown

MIT License

8.93k stars 880 forks source link

turndown not escaping links properly #459

Closed shaipetel closed 7 months ago

shaipetel commented 7 months ago

I noticed this scenario and think it would be useful for others if turndown would handle this better.

Background: ( and ) are valid characters in a URL, and won't get escaped in normal HTML. In markdown, links are surrounded by ( and ), if your link needs to have ( and ) you will need to escape them with \

For this input html:

<a href="https://google.com/file(1).jpg">link</a>

A valid markdown should be:

[link](https://google.com\(1\).jpg)

However turndown returns:

[link](https://google.com(1).jpg)

I'm now pre-fixing image and links in my html prior to calling the turndown service, but I really think this should be handled in the parser. When you do, be sure not to double-escape URLs if there were already escaped (meaning, if the URL has "(1).jpg" there is no need to double-escape it.

pavelhoral commented 7 months ago

Good catch. Btw. according to the spec it seems that parentheses does not have to be escaped as long as they are balanced. So in your example you should be fine without any preprocessing.

I don't understand your last comment about "double-escaping URLs". Can you elaborate a bit more?

shaipetel commented 7 months ago

Yeah, sure. If someone (like me lol) loads an html:

<a href="https://google.com/file(1).jpg">link</a>

But, protected it before calling turndown, due to this issue:

<a href="https://google.com/file\(1\).jpg">link</a>

See that the parentheses were already escaped and leave them be.

FYI - I tried 2 markdown editors, they both automatically escape links that have (1). You are right, they can load a markdown file that have (1) in a link - but they escape it as soon as its loaded, and if I save the file they will save it escaped.

pavelhoral commented 7 months ago

<a href="https://google.com/file\(1\).jpg">link</a>

I see. But that would be invalid URL in the HTML input. I don't think this library should take something like that into account.

I think we should add "aggressive" parenthesis escaping (i.e. don't bother checking if the parenthesis are balanced, simply escape them every time). Such behavior is in line with https://github.com/mixmark-io/turndown?tab=readme-ov-file#escaping-markdown-characters.

Such change might be quite simple - adding test case to [test/index.html](https://github.com/mixmark-io/turndown/blob/master/test/index.html and adding replace to commonmark-rules.js#.

shaipetel commented 7 months ago

<a href="https://google.com/file\(1\).jpg">link</a>
I see. But that would be invalid URL in the HTML input. I don't think this library should take something like that into account.

I think we should add "aggressive" parenthesis escaping (i.e. don't bother checking if the parenthesis are balanced, simply escape them every time). Such behavior is in line with https://github.com/mixmark-io/turndown?tab=readme-ov-file#escaping-markdown-characters.

Such change might be quite simple - adding test case to [test/index.html](https://github.com/mixmark-io/turndown/blob/master/test/index.html and adding replace to commonmark-rules.js#.

Yeah, totally agree! thanks.

shaipetel commented 4 months ago

I finally got around to verify this issue, however it was fixed for links but not for images.

this html:

<img src="https://google.com/file 1).jpg" />

produces this markdown:

![](https://google.com/file 1).jpg)

In this example I used an unbalanced bracket, as per earlier comments. I can confirm the fix for links handles both balanced and unbalanced brackets, which would be my preferred fix.

using 7.2.0