showdownjs / showdown

A bidirectional Markdown to HTML to Markdown converter written in Javascript
http://www.showdownjs.com/
MIT License

HTML inside <code> elements is escaped #819

Open ExpHP opened 4 years ago

ExpHP commented 4 years ago

Input Markdown:

<code><span>a</span></code>

Expected HTML output: (from 28 out of 31 converters tested on babelmark)

<p><code><span>a</span></code>
</p>

Actual output: (from showdown and one other parser)

<p><code>&lt;span&gt;a&lt;/span&gt;</code>
</p>

Quoting Daringfireball: (emphasis added)

Similarly, because Markdown supports inline HTML, if you use angle brackets as delimiters for HTML tags, Markdown will treat them as such. But if you write:

4 < 5

Markdown will translate it to:

4 &lt; 5

However, inside Markdown code spans and blocks, angle brackets and ampersands are always encoded automatically. This makes it easy to use Markdown to write about HTML code. (As opposed to raw HTML, which is a terrible format for writing about HTML syntax, because every single < and & in your example code needs to be escaped.)

I included the last paragraph to emphasize that it says "Markdown code spans." My interpretation of this (backed by the babelmark link posted above) is that the phrase refers specifically to Markdown backtick syntax, i.e. `<span>a</span>`, and not to <code>, which is an inline HTML element.
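To make the distinction concrete, here is a minimal sketch of the behavior described above. It is not showdown's implementation: the names `escapeHtml` and `renderBacktickSpans` are hypothetical, only single-backtick spans are handled, and the rest of Markdown is ignored.

```javascript
// Hypothetical helper, for illustration only.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Converts single-backtick code spans to <code>, escaping their contents.
// Raw HTML, including a literal <code> element, passes through untouched.
function renderBacktickSpans(markdown) {
  return markdown.replace(
    /`([^`]+)`/g,
    (_, body) => `<code>${escapeHtml(body)}</code>`
  );
}

console.log(renderBacktickSpans('`<span>a</span>`'));
// <code>&lt;span&gt;a&lt;/span&gt;</code>

console.log(renderBacktickSpans('<code><span>a</span></code>'));
// <code><span>a</span></code>
```

Under this reading, escaping is part of converting a backtick span, so an inline HTML <code> element is never touched by it.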

haydenlinder commented 3 years ago

Did you find a solution?

ExpHP commented 3 years ago

That really depends on how you define "solution."

https://github.com/ExpHP/thpages/blob/ab4512b2839ab56490c1c38fccaf14c4604080fc/js/markdown.ts#L39-L49

And that's only made somewhat simple by relying on a few key facts about the language I'm highlighting. The workaround I used to use was a bit more general, but also completely unmaintainable.

https://github.com/ExpHP/thpages/blob/d1a5750184273de90ef22ddf10dd84cdeb27ee0c/js/showdown-ext.js#L16-L39

(also, both of these were only run on trusted input)

ExpHP commented 3 years ago

Oh, apparently I misremembered this issue and thought it was an issue in highlightjs rather than showdown.

I think those snippets I posted are still related, but IIRC they are working around an even greater issue (which arises from the interaction between this bug and highlightjs), so my apologies if they seemed confusing. (Then again, the point was mainly just to show that I don't have a good solution)

haydenlinder commented 3 years ago

After reading this comment https://github.com/showdownjs/showdown/issues/400#issuecomment-307668667 I see that it escapes angle brackets by design. I added a plugin to unescape them like so:

const showdown = require('showdown');

// Output extensions run on the generated HTML, so these two rules unescape
// every &lt; and &gt; in the document, not only those inside <code>.
const unescapeAngleBrackets = [
    {
        type: 'output',
        regex: /&lt;/g,
        replace: '<'
    },
    {
        type: 'output',
        regex: /&gt;/g,
        replace: '>'
    }
];

const converter = new showdown.Converter({
    extensions: [
        ...unescapeAngleBrackets,
    ]
});

ExpHP commented 3 years ago

Well, there's a trick here. Markdown codespans SHOULD be escaped, but HTML codespans should NOT. This means

`<a></a>`

<code><a></a></code>

should become

<code>&lt;a&gt;&lt;/a&gt;</code>

<code><a></a></code>

At first I thought this might be because the escaping is implemented after the conversion of ` ` to <code>, when it should instead be done as part of that conversion. Looking at the code, however, it does seem to be deliberate (just a misunderstanding of the spec), and this line, which appears to perform the escaping on <code>, should probably be eliminated:

https://github.com/showdownjs/showdown/blob/a9f38b6f057284460d6447371f3dc5dea999c0a6/src/subParsers/makehtml/hashCodeTags.js#L9
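The caveat about the two kinds of code span can be checked against the two replacement rules from the extension posted earlier. The sketch below applies the same replacements to a plain string, without showdown; the function wrapper and the `fromBackticks` variable are only for illustration.

```javascript
// The same two replacements as the output extension, applied to a string.
function unescapeAngleBrackets(html) {
  return html.replace(/&lt;/g, '<').replace(/&gt;/g, '>');
}

// What a backtick code span should produce: escaped, shown as literal text.
const fromBackticks = '<code>&lt;a&gt;&lt;/a&gt;</code>';

// After the global unescape, the literal text has become a real <a> element.
console.log(unescapeAngleBrackets(fromBackticks));
// <code><a></a></code>
```

So a global unescape makes HTML <code> behave as expected, but only by giving up the escaping that backtick code spans are supposed to keep. It also unescapes any &lt; or &gt; outside of code.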

ExpHP commented 3 years ago

Actually, looking at this reveals even more bugs. hashCodeTags is also "hashing" the contents of <code>, which AFAICT stops showdown from recursing into it. This results in an even wider class of bugs:

Input:

<code>`a`</code>

Correct output: (babelmark2). <code> should be treated like any other span element, and therefore Markdown links and codespans inside of it should be converted.

<p><code><code>a</code></code>
</p>

Showdown output:

<p><code>`a`</code>
</p>
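The effect of hashing on later passes can be sketched as follows. This is a simplified illustration of the general placeholder technique, not showdown's code: `hashCodeElements`, `convertBackticks`, `restore`, `render`, and the `@@HASH0@@` placeholder format are all assumptions made for the example.

```javascript
// Simplified illustration of "hashing"; all names here are hypothetical.
function hashCodeElements(text, stash) {
  // Swap each <code>...</code> element for an opaque placeholder.
  return text.replace(/<code>[\s\S]*?<\/code>/g, (match) => {
    stash.push(match);
    return `@@HASH${stash.length - 1}@@`;
  });
}

// Stand-in for a later Markdown pass: converts backtick code spans.
function convertBackticks(text) {
  return text.replace(/`([^`]+)`/g, '<code>$1</code>');
}

// Put the stashed elements back in place of their placeholders.
function restore(text, stash) {
  return text.replace(/@@HASH(\d+)@@/g, (_, i) => stash[Number(i)]);
}

function render(text, { hashFirst }) {
  const stash = [];
  const hashed = hashFirst ? hashCodeElements(text, stash) : text;
  return restore(convertBackticks(hashed), stash);
}

console.log(render('<code>`a`</code>', { hashFirst: true }));
// <code>`a`</code>

console.log(render('<code>`a`</code>', { hashFirst: false }));
// <code><code>a</code></code>
```

With hashing enabled, the backtick pass only ever sees the placeholder, which is why the inner codespan is left unconverted.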