rsms / markdown-wasm

Very fast Markdown parser and HTML generator implemented in WebAssembly, based on md4c
https://rsms.me/markdown-wasm/
MIT License

Unexpected escaping within code blocks #3

Closed by drwpow 3 years ago

drwpow commented 3 years ago

First of all: this is an amazing library! I love the anchor links auto-generated in headings. Fantastic 🎉

That said, there are some unexpected results when parsing code blocks. Here’s an example from Redux’s README:

  /**
-  * This is a reducer, a pure function with (state, action) =&gt; state signature.
+  * This is a reducer, a pure function with (state, action) => state signature.

 …

- store.subscribe(() =&gt; console.log(store.getState()))
+ store.subscribe(() => console.log(store.getState()))

This results in (caused mostly by the syntax highlighting library):

[Screenshot: Screen Shot 2020-09-14 at 7 23 58 PM]

Expected behavior would be to not HTML escape anything within those code blocks.

I don’t know of a README with an actual less-than comparison (e.g. x < y -> x &lt; y), but I’d suspect similar behavior with that, too.

If desired, I could make an attempt at a PR (but be warned—I’m a C n00b).

rsms commented 3 years ago

I think I understand. Let me repeat your concern, phrased in a different way to see if I get it.

So, given this markdown source code:

```
a => b
```

You expect the following HTML output:

<pre><code>a => b</code></pre>

If this is correct, then that is not going to work, since > is a special character in HTML (as I'm sure you know). For example, consider this markdown instead:

```
<script>alert(document.cookie)</script>
```

Without escaping, you would get the HTML output:

<pre><code><script>alert(document.cookie)</script></code></pre>

This would be bad: the browser would execute the script instead of displaying it as code.
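The escaping described above can be sketched roughly as follows. This is a minimal illustration, not markdown-wasm's actual implementation: `escapeHtml` and `renderCodeBlock` are hypothetical helper names, and the set of escaped characters is an assumption.

```javascript
// Hypothetical helper (not part of markdown-wasm): escapes the characters
// that carry meaning in HTML so they are displayed as text, not parsed as markup.
function escapeHtml(text) {
  const entities = { "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;" };
  return text.replace(/[&<>"]/g, (ch) => entities[ch]);
}

// Hypothetical helper: wraps raw source text in a code block,
// roughly the way a Markdown renderer would.
function renderCodeBlock(source) {
  return `<pre><code>${escapeHtml(source)}</code></pre>`;
}

console.log(renderCodeBlock("a => b"));
// <pre><code>a =&gt; b</code></pre>

console.log(renderCodeBlock("<script>alert(1)</script>"));
// <pre><code>&lt;script&gt;alert(1)&lt;/script&gt;</code></pre>
```

A browser renders `=&gt;` as `=>`, so the reader sees the original code, while the escaped `<script>` stays inert text.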


It seems to me that your end goal here is to process the code through a syntax highlighter. There may be better ways to go about that.

Option 1: you could use a syntax highlighter that works with HTML-escaped code, like highlight.js

Option 2: you could run the syntax highlighter on the markdown text, before you pass it on to markdown-wasm. However, if you do this, you won't be able to set NO_HTML_BLOCKS or NO_HTML_INLINE flags, which can be used to strengthen the safety of markdown-wasm, i.e. to avoid XSS issues.

Option 3: we could consider adding a feature to markdown-wasm where you set a flag, for example CDATA_CODE_BLOCKS, that, when set, outputs code blocks with verbatim code wrapped in <![CDATA[...]]>.
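If the highlighter in Option 1 expects raw rather than escaped source, one possible workaround is to decode the entities inside each code block, highlight the raw text, and let the highlighter escape its own output. The sketch below assumes the generated HTML wraps code in plain `<pre><code ...>` elements. `unescapeHtml`, `highlightCodeBlocks`, and `toyHighlight` are hypothetical names, not part of markdown-wasm or highlight.js; the toy highlighter only stands in for a real one.

```javascript
// Hypothetical helper: reverses basic entity escaping in generated HTML.
// "&amp;" is decoded last so that "&amp;lt;" becomes "&lt;" rather than "<".
function unescapeHtml(html) {
  return html
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&amp;/g, "&");
}

// Hypothetical helper: finds each <pre><code ...>...</code></pre> block,
// decodes its content, and passes the raw source to `highlight`.
// `highlight` must return HTML that is already safely escaped.
function highlightCodeBlocks(html, highlight) {
  return html.replace(
    /<pre><code([^>]*)>([\s\S]*?)<\/code><\/pre>/g,
    (_match, attrs, body) =>
      `<pre><code${attrs}>${highlight(unescapeHtml(body))}</code></pre>`
  );
}

// Toy stand-in for a real highlighter: escapes the source, then wraps
// arrow operators in a span.
function toyHighlight(source) {
  const escaped = source
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
  return escaped.replace(/=&gt;/g, '<span class="op">=&gt;</span>');
}

const input = "<pre><code>const f = () =&gt; {}</code></pre>";
console.log(highlightCodeBlocks(input, toyHighlight));
// <pre><code>const f = () <span class="op">=&gt;</span> {}</code></pre>
```

The parser's output stays escaped, so the safety argument above still holds; only the highlighter sees the raw text. A regular expression is a shortcut here, and a real pipeline would more likely walk the DOM.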

rsms commented 3 years ago

I've enabled highlight.js on the markdown-wasm website so you can try it out: https://rsms.me/markdown-wasm/#code-poetry

Try something like this and look at the HTML using your browser's web inspector:

```js
const f = () => {}
```

drwpow commented 3 years ago

That’s a fair point about injection. You’re right that things between <code></code> do need to be escaped sometimes; I was more or less wondering if => specifically needed to be escaped. I was comparing the output to remark-html, which leaves it as-is, but you make a good point that =&gt; should be essentially the same.

I agree that the responsibility probably lies with the highlighting library and not this parser. Maybe it’s just a fluke/bug that remark-html doesn’t escape => in certain scenarios.

Thanks for responding!