ssssota / svelte-exmarkdown

Svelte component to render markdown.
https://ssssota.github.io/svelte-exmarkdown
MIT License

Raw HTML rendered as text #84

Open cshaa opened 1 year ago

cshaa commented 1 year ago

The problem

According to the original Markdown spec, the CommonMark spec and the GitHub Flavored Markdown spec, HTML blocks are valid Markdown. However, when HTML is used in Markdown with svelte-exmarkdown, it is rendered as text instead:

Input:

# Hello
<div>hello</div>

Output:

<h1>Hello</h1>
&lt;div&gt;hello&lt;/div&gt;

I understand that this might be the intended output for some use cases. For example, if the Markdown source is supplied by the user, this approach is the safest way to avoid XSS. However, there are situations where one might want to allow a subset of HTML, or even all HTML, if the Markdown source is trusted.

Detailed investigation

Logging the AST in each stage reveals that the parsed Markdown tree:

{
  type: "root",
  children: [
    {
      type: "heading",
      depth: 1,
      children: [{ type: "text", value: "Hello" }],
    },
    {
      type: "html",
      value: "<div>hello</div>",
    },
  ],
}

... gets turned into the following HTML AST:

{
  type: "root",
  children: [
    {
      type: "element",
      tagName: "h1",
      children: [{ type: "text", value: "Hello" }],
    },
    {
      type: "raw",
      value: "<div>hello</div>",
    },
  ],
}

I don't understand how exactly the renderer decides to render type: "raw" as a text node, but somehow it does.
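One plausible mechanism, sketched here purely as an assumption and not taken from the svelte-exmarkdown source: if the renderer has no dedicated rule for `raw` and interpolates its `value` the same way as a `text` node, the markup gets escaped. The node types and the `render` function below are hypothetical and reduced to the fields shown in the trees above.

```typescript
// Hypothetical node shapes, limited to the fields that appear in the logged ASTs.
type HastNode =
  | { type: "root"; children: HastNode[] }
  | { type: "element"; tagName: string; children: HastNode[] }
  | { type: "text"; value: string }
  | { type: "raw"; value: string };

// Escape the characters that would otherwise be interpreted as markup.
function escapeHtml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Assumption: "raw" falls through to the same escaping path as "text".
function render(node: HastNode): string {
  switch (node.type) {
    case "root":
      return node.children.map(render).join("\n");
    case "element":
      return `<${node.tagName}>${node.children.map(render).join("")}</${node.tagName}>`;
    case "text":
    case "raw":
      return escapeHtml(node.value);
  }
}

const tree: HastNode = {
  type: "root",
  children: [
    { type: "element", tagName: "h1", children: [{ type: "text", value: "Hello" }] },
    { type: "raw", value: "<div>hello</div>" },
  ],
};

const rendered = render(tree);
console.log(rendered);
```

Under that assumption, the sketch reproduces the output reported at the top of this issue.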

Non-solution

Note, however, that this can't be solved by adding a rendering rule that would output {@html value}, because the raw HTML may contain unmatched tags. For example, consider this Markdown snippet:

<div>

# Hello
</div>

In the unified pipeline, it will get turned into the following HAST:

{
  type: "root",
  children: [
    { type: "raw", value: "<div>" },
    {
      type: "element",
      tagName: "h1",
      children: [{ type: "text", value: "Hello" }],
    },
    { type: "raw", value: "</div>" },
  ],
}

Attempting to render this HAST using the {@html ...} directive would output the following result:

<div></div>
<h1>Hello</h1>
<div></div>

(see for example this demonstration), rather than the intended tree:

<div>
  <h1>Hello</h1>
</div>

Workaround

Since the AST can contain effectively unparsed tokens, the most straightforward and robust solution seems to be to stringify and re-parse it. This is an example of a svelte-exmarkdown plugin that does just that:

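import type { Plugin } from 'svelte-exmarkdown';
import { unified } from 'unified';
import { toHtml } from 'hast-util-to-html';
import rehypeParse from 'rehype-parse';

// Only safe for trusted Markdown: raw HTML is passed through unsanitized.
export const rawHtml: Plugin = {
    rehypePlugin: () => (node) => {
        // Serialize the whole tree, emitting `raw` nodes verbatim.
        const str = toHtml(node, { allowDangerousHtml: true });
        // Re-parse the string as a fragment so that tags split across
        // several `raw` nodes are matched up again.
        return unified().use(rehypeParse, { fragment: true }).parse(str);
    },
};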
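To illustrate why stringifying the whole tree helps with unmatched tags, here is a dependency-free sketch of the serialization step. The `serialize` function and the node types are hypothetical stand-ins that only mimic the relevant effect of `toHtml` with `allowDangerousHtml: true` (emitting `raw` values verbatim); they are not the library's implementation.

```typescript
// Hypothetical node shapes, limited to the fields that appear in the logged ASTs.
type HastNode =
  | { type: "root"; children: HastNode[] }
  | { type: "element"; tagName: string; children: HastNode[] }
  | { type: "text"; value: string }
  | { type: "raw"; value: string };

// Escape text content so it cannot be mistaken for markup.
function escapeText(value: string): string {
  return value.replace(/&/g, "&amp;").replace(/</g, "&lt;");
}

// Serialize the tree into one string. "raw" values are emitted as-is,
// so an opening tag in one node and its closing tag in another node
// end up in the same string, around the nodes between them.
function serialize(node: HastNode): string {
  switch (node.type) {
    case "root":
      return node.children.map(serialize).join("");
    case "element":
      return `<${node.tagName}>${node.children.map(serialize).join("")}</${node.tagName}>`;
    case "text":
      return escapeText(node.value);
    case "raw":
      return node.value;
  }
}

// The HAST from the "Non-solution" section, with the <div> split in two.
const unmatched: HastNode = {
  type: "root",
  children: [
    { type: "raw", value: "<div>" },
    { type: "element", tagName: "h1", children: [{ type: "text", value: "Hello" }] },
    { type: "raw", value: "</div>" },
  ],
};

const html = serialize(unmatched);
console.log(html);
```

Because the two halves of the `<div>` are joined into one string before any parsing happens, a subsequent fragment parse sees a single, properly nested element instead of two isolated fragments.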

A serious problem with this approach is that it cannot distinguish harmless HTML from an XSS attempt, so it is only useful if one has absolute control over the Markdown source. A somewhat less serious problem is the loss of the AST's metadata (e.g. the position props).

ssssota commented 1 year ago

I implemented it this way deliberately, with this part of the specification in mind. The goal is to mitigate risks such as XSS by not rendering potentially dangerous HTML input as-is.

To work around this behavior, I recommend using rehype-raw. You can try it out by checking the 'HTML' checkbox in the playground at https://ssssota.github.io/svelte-exmarkdown/.
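A minimal sketch of what that configuration might look like, assuming the `Plugin` type accepts a `rehypePlugin` entry the same way as in the plugin posted above, and that plugins run in array order. The names `trustedHtml` and `sanitizedHtml` are made up for illustration; adding `rehype-sanitize` is an optional extra that is not mentioned in this thread.

```typescript
import type { Plugin } from 'svelte-exmarkdown';
import rehypeRaw from 'rehype-raw';
import rehypeSanitize from 'rehype-sanitize';

// For trusted Markdown only: rehype-raw re-parses `raw` nodes
// into regular elements, so embedded HTML is rendered as HTML.
export const trustedHtml: Plugin[] = [{ rehypePlugin: rehypeRaw }];

// For less trusted Markdown (assumption: plugins are applied in order):
// parse the raw HTML first, then strip anything outside rehype-sanitize's
// default allow-list.
export const sanitizedHtml: Plugin[] = [
    { rehypePlugin: rehypeRaw },
    { rehypePlugin: rehypeSanitize },
];
```

Either array would then be passed to the component through its plugins option. Compared with the stringify-and-re-parse plugin, this route should also address the concern about allowing only a subset of HTML when sanitizing is enabled.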

On a slightly different note, I am currently working towards v3, and as part of that, I plan to update the documentation to make it more user-friendly.

Thank you for using the library and for creating the issue!