Raw HTML rendered as text

The problem

According to the original Markdown spec, the CommonMark spec and the GitHub Flavored Markdown spec, HTML blocks are valid Markdown. However, when HTML is used in Markdown with svelte-exmarkdown, it gets rendered as text instead:

Input:

# Hello
<div>hello</div>

Output:

<h1>Hello</h1>
&lt;div&gt;hello&lt;/div&gt;

I understand that this might be the intended output for some use cases. For example, if the Markdown source is supplied by the user, this approach is the safest way to avoid XSS. However, there are situations where one might want to allow a subset of HTML, or even all HTML, if the Markdown source is trusted.

Detailed investigation

Logging the AST in each stage reveals that the parsed Markdown tree:

{
  type: "root",
  children: [
    {
      type: "heading",
      depth: 1,
      children: [{ type: "text", value: "Hello" }],
    },
    {
      type: "html",
      value: "<div>hello</div>",
    },
  ],
}

... gets turned into the following HTML AST (HAST):

{
  type: "root",
  children: [
    {
      type: "element",
      tagName: "h1",
      children: [{ type: "text", value: "Hello" }],
    },
    {
      type: "raw",
      value: "<div>hello</div>",
    },
  ],
}

I don't understand how exactly the renderer decides to render type: "raw" as a text node, but somehow it does.

Non-solution

Note, however, that this can't be solved by adding a rendering rule that would output {@html value}, because the raw HTML may contain unmatched tags. For example, consider this Markdown snippet:

<div>

# Hello
</div>

In the unified pipeline, it gets turned into the following HAST:

{
  type: "root",
  children: [
    { type: "raw", value: "<div>" },
    {
      type: "element",
      tagName: "h1",
      children: [{ type: "text", value: "Hello" }],
    },
    { type: "raw", value: "</div>" },
  ],
}

Attempting to render this HAST using the {@html ...} directive would output the following result:

<div></div>
<h1>Hello</h1>
<div></div>

(see for example this demonstration), rather than the intended tree:

<div>
  <h1>Hello</h1>
</div>

Workaround

Since the AST can contain effectively unparsed tokens, the most straightforward and robust solution seems to be to stringify and re-parse it. This is an example of a svelte-exmarkdown plugin that does just that:

import type { Plugin } from 'svelte-exmarkdown';
import { unified } from 'unified';
import { toHtml } from 'hast-util-to-html';
import rehypeParse from 'rehype-parse';

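export const rawHtml: Plugin = {
    rehypePlugin: () => (tree) => {
        // Serialize the whole tree, emitting `raw` nodes verbatim instead of escaping them.
        const html = toHtml(tree, { allowDangerousHtml: true });
        // Re-parse as a fragment so the formerly raw HTML becomes real element nodes.
        return unified().use(rehypeParse, { fragment: true }).parse(html);
    },
};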
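
To illustrate why stringifying first resolves the unmatched-tag problem, here is a dependency-free sketch. The `serialize` function and the `Hast*` types are hypothetical, simplified stand-ins (not the real hast-util-to-html or hast typings); they only cover the node shapes shown in this issue and ignore attributes. The point is that once the raw fragments are concatenated with their siblings, the opening and closing tags end up in one string and can be parsed as a pair.

```typescript
// Simplified stand-ins for the HAST node shapes shown above (assumption: no attributes).
type HastText = { type: 'text'; value: string };
type HastRaw = { type: 'raw'; value: string };
type HastElement = { type: 'element'; tagName: string; children: HastNode[] };
type HastRoot = { type: 'root'; children: HastNode[] };
type HastNode = HastText | HastRaw | HastElement | HastRoot;

// Escape the characters that would otherwise be read as markup inside text.
function escapeText(value: string): string {
  return value.replace(/&/g, '&amp;').replace(/</g, '&lt;');
}

// Hypothetical minimal serializer: `raw` nodes are emitted verbatim,
// which mirrors what `allowDangerousHtml: true` is used for in the plugin.
function serialize(node: HastNode): string {
  switch (node.type) {
    case 'text':
      return escapeText(node.value);
    case 'raw':
      return node.value;
    case 'element':
      return `<${node.tagName}>${node.children.map(serialize).join('')}</${node.tagName}>`;
    case 'root':
      return node.children.map(serialize).join('');
  }
}

// The HAST from the "Non-solution" section, with its two unmatched raw fragments.
const tree: HastRoot = {
  type: 'root',
  children: [
    { type: 'raw', value: '<div>' },
    {
      type: 'element',
      tagName: 'h1',
      children: [{ type: 'text', value: 'Hello' }],
    },
    { type: 'raw', value: '</div>' },
  ],
};

const html = serialize(tree);
console.log(html); // <div><h1>Hello</h1></div>
```

Feeding a string like this to an HTML parser is what lets the `<div>` wrap the heading, which per-node rendering cannot achieve.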

A serious problem with this approach is that it cannot distinguish harmless HTML from an XSS attempt, so it is only ever useful if one has absolute control over the Markdown source. A somewhat less serious problem is the loss of the AST's metadata (e.g. the position props).
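
For the "subset of HTML" case, one possible direction is to filter the re-parsed tree against an allowlist before rendering. The sketch below is only an illustration of that idea: the `Parsed*` types, `ALLOWED_TAGS`, and `filterTree` are hypothetical names, the tree shape is simplified, and this is not a vetted sanitizer. For anything user-facing, an established sanitizer such as rehype-sanitize is likely the safer choice.

```typescript
// Simplified shapes for a tree that has already been re-parsed (no `raw` nodes left).
type ParsedText = { type: 'text'; value: string };
type ParsedElement = {
  type: 'element';
  tagName: string;
  properties: Record<string, string>;
  children: ParsedChild[];
};
type ParsedChild = ParsedText | ParsedElement;
type ParsedRoot = { type: 'root'; children: ParsedChild[] };

// Hypothetical allowlist; adjust to whatever subset is acceptable.
const ALLOWED_TAGS = new Set(['div', 'span', 'p', 'h1', 'em', 'strong']);

function filterChildren(children: ParsedChild[]): ParsedChild[] {
  const kept: ParsedChild[] = [];
  for (const child of children) {
    if (child.type === 'text') {
      kept.push(child);
    } else if (ALLOWED_TAGS.has(child.tagName)) {
      // This sketch drops every attribute, which also removes event handlers
      // such as `onclick`; a real sanitizer would allow a safe subset instead.
      kept.push({ ...child, properties: {}, children: filterChildren(child.children) });
    }
    // Elements outside the allowlist are dropped together with their subtree.
  }
  return kept;
}

function filterTree(root: ParsedRoot): ParsedRoot {
  return { type: 'root', children: filterChildren(root.children) };
}

// Example: a <div onclick> containing a heading and a <script>.
const parsed: ParsedRoot = {
  type: 'root',
  children: [
    {
      type: 'element',
      tagName: 'div',
      properties: { onclick: 'alert(1)' },
      children: [
        { type: 'element', tagName: 'h1', properties: {}, children: [{ type: 'text', value: 'Hello' }] },
        { type: 'element', tagName: 'script', properties: {}, children: [{ type: 'text', value: 'alert(2)' }] },
      ],
    },
  ],
};

// The <script> subtree and the onclick attribute are gone; the div and h1 remain.
const cleaned = filterTree(parsed);
```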
