Parsing HTML tags inside Markdown

rChaoz commented 7 months ago

Problem

Consider the following Markdown:

This is some Markdown with a <kbd>Kbd</kbd> tag

I would expect the resulting tree to be:

"This is some Markdown with a "
<kbd> element
- "Kbd"
" tag"

Instead, it is:

"This is some Markdown with a "
<kbd> (raw)
"Kbd"
</kbd> (raw)
" tag"

This causes problems. For example, I'm using rehype-class-names to apply the correct classes to tags for styling, and it doesn't apply any classes to the kbd element in this example.
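To illustrate why the classes are not applied, here is a minimal sketch. It assumes the class-adding plugin only matches nodes of type "element", which is an assumption about its behaviour, not its actual source. addClassToTag is a hypothetical helper written for this example, and the two trees are simplified versions of the ones above.

```javascript
// Hypothetical helper, loosely mimicking a class-adding rehype plugin:
// it only touches nodes of type "element" whose tagName matches.
function addClassToTag(nodes, tagName, className) {
  let matched = 0
  for (const node of nodes) {
    if (node.type === 'element' && node.tagName === tagName) {
      node.properties = { ...node.properties, className: [className] }
      matched += 1
    }
    // Recurse into children so nested elements are found as well.
    if (Array.isArray(node.children)) {
      matched += addClassToTag(node.children, tagName, className)
    }
  }
  return matched
}

// The tree I actually get: the tags are opaque "raw" nodes.
const rawTree = [
  { type: 'text', value: 'This is some Markdown with a ' },
  { type: 'raw', value: '<kbd>' },
  { type: 'text', value: 'Kbd' },
  { type: 'raw', value: '</kbd>' },
  { type: 'text', value: ' tag' }
]

// The tree I expected: a real <kbd> element with a text child.
const elementTree = [
  { type: 'text', value: 'This is some Markdown with a ' },
  {
    type: 'element',
    tagName: 'kbd',
    properties: {},
    children: [{ type: 'text', value: 'Kbd' }]
  },
  { type: 'text', value: ' tag' }
]

const rawMatches = addClassToTag(rawTree, 'kbd', 'key')
const elementMatches = addClassToTag(elementTree, 'kbd', 'key')
console.log(rawMatches, elementMatches) // 0 1
```

With the raw tree there is no element node to match, so nothing gets a class.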

Questions

I believe this is intended: handle the Markdown and leave everything else as-is (with the "raw" type). However, is there a way to achieve what I'm trying to do?

Also, I'm talking about pure .md files, not a Svelte-Markdown mix (.svx), so there is only plain HTML to parse in my case.

PeppeL-G commented 4 months ago

I'm no markdown expert, but I've looked into this a little bit.

Markdown doesn't know HTML. So when parsing the Markdown code in the first step, the best the parser can do is to produce {"type": "html", "value": "<kbd>"} nodes whenever it encounters HTML syntax. Since it doesn't know HTML, it has no idea whether <kbd> is a void element (which can't have children) or an ordinary element (which can), so it just produces a flat sequence of nodes for the HTML code and doesn't try to build a proper HTML tree.

The resulting Markdown Abstract Syntax Tree (mdast) is then converted into an HTML AST (hast) using remark-rehype. Rehype is aware of how HTML works, but this step doesn't try to build a proper HTML tree either. Instead, it just converts the mentioned node to {"type": "raw", "value": "<kbd>"}. But there is a plugin named rehype-raw that you can run on this hast to make it produce the structure you wanted. So run that plugin before you run rehype-class-names, and I think it will work.
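To make the before/after shapes concrete, here is a toy sketch of the transformation. It is not how rehype-raw is implemented (as far as I know, the real plugin reparses the tree with a full HTML parser). nestRawNodes is a hypothetical function that only understands bare <name> and </name> raw nodes, with no attributes, void elements, or error recovery.

```javascript
// Toy illustration only: folds a flat list of hast nodes, where tags appear
// as separate "raw" nodes, into properly nested "element" nodes.
function nestRawNodes(children) {
  const root = { type: 'root', children: [] }
  const stack = [root] // currently open parents, innermost last

  for (const node of children) {
    const parent = stack[stack.length - 1]
    const isRaw = node.type === 'raw'
    const open = isRaw ? /^<([a-z][a-z0-9]*)>$/i.exec(node.value) : null
    const close = isRaw ? /^<\/([a-z][a-z0-9]*)>$/i.exec(node.value) : null

    if (open) {
      // Opening tag: create an element and make it the current parent.
      const element = {
        type: 'element',
        tagName: open[1].toLowerCase(),
        properties: {},
        children: []
      }
      parent.children.push(element)
      stack.push(element)
    } else if (close && stack.length > 1 && parent.tagName === close[1].toLowerCase()) {
      // Matching closing tag: go back to the enclosing parent.
      stack.pop()
    } else {
      // Text and anything this sketch does not understand is kept as-is.
      parent.children.push(node)
    }
  }
  return root.children
}

const flat = [
  { type: 'text', value: 'This is some Markdown with a ' },
  { type: 'raw', value: '<kbd>' },
  { type: 'text', value: 'Kbd' },
  { type: 'raw', value: '</kbd>' },
  { type: 'text', value: ' tag' }
]

const nested = nestRawNodes(flat)
console.log(JSON.stringify(nested, null, 2))
```

The five flat nodes become three: a text node, a kbd element containing the "Kbd" text, and a trailing text node. Plugins that look for element nodes can then find the kbd element.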

For documentation, here's the code I played around with:

import remarkParse from 'remark-parse'
import remarkRehype from 'remark-rehype'
import { unified } from 'unified'
import rehypeStringify from 'rehype-stringify'
import rehypeRaw from 'rehype-raw'

const doc = `
This <kbd>Kbd</kbd> tag
`

const file = await unified()
    // Markdown source -> mdast
    .use(remarkParse)
    .use(() => function (tree) {
        console.log(`mdast`, JSON.stringify(tree, null, "  "))
    })
    // mdast -> hast; allowDangerousHtml keeps embedded HTML as "raw" nodes
    .use(remarkRehype, {allowDangerousHtml: true})
    .use(() => function (tree) {
        console.log(`hast 1`, JSON.stringify(tree, null, "  "))
    })
    // Turn the "raw" nodes into proper element nodes
    .use(rehypeRaw)
    .use(() => function (tree) {
        console.log(`hast 2`, JSON.stringify(tree, null, "  "))
    })
    // hast -> HTML string
    .use(rehypeStringify)
    .process(doc)

console.log(String(file))

rChaoz commented 4 months ago

Thank you! This is exactly what I needed: it turns the raw nodes into regular HTML nodes, so the rehype plugins that follow work correctly.

I think it might be a good idea to add this plugin directly into MDsveX, as it already uses remark-rehype with allowDangerousHtml: true, so it is intended to allow HTML inside the Markdown content.

pngwn / MDsveX

Parsing HTML tags inside Markdown #598
