syntax-tree / mdast-util-from-markdown

mdast utility to parse markdown
MIT License
212 stars 20 forks source link

mdast-util-find-and-replace with HTML inside the Markdown #13

Closed revolunet closed 3 years ago

revolunet commented 3 years ago

Parsing MDX containing HTML

Hi,

Not sure this is a bug or a feature;

I'm trying to parse MDX which can contain some HTML. At the end i'm looking to use mdast-util-find-and-replace to replace some text terms in the markdown.

While trying to parse some markdown with mdast-util-from-markdown, i find a difference when the markdown starts with HTML tags

When parsing Some <b>HTML</b>content the tree looks correct and the <b> is a mdxJsxTextElement. mdast-util-find-and-replace can then operate on the text element.

Capture d’écran 2021-02-03 à 11 34 26

When parsing <p>Some HTML content</p>, i only get a single html element containing the <p> tags, and in that case mdast-util-find-and-replace is unable to replace the text

Capture d’écran 2021-02-03 à 11 34 53

There is a full repro with failing test case here for illustration : https://codesandbox.io/s/addglossary-in-mdx-zxyzw?file=/src/index.js

Expected behavior

Even starting with html tags, the MDX should be fully parsed

Actual behavior

The markdown is parsed as a single html node

wooorm commented 3 years ago

If HTML is peeking through, it sounds like https://github.com/micromark/micromark-extension-mdx-md isn‘t used, which turns it off?

wooorm commented 3 years ago

For your getTooltipElement function, I suggest to also build a node: https://github.com/syntax-tree/mdast-util-mdx-jsx#mdxjsxtextelement

revolunet commented 3 years ago

fixed the getTooltipElement thanks.

Looks like it works by adding mdxMd to fromMarkdown.extensions :

const tree = fromMarkdown(content, {
  extensions: [syntax({ acorn: acorn }), mdxMd],
  mdastExtensions: [mdxJsx.fromMarkdown]
});

this is wonderful and kind of magic, thank you 💙

revolunet commented 3 years ago

so this is indeed a feature and not a bug :)

wooorm commented 3 years ago

HTML is a black box to markdown, with MDX, less so. But I thought I’d split the adding JSX vs removing HTML parts, it’s a bit confusing but at this level I thought it wise to split it!

Sounds interesting, what you’re doing with MDX at such a low level btw!

wooorm commented 3 years ago

You can also use all of MDX (agnostic to JS) with https://github.com/micromark/micromark-extension-mdx, and gnostic to JS with https://github.com/micromark/micromark-extension-mdxjs!

revolunet commented 3 years ago

Thanks for the explanation, it helps.

Our current use case is adding "tooltip" capabilities to various content, based on a glossary, so we just need to "wrap" these texts with some specific container. This will be used client-side to trigger the pretty tooltip.

wooorm commented 3 years ago

What’s the reason for going full on JSX, instead of something more “markdown-esque” (not necessarily better, just, different): https://github.com/remarkjs/remark-directive

revolunet commented 3 years ago

Well for some reason our users can author MDX content, that mix text and components; we went that way because using components sounded natural in the mdx/react ecosystem, and we didnt even think about markdown directives :)

wooorm commented 3 years ago

ahh okay, yeah MDX is good for devs! :+1:

revolunet commented 3 years ago

in some cases, we get full HTML as input, can i use the same process as MDX to parse and replace ?

For example, in this HTML, the 2 keyword in the first paragraph dont get replaced while the next ones are well replaced 🤔

<p><a href="#">The keyword</a>blabla et une keyword de plus.</p>
<p><strong>keyword</strong></p>
<p>another keyword</p>

here's a sandbox with the repro : https://codesandbox.io/s/addglossary-in-mdx-zxyzw?file=/src/index.test.js

wooorm commented 3 years ago

Is in HTML, or JSX?

Some extra info that might help here: There is also quite a bit of difference between mdx-js/mdx 1, mdx-js/mdx 2 (latest beta), and mdx-js/mdx (main branch, which is similar to https://github.com/wooorm/xdm). The first treats html/jsx more like how markdown sees it: a “black box”, meaning: it’s one HTML node, not separate tags with content in between. The rest parse it as separate tags with content in between.

Regardless, you might be better off doing this in rehype space, so the HTML (JSX?) is already parsed? And I’d suggest to build a node instead of injecting a string of html (https://codesandbox.io/s/addglossary-in-mdx-zxyzw?file=/src/glossary.js:1092-1181), if that’s possible?

revolunet commented 3 years ago

Hi,

I that case this is HTML; so i understand we should better use a different "pipeline" based on input type : MDX or HTML ? I had some hope we could have the same parser but we can differentiate inputs if this makes things easier.

About adding a node instead adding a string of html, i did it but met an unexpected issue. will make a repro for that

wooorm commented 3 years ago

The pipelines might be different, but I’m guessing that a lot of plugins can be reused? Including this plugin, if it’s working on hast, it can be used in both places. If the input is different, having different pipelines seems the best way to go, to me, otherwise you’re going to run into problems with style="..." being incorrect in React, but style={...} being incorrect in HTML, <img> without the closing slash, and similar problems.

And do let me know about that other issue!

revolunet commented 3 years ago

I've updated the codesandbox and simplfied the test cases to use a mdxJsxTextElement as replacement but for some reason terms are sometimes replaced twice 🤔

wooorm commented 3 years ago

Looks like a bug!

wooorm commented 3 years ago

Here’s a reduces test case:

const findAndReplace = require('mdast-util-find-and-replace')
const toMarkdown = require('mdast-util-to-markdown')

function replace(match) {
  return {type: 'strong', children: [{type: 'text', value: match}]}
}

const tree = {type: 'paragraph', children: [{type: 'text', value: 'a b'}]}

findAndReplace(tree, [['never matches'], ['b', replace]])

console.log(toMarkdown(tree))

Which yields:

a ****b****

Note that the doubling only occurs when [b, replace] is not the first pattern. And it doesn’t occur if a is removed

wooorm commented 3 years ago

Solved btw: https://github.com/syntax-tree/mdast-util-find-and-replace/commit/c85728a0f7fe5b374230ca11d05fa84b99f4a9f5