valeriangalliat / markdown-it-anchor

A markdown-it plugin that adds an `id` attribute to headings and optionally permalinks.
The Unlicense
291 stars 72 forks source link

Headers in html_block Tokens aren't given slugs #105

Closed shellscape closed 3 years ago

shellscape commented 3 years ago

When headers (<h2 .../> etc) are within tokens with .type === 'html_block' the plugin doesn't assign IDs nor create slugs. I've been using Vue's README.md as a baseline for that https://github.com/vuejs/vue/blob/dev/README.md

valeriangalliat commented 3 years ago

Oh, that's a tricky one. Currently markdown-it-anchor works by identifying the Markdown headings in the markdown-it token stream (the heading_open tokens), and doesn't do anything with html_block tokens (which is what the Vue.js readme is from markdown-it's perspective) as you noticed.

Adding anchors inside plain HTML would mean parsing that HTML in order to apply the anchor logic to it, which would mean writing equivalent HTML DOM renderers for each of the existing "markdown-it token stream" permalinks, and ensuring they behave in a consistent way.

Or at that point to avoid duplicating the logic, changing the way markdown-it-anchor works so that it's not a markdown-it plugin anymore but an HTML post-processor, e.g. you could do htmlAnchor(mdit.render(markdownString)) and get the desired output, regardless if the headings were defined with Markdown or HTML syntax, but that's out of scope for markdown-it-anchor as it's by design a markdown-it plugin.

I never looked for such a thing and after a very quick search I didn't find an existing library to do that on the backend (it's all in-browser JS solutions...). I would encourage you to write such a library as in retrospective, adding permalinks is probably more of an HTML post-processor job than a Markdown parser plugin job.

That being said, a quick solution would be to make a markdown-it plugin that's inserted before markdown-it-anchor and that identifies such headings in html_block tokens and converts them to heading_open, inline and heading_close tokens so that markdown-it-anchor can work with them. That should be easy to do in the Vue.js readme case, but it seems like a pretty complex problem if you want to support more than the happy path.

shellscape commented 3 years ago

What I've done as a stop-gap is to use cheerio on html_block nodes, and look for headers (h1+) and duplicate some of the stuff that this plugin is doing. It's not ideal, but it gets me part of the way there. cheerio is pretty lightweight, but I'm not sure about its browser compatibility or impact (if that audience is one of your targets for this plugin).

If cheerio use isn't something you're on board with (and I wouldn't blame you), could a hook function in options be implemented for html_block nodes, which would provide access to the opts and allow us to reuse the permalink methods in this plugin? That way, users like myself could keep the cheerio install local and not burden the plugin with html-processing.

valeriangalliat commented 3 years ago

I don't want to force the dependency on cheerio, but peerDependencies would help with that. If cheerio is explicitly installed, we use it to process HTML blocks, otherwise the feature is off.

But that's the tip of the iceberg. markdown-it-anchor is meant to deal with a markdown-it token stream both as input and output. We can use cheerio to identify matching headings in html_block tokens, then building a "fake" token stream for it, e.g.:

const tokens = [
  { type: 'heading_open', attributes: [...], level: ... },
  { type: 'inline', children: [...] },
  { type: 'heading_close' }
]

Then calling the anchor renderer like anchor.permalink.whatever('slug', {}, tokens, 0), and using the markdown-it renderer to get the output HTML for the modified token stream, and putting that back inside the DOM with cheerio, and finally serializing the whole thing back to HTML to update the initial html_block contents. But even then to build the children list you'd need essentially a HTML/cheerio to markdown-it token stream parser, which seems like a pretty complex piece of software to write.

Or alternatively, finding a way to reliably turn <div>arbitrary html<h2>foo</h2>arbitrary html<h2>bar</h2>arbitrary html</div> into a stream like:

Then this could be done as a pre-markdown-it-anchor transform, which would be better as it wouldn't require any code change in the core of markdown-it-anchor, but you still have the same challenge about building the inline children tokens.

And I'm not sure cheerio would support splitting the HMTL that way, but it's something to look into.


While the happy path seems easy to implement, I can see many ways this would break or give unexpected results, and I don't want to include a just-the-happy-path solution in markdown-it-anchor.

And I genuinely think there's less work in making an HTML post-processor that adds permalinks the same way markdown-it-anchor does than trying to make markdown-it-anchor's not-at-all-designed-for-html code work with HTML (without essentially rewriting the core of the plugin as well as existing permalink renderers).

I'll leave that issue open if anybody wants to tackle this or can think of better solutions than those. :)

shellscape commented 3 years ago

All that makes total sense, thanks for taking the time on this. I'll give this some more thought today. It's a need for us, as we're parsing a lot of markdown from Github, so I'll definitely stay on top of it.

valeriangalliat commented 3 years ago

Awesome, I'm curious to see what you come up with. FWIW if you need extra help on this issue, I have up to 20 hours I can dedicate to contracting next week or in the second half of September. Feel free to reach out to my personal email for this :)

shellscape commented 3 years ago

I've got a working POC for converting html headings into heading_* tokens, but it's not pretty. Going to wait on some guidance from the markdown-it maintainers. This may be out of scope for this plugin, and I'm starting to think anything that processes additional HTML into tokens should probably be at the top of the stack. Will update when I have more to share.

shellscape commented 3 years ago

OK so got some feedback from markdown-it and the implementation looks OK. There are probably edge cases that haven't been considered, but here's my implementation:

import cheerio from 'cheerio';
import MarkdownIt from 'markdown-it';
import Token from 'markdown-it/lib/token';

export default function htmlHeaders(md: MarkdownIt) {
  md.core.ruler.after('inline', 'html-headers', (state) => {
    state.tokens.forEach((blockToken) => {
      if (blockToken.type !== 'html_block') {
        return;
      }
      const $ = cheerio.load(`${blockToken.content}`, { xmlMode: true });
      const headings = $('h1,h2,h3,h4,h5,h6');

      if (!headings.length) {
        return;
      }

      const { map } = blockToken;

      headings.each((_, e) => {
        const { tagName } = e;
        const level = parseInt(tagName.substring(1), 10);
        const markup = ''.padStart(level, '#');
        const element = $(e);

        const open = new Token('heading_open', tagName, 1);
        open.markup = markup;
        open.map = map;

        Object.entries(e.attribs).forEach(([key, value]) => {
          open.attrSet(key, value);
        });

        const content = new Token('text', '', 0);
        content.map = map;
        content.content = element.text() || '';

        const body = new Token('inline', '', 0);
        body.content = content.content;
        body.map = map;
        body.children = [content];

        const close = new Token('heading_close', tagName, -1);
        close.markup = markup;

        const position = state.tokens.indexOf(blockToken);
        state.tokens.splice(position, 0, open, body, close);

        element.remove();
      });

      // eslint-disable-next-line no-param-reassign
      blockToken.content = $.html();
    });

    return false;
  });
}

That has compatibility with this plugin as well as the markdown-it-toc-done-right plugin. It would be cool to see this plugin handle html headings as well, but I would certainly understand if you felt this should be a separate plugin.

valeriangalliat commented 3 years ago

Wow that's pretty cool! I just found out on your other issue that npm was using markdown-it and a similar approach to parse HTML headers, I had no idea.

At that point I would just reuse npm's implementation, e.g. md.use(require('@npmcorp/marky-markdown/lib/plugin/html-heading')), even though the implementation is likely to give unexpected results on edge cases, if it works good enough for npm it's probably going to be ok in most cases.

I'll add a line to document that in the readme because it's good to know, but I wouldn't add that code to markdown-it-anchor unless it actually works with edge cases too.

But at that point I think it would be easier to add the anchors on the HTML rather than during the Markdown parsing phase, e.g.:

const { parse } = require('node-html-parser')
const root = parse(html)

for (const h of root.querySelectorAll('h1,h2,h3,h4,h5,h6')) {
  const slug = h.getAttribute('id') || slugify(h.textContent)
  h.setAttribute('id', slug)
  h.insertAdjacentHTML('afterbegin', `<a class="header-anchor" href="#${slug}">#</a> `)
}

console.log(root.toString())
shellscape commented 3 years ago

Yeah I think we can close this one. I do want to note that NPM's code doesn't pick up headers with attributes, as their regex is pretty constrained. I'm left wondering if they use this code at all anymore. See the Vue rendered readme here: https://www.npmjs.com/package/vue. I ran that readme through marky-markdown and that Supporting Vue.js header wasn't assigned an anchor, yet on npmjs.com it is. I ended up sticking with an implementation that uses cheerio to catch that during markdown parsing - the number one reason was markdown-it-toc-done-right compatibility - if the headers are added after the fact, they don't make it into the toc.

valeriangalliat commented 3 years ago

Oh wow, right. The more I think about it, the more I feel like all of this (anchors, TOC) should be done at the HTML level instead of Markdown, which is likely how GitHub not only has anchors and TOC on Markdown files, but also RST, and Org mode

shellscape commented 3 years ago

That's a valid point.