remarkjs / ideas

Share ideas for new utilities and tools built with @remarkjs
https://remark.js.org

Async tokenizers #8

Closed thetutlage closed 6 years ago

thetutlage commented 6 years ago

By looking at the source code of remark-parse, I can confirm that tokenizers are not async in nature.

I am looking to see if there are any plans to make them async (I would be happy to work on the feature).

Use case

My use case is simple: I want to add support for partials by introducing custom grammar to the syntax.

[include="relative-path-to-markdown-file"]

Using the tokenizer, I simply include this file and re-process the content as markdown.
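A minimal sketch of how such a directive could be recognised (the regex and function name are my own illustration, not part of remark-parse):

```javascript
// Hypothetical matcher for the `[include="..."]` directive; the regex and
// function name are illustrative and not part of remark-parse.
const INCLUDE_RE = /^\[include="([^"]+)"\]/

function matchInclude (value) {
  const match = INCLUDE_RE.exec(value)
  return match ? { raw: match[0], path: match[1] } : null
}
```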

Alternatives

The file will be read using the fs module of Node.js. I could use readFileSync, but that defeats the purpose of having an async API in the first place.

Also, any decent-sized project will never compile one file at a time. Processing multiple files in parallel is of no use when the internals use the synchronous Node.js API (readFileSync).
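To make the parallelism point concrete, here is a standalone sketch (fakeRead is a stand-in for an async file read such as fs.promises.readFile; nothing here is remark API):

```javascript
// Standalone sketch: with an async read, several partials resolve
// concurrently; `fakeRead` stands in for fs.promises.readFile.
function fakeRead (path) {
  return new Promise(function (resolve) {
    setTimeout(function () { resolve('contents of ' + path) }, 5)
  })
}

// All reads start at once, so total time is roughly one read,
// not one read per file as with readFileSync.
function readAllPartials (paths) {
  return Promise.all(paths.map(function (path) { return fakeRead(path) }))
}
```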

Are you ready to work on the feature?

Yes

wooorm commented 6 years ago

I don’t believe there is any reason to do this in a tokeniser. Why not use a tokeniser to read it into some node, and then replace it in a plugin with a transform, which can already be async?

thetutlage commented 6 years ago

Doing it via a transform doesn't seem like a good idea to me (correct me if I am wrong):

  1. After reading the contents of a partial (which will be markdown), I cannot easily parse it inside a transformer. Inside a tokenizer, I can use this.tokenizeBlock and pass it the markdown.
  2. Getting the parser instance inside a transformer is not possible, unless I build some mechanism to create a new instance of unified, apply all plugins, pull the MDAST tree from it, and merge it with the original tree.
  3. If I am using a plugin to generate the TOC, then the TOC plugin generates its tree from the original content, since I am merging the partial tree inside the transformer at a later stage. So if my partial includes headings, they will never appear in the TOC.
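For what it's worth, the merge step itself, once a partial's tree exists, is plain array surgery on the parent node. A sketch with simplified stand-in node shapes (the helper name is mine, not a unist utility):

```javascript
// Replace `node` in `parent.children` with the children of a parsed subtree.
// Node shapes here are simplified stand-ins for real MDAST nodes.
function mergePartial (parent, node, subtree) {
  const pos = parent.children.indexOf(node)
  if (pos === -1) return
  parent.children.splice(pos, 1, ...subtree.children)
}
```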

thetutlage commented 6 years ago

In fact, I am okay with it if async tokenizers are simply against the design of the library. However, I personally believe a transformer is not the right place to implement partials (maybe it's just my understanding).

Let me share the code I would write inside the transformer, and maybe you can shed some light on it:

const { promisify } = require('util')
const readFile = promisify(require('fs').readFile)
const visit = require('unist-util-visit')
const unified = require('unified')
const markdown = require('remark-parse')
const squeezeParagraphs = require('remark-squeeze-paragraphs')

async function processPartial (node) {
    const partialPath = node.data.hProperties.path
    const content = await readFile(partialPath, 'utf8')
    const processor = unified()
        .use(markdown)
        .use(squeezeParagraphs)

    // Now I don't know how to get the `MDAST` tree from `processor`
    // and merge it with the node
}

function transformer (tree, file, next) {
    const partialNodes = []

    function visitor (node) {
        partialNodes.push(node)
    }

    visit(tree, 'IncludeNode', visitor)

    Promise
        .all(partialNodes.map(processPartial))
        .then(() => next())
        .catch(next)
}

wooorm commented 6 years ago

@thetutlage Heya, sorry for the delay!

For your approach, I suggest allowing a new processor as options: remark().use(remarkPartials[, otherProcessor]) (you could default to something, but that may pack unneeded code in the browser).

The thing is: you can’t trust that the current processor (this in the attacher), has the remark parser. Maybe it’s set up to read HTML instead (rehype-parse), later transforming it to markdown (rehype-remark).

I’d suggest something along these lines (not tested!):

var vfile = require('to-vfile')
var visit = require('unist-util-visit')

module.exports = partials

function partials(processor) {
  var parser = this.Parser

  if (!processor) {
    throw new Error('Processor required')
  }

  if (isRemarkParser(parser)) {
    // Add the tokenizer to `parser`
  }

  return transform

  function transform(tree, file, next) {
    var count = 0

    visit(tree, 'partial', visitor)

    done()

    function done(err) {
      if (err) {
        next(err)
      } else if (!count) {
        next()
      }
    }

    function visitor(node, _, parent) {
      var fp = node.url // Resolve the path relative to `file.path`?

      count++
      vfile.read(fp, onfile)

      function onfile(err, subfile) {
        if (subfile) {
          processor.run(processor.parse(subfile), subfile, one)
        }
        done(err)
      }

      function one(err, subtree, subfile) {
        var pos
        if (subtree) {
          // Maybe do something with `subfile.messages`?
          pos = parent.children.indexOf(node) // Could be that two partials resolved out of order, so can’t trust `_`
          // `subtree` is a root node, so splice its children in, in place of `node`
          ;[].splice.apply(parent.children, [pos, 1].concat(subtree.children))
          count--
        }
        done(err)
      }
    }
  }
}

function isRemarkParser(parser) {
  return Boolean(
    parser &&
      parser.prototype &&
      parser.prototype.inlineTokenizers
  )
}
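The count/done completion pattern above can be isolated into a generic helper (a sketch, not remark-specific; the names are mine): run every job, then fire the callback once after all succeed, or once with the first error.

```javascript
// Generic form of the `count`/`done` pattern: `next` is called exactly once,
// after all jobs finish or with the first error encountered.
function runAll (jobs, next) {
  var count = jobs.length
  var errored = false

  if (!count) return next()

  jobs.forEach(function (job) {
    job(function (err) {
      if (errored) return
      if (err) {
        errored = true
        return next(err)
      }
      if (--count === 0) next()
    })
  })
}
```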

thetutlage commented 6 years ago

Hello @wooorm this makes sense, thanks for the help 😄

jashmenn commented 5 years ago

For those landing here, see https://github.com/temando/remark-gitlab-artifact/blob/master/src/index.js for a solution to this idea of performing async operations based on MDAST nodes.