rsms / markdown-wasm

Very fast Markdown parser and HTML generator implemented in WebAssembly, based on md4c
https://rsms.me/markdown-wasm/
MIT License
1.51k stars 62 forks source link

Syntax highlighting hook #11

Open aminya opened 3 years ago

aminya commented 3 years ago

Marked supports providing a custom highlight function to colour the code. It would be nice if markdown-wasm provides a similar feature. This means that every time a <code> or a <pre><code> is seen, it should pass the code to the hook for external rendering.

Here is how marked allows it: https://github.com/atom-community/atom-ide-markdown-service/blob/9a95f3815bbcf2507a31235a9d2f9102192f92b2/src/renderer.ts#L33-L42

aminya commented 3 years ago

@rsms Could you shed some light on this issue? Is this possible given the current codebase?

rsms commented 3 years ago

That's a cool idea. It is likely to be a complicated addition as this is a WASM module, not JavaScript. Even if a js function is called as a callback, you will likely be paying a pretty steep perf cost.

aminya commented 3 years ago

So do you think that editing the HTML code blocks (adding colors) after getting the initial markdown would have better performance?

rsms commented 3 years ago

@aminya Okay, I've got a proposal in the main branch. It was quite a bit of effort as I had to do some maintenance on some tools (wasmc and emsdk docker image) but this approach actually works really well and is reasonably performant.

2938af6f0bb4bd555430af7a175b8626fbab6c93 adds onCodeBlock to ParseOptions with the following signature:

interface ParseOptions {
  /**
   * onCodeBlock is an optional callback which if provided is called for each code block.
   * langname holds the "language tag", if any, of the block.
   *
   * The returned value is inserted into the resulting HTML verbatim, without HTML escaping.
   * Thus, you should take care of properly escaping any special HTML characters.
   *
   * If the function returns null or undefined, or an exception occurs, the body will be
   * included as-is after going through HTML escaping.
   *
   * Note that use of this callback has an adverse impact on performance as it casues
   * calls and data to be bridged between WASM and JS on every invocation.
   */
  onCodeBlock? :(langname :string, body :UTF8Bytes) => Uint8Array|string|null|undefined
  // ...
}

Try it like this:

$ git checkout https://github.com/rsms/markdown-wasm.git
$ cd markdown-wasm
$ git checkout 571f5ceb09730208edcf8abb19444ed24432c337
$ npm install
$ npm run build
$ node example/example.js

Example program:

const fs = require("fs")
const md = require("../dist/markdown.node.js")
const source = fs.readFileSync(__dirname + "/example.md")
const outbuf = md.parse(source, {
  bytes: true,
  onCodeBlock(lang, body) {
    return body.toString().toUpperCase()
  },
})
fs.writeFileSync("example.html", outbuf)

Since WASM maintains an isolated "heap of memory", every call to the filter function implies one copy of the bytes (from a JS heap Uint8Array buffer to the WASM heap). Additionally if you do body.toString() to get a JS string (or decode the body, which is a view into the WASM heap) you pay 3x additional "tolls" of copying (WASM heap to JS heap as a UTF-16 string + return value JS string to JS heap Uint8Array + JS heap Uint8Array to WASM heap.) This is unavoidable if you deal with JS strings. I'm looking into the possibility to directly encode a JS string into WASM heap memory (see withUTF16Str in wlib.js) but it means that the data needs to be translated into UTF8 and copied anyways, but in WASM, which is likely going to be slower.

Anyhow, the performance is still really good. On my 2018 MacBook Pro, nodejs v16.1:

(repro with node example/example.js -bench)

rsms commented 3 years ago

Released alongside a security fix in v1.2.0 on NPM

rsms commented 3 years ago

@aminya Let me know if this is a working solution for you and we can close the issue.

aminya commented 3 years ago

Thank you so much! ❤️

The only issue that I have now is that onCodeBlock is a synchronous function, while my highlighting function is asynchronous. I might be able to use make-synchronous or deasync, but I was wondering if it is possible to make it async.

Regarding the string conversion, this video might help.

rsms commented 3 years ago

There's really no reasonably-affordable way to make it async as parsing is inherently synchronous. I'm surprised your syntax highlighter is async since it's really a CPU-bound thing with no I/O involved on a single-threaded platform (nodejs.) Perhaps look into using a different syntax highlighter? Alt you can do what is pretty common and apply post-processing to the output HTML (just grep for <code> blocks)

aminya commented 3 years ago

This code runs inside Atom, and the syntax highlighting is done by SuperString and TreeSitter, which are native code and run in parallel. If it was in JavaScript, there was no reason to make the function async, but this isn't the case here.

Anyways, thanks for your help!