micromark / micromark-extension-gfm-table

micromark extension to support GFM tables
https://unifiedjs.com
MIT License
6 stars 6 forks source link

severe performance issues with certain inputs #6

Closed vogon closed 1 year ago

vogon commented 2 years ago

Initial checklist

Affected packages and versions

micromark-extension-gfm-table 1.0.5

Link to runnable example

No response

Steps to reproduce

we're using remark-gfm as part of our markdown parsing stack on a deployed site which accepts user markdown, and by bisecting the parsing stack, we've noticed that certain inputs of similar size and content cause micromark-extension-gfm-table to run 120-fold slower.

minimized test case:

import fs from "fs";
import { micromark } from "micromark";
import { gfmTable, gfmTableHtml } from "micromark-extension-gfm-table";

const md = fs.readFileSync(process.argv[2]);

const output = micromark(md, {
    extensions: [gfmTable],
    htmlExtensions: [gfmTableHtml],
});

console.log(output);

inputs that reproduce this issue: bee-movie-repro.md

inputs that do not reproduce this issue: bee-movie-no-repro-2.md bee-movie-no-repro.md

Expected behavior

$ time node main.mjs /mnt/c/Users/vogon/Downloads/bee-movie-no-repro.md > /dev/null
node main.mjs /mnt/c/Users/vogon/Downloads/bee-movie-no-repro.md > /dev/null  0.16s user 0.02s system 132% cpu 0.135 total

$ time node main.mjs /mnt/c/Users/vogon/Downloads/bee-movie-no-repro-2.md > /dev/null
node main.mjs /mnt/c/Users/vogon/Downloads/bee-movie-no-repro-2.md > /dev/nul  0.37s user 0.04s system 166% cpu 0.247 total

edit to add: also, after removing gfmTable and gfmTableHtml from the list of extensions in the call to micromark():

$ time node main.mjs /mnt/c/Users/vogon/Downloads/bee-movie-repro.md > /dev/null
node main.mjs /mnt/c/Users/vogon/Downloads/bee-movie-repro.md > /dev/null  0.20s user 0.00s system 124% cpu 0.160 total

Actual behavior

$ time node main.mjs /mnt/c/Users/vogon/Downloads/bee-movie-repro.md > /dev/null
node main.mjs /mnt/c/Users/vogon/Downloads/bee-movie-repro.md > /dev/null  36.33s user 3.80s system 204% cpu 19.579 total

Runtime

Node v14

Package manager

pnpm

OS

Linux

Build and bundle tools

No response

vogon commented 2 years ago

it appears to be related to the number of contentful lines which are separated by only a single linebreak; a 1365-line file where each line just contains "a" (attached) also takes about 20 seconds to process.

$ time node main.mjs repro-test-1.md > /dev/null
node main.mjs repro-test-1.md > /dev/null  19.88s user 1.77s system 170% cpu 12.684 total

repro-test-1.md

vogon commented 2 years ago

we ended up having to mitigate this downstream by removing GFM extensions from the parsing stack on any input markup that contained more than 256 lines in a single paragraph.

for i in {1..1365}; do yes | head -n $i > test.md; echo $i: $(time node main.mjs test.md > /dev/null); done is the one-liner we used to come up with this cutoff, in case anyone else is going through similar issues.

for a ~200 line paragraph, this plugin takes 300ms to parse; at ~360, it hits 500ms; it crosses 1 second at ~500.

wooorm commented 2 years ago

I investigated this a bit today. The problem is very tough, because GFM tables have some weird cases where they match anything.

E.g.:

a
:-
b
==
c

a :- b

c


For now, if you control content, I recommend adding blank lines between tables and other things. It mostly solves this. I have a bit of an idea on how to improve it, but it requires changes to micromark itself as well. So this will be solved, but it has a low priority for me.

To add: https://github.com/sponsors/unifiedjs

craftzdog commented 1 year ago

What if you drop supporting those weird cases? lezer-markdown doesn't support it but I guess that's mostly fine.

Murderlon commented 1 year ago

There is also a need for a completely compliant and safe markdown parser. AFAIK few parsers are as correct in edge cases as micromark. If performance is really important, consider looking at the Rust version of micromark: https://github.com/wooorm/markdown-rs

craftzdog commented 1 year ago

Thanks @Murderlon for letting me now. I'll check it out. And thanks for the hard work, @wooorm !

DavidAnson commented 1 year ago

Hi all,

I landed here after looking into why inCellContentHead was the second-most time-consuming function in performance profiles of micromark. According to some VERY rough numbers, I estimate that micromark-extension-gfm-table takes about 40% of execution time on typical input (all the Markdown in the .NET documentation repository in this case). The example above is pretty extreme, but I wonder if either of these would be possible:

  1. Some kind of forward-scanning in start to produce an early-exit in cases where it's clear a table is not present at the current location. This scan doesn't even need to be perfect, it just needs to be right a lot of the time (and never wrong).
  2. A developer-level option to restrict scenarios to "normal" table scenarios. For example, I'd be fine if the example above were not recognized as a table. I might even be fine requiring tables start with | which seems trivial to handle in the code.

Correctness comes before performance, but performance is important, too. :) Thanks for your consideration!

wooorm commented 1 year ago

and never wrong

This is the complex part, because any line can be part of a table, according to GFM:

a
:-:
b
c

a :-: b c


I believe markdown-rs does not have this problem tho, so I do think we can get around it. But it needs an overhaul of the current algorithms here in JS!

performance is important, too

Quite! :+1:

DavidAnson commented 1 year ago

Ooookay, I'm not sure how I feel about this, but the following snippet defines gfmTableStrict which seems to be viable as a drop-in replacement for gfmTable to save something like 5% of time on typical input. (Sorry, I was hoping for better.) It requires that all tables begin with the | character which excludes weird scenarios like above in this thread. Sharing is caring, so FYI for the community...

Assuming:

import { codes } from "micromark-util-symbol/codes";
import { gfmTable } from "micromark-extension-gfm-table";

Then:

function tokenizeStrict(effects, ok, nok) {
  const start = gfmTable.flow.null.tokenize.call(this, effects, ok, nok);
  return (code) => {
    if (code !== codes.verticalBar) {
      // Reject tables that don't begin with "|" for performance reasons
      return nok(code);
    }
    return start(code);
  };
}
const gfmTableStrict = {
  "flow": {
    "null": {
      "resolve": gfmTable.flow.null.resolve,
      "tokenize": tokenizeStrict
    }
  }
};
github-actions[bot] commented 1 year ago

Hi! This was closed. Team: If this was fixed, please add phase/solved. Otherwise, please add one of the no/* labels.

wooorm commented 1 year ago

Released in 1.0.6!

DavidAnson commented 1 year ago

Looks like I'll be benchmarking tonight... Thank you!

DavidAnson commented 1 year ago

Following up, it looks like this change reduces the runtime on my machine of the markdownlint profile script from 35s to 32s which is awesome. (That script does way more than just run micromark, so the direct benefits of this change are masked somewhat.) Thank you!