Closed vogon closed 1 year ago
it appears to be related to the number of contentful lines which are separated by only a single linebreak; a 1365-line file where each line just contains "a" (attached) also takes about 20 seconds to process.
$ time node main.mjs repro-test-1.md > /dev/null
node main.mjs repro-test-1.md > /dev/null 19.88s user 1.77s system 170% cpu 12.684 total
we ended up having to mitigate this downstream by removing GFM extensions from the parsing stack on any input markup that contained more than 256 lines in a single paragraph.
for i in {1..1365}; do yes | head -n $i > test.md; echo $i: $(time node main.mjs test.md > /dev/null); done
is the one-liner we used to come up with this cutoff, in case anyone else is going through similar issues.
for a ~200 line paragraph, this plugin takes 300ms to parse; at ~360, it hits 500ms; it crosses 1 second at ~500.
I investigated this a bit today. The problem is very tough, because GFM tables have some weird cases where they match anything.
E.g.:
a
:-
b
==
c
c
For now, if you control content, I recommend adding blank lines between tables and other things. It mostly solves this. I have a bit of an idea on how to improve it, but it requires changes to micromark itself as well. So this will be solved, but it has a low priority for me.
What if you drop supporting those weird cases? lezer-markdown doesn't support it but I guess that's mostly fine.
There is also a need for a completely compliant and safe markdown parser. AFAIK few parsers are as correct in edge cases as micromark. If performance is really important, consider looking at the Rust version of micromark: https://github.com/wooorm/markdown-rs
Thanks @Murderlon for letting me now. I'll check it out. And thanks for the hard work, @wooorm !
Hi all,
I landed here after looking into why inCellContentHead
was the second-most time-consuming function in performance profiles of micromark
. According to some VERY rough numbers, I estimate that micromark-extension-gfm-table
takes about 40% of execution time on typical input (all the Markdown in the .NET documentation repository in this case). The example above is pretty extreme, but I wonder if either of these would be possible:
start
to produce an early-exit in cases where it's clear a table is not present at the current location. This scan doesn't even need to be perfect, it just needs to be right a lot of the time (and never wrong).|
which seems trivial to handle in the code.Correctness comes before performance, but performance is important, too. :) Thanks for your consideration!
and never wrong
This is the complex part, because any line can be part of a table, according to GFM:
a
:-:
b
c
a :-: b c
I believe markdown-rs
does not have this problem tho, so I do think we can get around it. But it needs an overhaul of the current algorithms here in JS!
performance is important, too
Quite! :+1:
Ooookay, I'm not sure how I feel about this, but the following snippet defines gfmTableStrict
which seems to be viable as a drop-in replacement for gfmTable
to save something like 5% of time on typical input. (Sorry, I was hoping for better.) It requires that all tables begin with the |
character which excludes weird scenarios like above in this thread. Sharing is caring, so FYI for the community...
Assuming:
import { codes } from "micromark-util-symbol/codes";
import { gfmTable } from "micromark-extension-gfm-table";
Then:
function tokenizeStrict(effects, ok, nok) {
const start = gfmTable.flow.null.tokenize.call(this, effects, ok, nok);
return (code) => {
if (code !== codes.verticalBar) {
// Reject tables that don't begin with "|" for performance reasons
return nok(code);
}
return start(code);
};
}
const gfmTableStrict = {
"flow": {
"null": {
"resolve": gfmTable.flow.null.resolve,
"tokenize": tokenizeStrict
}
}
};
Hi! This was closed. Team: If this was fixed, please add phase/solved
. Otherwise, please add one of the no/*
labels.
Looks like I'll be benchmarking tonight... Thank you!
Following up, it looks like this change reduces the runtime on my machine of the markdownlint
profile script from 35s to 32s which is awesome. (That script does way more than just run micromark
, so the direct benefits of this change are masked somewhat.) Thank you!
Initial checklist
Affected packages and versions
micromark-extension-gfm-table 1.0.5
Link to runnable example
No response
Steps to reproduce
we're using remark-gfm as part of our markdown parsing stack on a deployed site which accepts user markdown, and by bisecting the parsing stack, we've noticed that certain inputs of similar size and content cause micromark-extension-gfm-table to run 120-fold slower.
minimized test case:
inputs that reproduce this issue: bee-movie-repro.md
inputs that do not reproduce this issue: bee-movie-no-repro-2.md bee-movie-no-repro.md
Expected behavior
edit to add: also, after removing
gfmTable
andgfmTableHtml
from the list of extensions in the call tomicromark()
:Actual behavior
Runtime
Node v14
Package manager
pnpm
OS
Linux
Build and bundle tools
No response