jgm / djot.js

JavaScript implementation of djot
MIT License
155 stars 18 forks source link

Get faster #13

Open jgm opened 1 year ago

jgm commented 1 year ago

Benchmarks are currently running about 3X slower than for commonmark.js. It would be good to understand why and narrow the gap.

jgm commented 1 year ago

Funny thing is that when I benchmark the commonmark.js CLI against the djot one, the djot one is always slightly faster. So I am not sure how to interpret these results. It could be that there's a bit more startup overhead in each conversion for djot (perhaps because of how parser objects are created) and that this predominates in repeated tests parsing very small texts. When longer texts are using, the difference in startup overhead gets swamped and djot comes out ahead.

It may be that there's some low hanging fruit to be plucked, but I'd probably need guidance from someone who is more experienced in optimizing JS.

matklad commented 1 year ago

As a data-point, for me djot takes about 1.75x more time than common mark when rendering a 3.5mb file.

Here's a flamegraph for djot, haven't look at it yet: https://share.firefox.dev/3IuG1qg

jgm commented 1 year ago

This is using it with node, or in the browser?

matklad commented 1 year ago

With node

matklad commented 1 year ago

This part looks potentially optimizable:

    for (let i = this.firstpos; i <= this.lastpos; i++) {
      if (this.matches[i]) {

If I understand this correctly, here we go through every "char" position and do a hashmap lookup. I think the map should be pretty sparse, so it'd be more efficient to convert matches to array and sort them by pos rather than enumerating every pos

jgm commented 1 year ago

this.matches actually is an array -- though it's a sparse array and perhaps that's implemented just as a hash map. Anyway, the reason it needs to be indexed by pos is that we sometimes override an earlier match, e.g. * might first be categorized as str but then later +emph.

If we just used an array, we could perhaps just add the +emph with the same position later in the array, then sort the array, then remove the first of every pair of adjacent elements with the same startpos, or something like that.

matklad commented 1 year ago

Yeah, there's some water to squeeze out this stone: if I replay array with Map, I get 10% win. Though, I commented out the code that handles hardline breaks, as that uses the last element of the array, and we don't have that with Map.

matklad commented 1 year ago

I am wondering: can we refer to an earlier match not by codepoint offset, but by its position in the matches array? That is, we accumulate events in an array, events are generally .sorted by their startpos, but, if we change our mind about some past events, we go to their position in an array and replace them with tombstones.

I use a similar device in rust-analyzer:

https://github.com/rust-lang/rust-analyzer/blob/80cabf726068187d8686b5ccf37aac484da84904/crates/parser/src/event.rs#L20-L81

Parsing events are accumulated into a flat array, but events can be tombstones, or they can refer to future events.

(the backstory here is that, while brute-force replacement of array with map speeds things up, getMaches still is the hottest single-indentifiedable place in the profile, as we replaced a lot of negative hash map lookups with array allocation and sorting; so I am wondering if we can just be smarter with dumb dense arrays here)

jgm commented 1 year ago

Yes, this sounds like a plausible idea.

The main change would be adding to Opener a field for the index into the matches array (since we wouldn't be able to use the startpos for this). This would have to be added in addOpener. It would also be a good idea to incorporate adding the "default match" to addOpener; this would ensure we get the right index. Currently we do addOpener and then addMatch to add the default match.

Other cases to handle:

And then we need to figure out what the tombstone is and how to strip it out. It could just be null, I suppose, and we could filter the array on !== null before sorting? But maybe v8 would prefer if we used the same type throughout, in which case we could have annot = "tombstone".

[I can try this.]

jgm commented 1 year ago

I've made a stab at this in #24. I think the code is more elegant (and shorter) now. But two tests fail and I haven't yet gotten to the bottom of it.

matklad commented 1 year ago

Looked at the perf profile again this morning. Sadly, nothing more stands out: at this (function) level of granularity, the profile looks mostly flat, it's "stuff takes time" basically. I think we'll need some sort of line profiler to get further along, but quick googling didn't show how to get that with JS.

Nonetheless:

jgm commented 1 year ago

I adjusted the commonmark.js profiling so it matches what we're doing (parsing only, constructing object inside the profiling loop), only with cm versions of our dj files.

The differences are rather astounding, though I must say I don't understand why the observed differences on real-world conversions are not very large:

block-bq-flat.md 263,041 ops/sec ±0.21% (100 runs sampled)
block-bq-nested.md 146,306 ops/sec ±0.18% (100 runs sampled)
block-code.md 555,951 ops/sec ±0.13% (100 runs sampled)
block-fences.md 600,999 ops/sec ±0.12% (99 runs sampled)
block-heading.md 283,913 ops/sec ±0.18% (100 runs sampled)
block-hr.md 437,882 ops/sec ±0.14% (100 runs sampled)
block-html.md 154,155 ops/sec ±0.21% (99 runs sampled)
block-lheading.md 386,463 ops/sec ±0.16% (100 runs sampled)
block-list-flat.md 34,534 ops/sec ±0.16% (101 runs sampled)
block-list-nested.md 34,284 ops/sec ±0.19% (101 runs sampled)
block-ref-flat.md 85,683 ops/sec ±0.22% (94 runs sampled)
block-ref-nested.md 73,614 ops/sec ±0.18% (95 runs sampled)
inline-autolink.md 92,900 ops/sec ±0.16% (101 runs sampled)
inline-backticks.md 451,831 ops/sec ±0.16% (100 runs sampled)
inline-em-flat.md 120,869 ops/sec ±0.21% (100 runs sampled)
inline-em-nested.md 168,278 ops/sec ±0.17% (101 runs sampled)
inline-em-worst.md 170,375 ops/sec ±0.17% (101 runs sampled)
inline-entity.md 193,644 ops/sec ±0.19% (98 runs sampled)
inline-escape.md 131,929 ops/sec ±0.15% (101 runs sampled)
inline-html.md 57,738 ops/sec ±0.18% (100 runs sampled)
inline-links-flat.md 52,047 ops/sec ±0.19% (97 runs sampled)
inline-links-nested.md 104,131 ops/sec ±0.16% (101 runs sampled)
inline-newlines.md 187,799 ops/sec ±0.25% (100 runs sampled)
lorem1.md 72,545 ops/sec ±0.16% (100 runs sampled)
README.md 7,623 ops/sec ±0.24% (99 runs sampled)

And here is djot.js:

parse block-bq-flat.dj x 97,828 ops/sec ±1.66% (88 runs sampled)
parse block-bq-nested.dj x 48,389 ops/sec ±0.39% (101 runs sampled)
parse block-code.dj x 214,310 ops/sec ±0.43% (100 runs sampled)
parse block-fences.dj x 205,352 ops/sec ±0.46% (96 runs sampled)
parse block-heading.dj x 70,228 ops/sec ±0.38% (101 runs sampled)
parse block-hr.dj x 77,164 ops/sec ±0.39% (95 runs sampled)
parse block-list-flat.dj x 10,230 ops/sec ±0.69% (99 runs sampled)
parse block-list-nested.dj x 13,184 ops/sec ±0.88% (99 runs sampled)
parse block-ref-flat.dj x 20,696 ops/sec ±0.33% (99 runs sampled)
parse block-ref-nested.dj x 15,024 ops/sec ±0.26% (100 runs sampled)
parse inline-autolink.dj x 36,221 ops/sec ±0.30% (98 runs sampled)
parse inline-backticks.dj x 115,746 ops/sec ±0.37% (102 runs sampled)
parse inline-em-flat.dj x 35,967 ops/sec ±0.26% (99 runs sampled)
parse inline-em-nested.dj x 37,750 ops/sec ±0.25% (98 runs sampled)
parse inline-em-worst.dj x 43,255 ops/sec ±0.32% (98 runs sampled)
parse inline-escape.dj x 34,745 ops/sec ±0.50% (99 runs sampled)
parse inline-links-flat.dj x 18,096 ops/sec ±0.35% (100 runs sampled)
parse inline-links-nested.dj x 15,281 ops/sec ±0.31% (100 runs sampled)
parse lorem1.dj x 18,738 ops/sec ±0.48% (99 runs sampled)
parse readme.dj x 2,770 ops/sec ±0.57% (100 runs sampled)
renderHTML readme x 17,337 ops/sec ±0.09% (102 runs sampled)
jgm commented 1 year ago

PS. My memory of optimizing the commonmark.js parser was that it was largely a matter of doing things that made the v8 optimizer happy -- so, it wasn't like optimizing a regular program, where you look for hot spots (the sort of thing we've been doing), but more about doing magic tricks that suddenly made a huge difference.

jgm commented 1 year ago

Tried replacing accumulatedText with a string, but no improvement, slight slowdown in some cases.

jgm commented 1 year ago

I think your first two findings are the most promising. Something is megamorphic, and if we figure out what's triggering that it could have a big effect. Note that in commonmark.js, we just use a Node object that, essentially, has all the fields anything will need (stringContent, destination, title, level, fenceChar, etc.). That ensures that all the objects have the same "shape," which apparently makes the optimizer happy. In djot.js, by contrast, we have AST nodes with different shapes, and maybe this makes any function that takes multiple kinds of AST nodes as parameters megamorphic. I'm not sure how to fix this, though; we want different shapes. One possible option to explore would be keeping the things that vary under a data field, so we have

{ tag: "heading",
  children: ...,
  attributes: ...,
  data: { level: 5 }
}

The second thing, avoiding things ilke "+" + annotation, seems worth exploring for sure. EDIT: The only place I could find where we do "+" + annotation is inbetweenMatches in inlines.ts. I tried removing that by giving this function parameters for both open and close annotation, and just hard-coding "+emph" and "-emph" when calling the function. This did not yield any real difference in benchmarsk.

jgm commented 1 year ago

OK, I managed to remove the need for the .sort() in inlines. This gives us the following benchmarks (around a 20% improvement in inline heavy text):

parse block-bq-flat.dj x 111,995 ops/sec ±0.64% (97 runs sampled)
parse block-bq-nested.dj x 54,569 ops/sec ±1.51% (97 runs sampled)
parse block-code.dj x 224,163 ops/sec ±0.68% (98 runs sampled)
parse block-fences.dj x 210,808 ops/sec ±0.46% (100 runs sampled)
parse block-heading.dj x 77,986 ops/sec ±0.43% (101 runs sampled)
parse block-hr.dj x 83,618 ops/sec ±0.38% (97 runs sampled)
parse block-list-flat.dj x 11,321 ops/sec ±0.54% (101 runs sampled)
parse block-list-nested.dj x 14,747 ops/sec ±0.52% (99 runs sampled)
parse block-ref-flat.dj x 24,116 ops/sec ±0.36% (98 runs sampled)
parse block-ref-nested.dj x 17,338 ops/sec ±0.27% (100 runs sampled)
parse inline-autolink.dj x 44,454 ops/sec ±0.33% (101 runs sampled)
parse inline-backticks.dj x 125,638 ops/sec ±0.37% (101 runs sampled)
parse inline-em-flat.dj x 41,196 ops/sec ±0.28% (96 runs sampled)
parse inline-em-nested.dj x 41,667 ops/sec ±0.24% (101 runs sampled)
parse inline-em-worst.dj x 48,444 ops/sec ±0.32% (100 runs sampled)
parse inline-escape.dj x 39,112 ops/sec ±0.35% (96 runs sampled)
parse inline-links-flat.dj x 22,038 ops/sec ±0.50% (99 runs sampled)
parse inline-links-nested.dj x 17,464 ops/sec ±0.31% (101 runs sampled)
parse lorem1.dj x 22,343 ops/sec ±0.36% (101 runs sampled)
parse readme.dj x 3,100 ops/sec ±0.57% (100 runs sampled)
renderHTML readme x 17,336 ops/sec ±0.08% (98 runs sampled)
jgm commented 1 year ago

In a real-world test with converted versions of the same document, I'm getting:

djot.js: 2874 bytes/ms commonmark.js: 3780 bytes/ms

jgm commented 1 year ago

After a few more optimizations, I've improved it a bit more:

parse block-bq-flat.dj x 116,613 ops/sec ±0.58% (99 runs sampled)
parse block-bq-nested.dj x 57,481 ops/sec ±1.44% (96 runs sampled)
parse block-code.dj x 282,016 ops/sec ±0.48% (98 runs sampled)
parse block-fences.dj x 272,059 ops/sec ±0.49% (95 runs sampled)
parse block-heading.dj x 86,364 ops/sec ±0.65% (97 runs sampled)
parse block-hr.dj x 89,644 ops/sec ±0.38% (99 runs sampled)
parse block-list-flat.dj x 12,203 ops/sec ±0.54% (98 runs sampled)
parse block-list-nested.dj x 15,496 ops/sec ±0.50% (100 runs sampled)
parse block-ref-flat.dj x 24,117 ops/sec ±0.35% (100 runs sampled)
parse block-ref-nested.dj x 17,079 ops/sec ±0.26% (98 runs sampled)
parse inline-autolink.dj x 44,118 ops/sec ±0.32% (100 runs sampled)
parse inline-backticks.dj x 125,903 ops/sec ±0.35% (100 runs sampled)
parse inline-em-flat.dj x 41,459 ops/sec ±0.30% (100 runs sampled)
parse inline-em-nested.dj x 42,237 ops/sec ±0.30% (96 runs sampled)
parse inline-em-worst.dj x 47,392 ops/sec ±0.32% (97 runs sampled)
parse inline-escape.dj x 39,054 ops/sec ±0.32% (99 runs sampled)
parse inline-links-flat.dj x 22,308 ops/sec ±0.36% (101 runs sampled)
parse inline-links-nested.dj x 17,341 ops/sec ±0.30% (99 runs sampled)
parse lorem1.dj x 22,885 ops/sec ±0.36% (98 runs sampled)
parse readme.dj x 3,262 ops/sec ±0.56% (98 runs sampled)
renderHTML readme x 17,430 ops/sec ±0.05% (97 runs sampled)
jgm commented 1 year ago

After 819f8f398a850945c4e17537dbd5a922d74786b7 we're a bit faster:

parse block-bq-flat.dj x 122,598 ops/sec ±0.61% (100 runs sampled)
parse block-bq-nested.dj x 59,940 ops/sec ±1.35% (99 runs sampled)
parse block-code.dj x 285,294 ops/sec ±0.52% (99 runs sampled)
parse block-fences.dj x 272,202 ops/sec ±0.49% (100 runs sampled)
parse block-heading.dj x 91,620 ops/sec ±0.45% (97 runs sampled)
parse block-hr.dj x 93,569 ops/sec ±0.37% (97 runs sampled)
parse block-list-flat.dj x 12,692 ops/sec ±0.51% (100 runs sampled)
parse block-list-nested.dj x 15,946 ops/sec ±0.51% (99 runs sampled)
parse block-ref-flat.dj x 26,055 ops/sec ±0.36% (98 runs sampled)
parse block-ref-nested.dj x 18,252 ops/sec ±0.29% (100 runs sampled)
parse inline-autolink.dj x 47,963 ops/sec ±0.32% (98 runs sampled)
parse inline-backticks.dj x 134,984 ops/sec ±0.40% (94 runs sampled)
parse inline-em-flat.dj x 43,427 ops/sec ±0.27% (96 runs sampled)
parse inline-em-nested.dj x 43,967 ops/sec ±0.38% (100 runs sampled)
parse inline-em-worst.dj x 51,468 ops/sec ±0.51% (96 runs sampled)
parse inline-escape.dj x 42,720 ops/sec ±0.33% (100 runs sampled)
parse inline-links-flat.dj x 24,501 ops/sec ±0.38% (99 runs sampled)
parse inline-links-nested.dj x 18,910 ops/sec ±0.31% (101 runs sampled)
parse lorem1.dj x 24,809 ops/sec ±0.34% (101 runs sampled)
parse readme.dj x 3,486 ops/sec ±0.58% (99 runs sampled)
renderHTML readme x 17,433 ops/sec ±0.04% (100 runs sampled)
jgm commented 1 year ago

After 1492366b9451aa6055ea9dd9c55a0ca18de10632 :

parse block-bq-flat.dj x 132,922 ops/sec ±0.63% (97 runs sampled)
parse block-bq-nested.dj x 64,055 ops/sec ±1.37% (93 runs sampled)
parse block-code.dj x 305,707 ops/sec ±0.67% (98 runs sampled)
parse block-fences.dj x 289,597 ops/sec ±0.52% (97 runs sampled)
parse block-heading.dj x 96,519 ops/sec ±0.59% (93 runs sampled)
parse block-hr.dj x 100,535 ops/sec ±0.73% (99 runs sampled)
parse block-list-flat.dj x 12,279 ops/sec ±0.92% (99 runs sampled)
parse block-list-nested.dj x 15,425 ops/sec ±0.59% (99 runs sampled)
parse block-ref-flat.dj x 27,987 ops/sec ±0.54% (93 runs sampled)
parse block-ref-nested.dj x 19,508 ops/sec ±0.32% (96 runs sampled)
parse inline-autolink.dj x 49,090 ops/sec ±0.32% (100 runs sampled)
parse inline-backticks.dj x 139,996 ops/sec ±0.54% (100 runs sampled)
parse inline-em-flat.dj x 46,086 ops/sec ±0.30% (96 runs sampled)
parse inline-em-nested.dj x 47,134 ops/sec ±0.29% (96 runs sampled)
parse inline-em-worst.dj x 57,647 ops/sec ±0.35% (99 runs sampled)
parse inline-escape.dj x 47,039 ops/sec ±0.34% (100 runs sampled)
parse inline-links-flat.dj x 26,364 ops/sec ±0.39% (101 runs sampled)
parse inline-links-nested.dj x 21,174 ops/sec ±0.33% (98 runs sampled)
parse lorem1.dj x 26,566 ops/sec ±0.38% (100 runs sampled)
parse readme.dj x 3,529 ops/sec ±0.57% (99 runs sampled)
renderHTML readme x 17,457 ops/sec ±0.04% (101 runs sampled)
jgm commented 1 year ago

After ddb155a9849c89f52a958af303e290a88d8447a8

parse block-bq-flat.dj x 131,276 ops/sec ±0.60% (96 runs sampled)
parse block-bq-nested.dj x 63,742 ops/sec ±1.39% (95 runs sampled)
parse block-code.dj x 301,477 ops/sec ±0.64% (100 runs sampled)
parse block-fences.dj x 286,620 ops/sec ±0.50% (97 runs sampled)
parse block-heading.dj x 95,016 ops/sec ±0.50% (99 runs sampled)
parse block-hr.dj x 103,091 ops/sec ±0.37% (102 runs sampled)
parse block-list-flat.dj x 12,714 ops/sec ±0.51% (101 runs sampled)
parse block-list-nested.dj x 15,965 ops/sec ±0.53% (100 runs sampled)
parse block-ref-flat.dj x 28,886 ops/sec ±0.38% (99 runs sampled)
parse block-ref-nested.dj x 19,939 ops/sec ±0.27% (102 runs sampled)
parse inline-autolink.dj x 50,233 ops/sec ±0.35% (100 runs sampled)
parse inline-backticks.dj x 145,375 ops/sec ±0.38% (98 runs sampled)
parse inline-em-flat.dj x 46,144 ops/sec ±0.27% (99 runs sampled)
parse inline-em-nested.dj x 46,735 ops/sec ±0.27% (98 runs sampled)
parse inline-em-worst.dj x 57,474 ops/sec ±0.33% (99 runs sampled)
parse inline-escape.dj x 47,054 ops/sec ±0.35% (99 runs sampled)
parse inline-links-flat.dj x 26,326 ops/sec ±0.37% (101 runs sampled)
parse inline-links-nested.dj x 21,142 ops/sec ±0.32% (101 runs sampled)
parse lorem1.dj x 26,505 ops/sec ±0.37% (98 runs sampled)
parse readme.dj x 3,523 ops/sec ±0.57% (99 runs sampled)
renderHTML readme x 29,028 ops/sec ±0.04% (98 runs sampled)
jgm commented 1 year ago

After 57c81eaaca51e13d0f94b66a4dec63dda95fb6d7

parse block-bq-flat.dj x 132,939 ops/sec ±0.65% (94 runs sampled)
parse block-bq-nested.dj x 64,598 ops/sec ±1.51% (92 runs sampled)
parse block-code.dj x 324,897 ops/sec ±0.68% (100 runs sampled)
parse block-fences.dj x 309,821 ops/sec ±0.53% (97 runs sampled)
parse block-heading.dj x 98,134 ops/sec ±0.49% (95 runs sampled)
parse block-hr.dj x 103,908 ops/sec ±0.39% (100 runs sampled)
parse block-list-flat.dj x 12,846 ops/sec ±0.54% (98 runs sampled)
parse block-list-nested.dj x 16,011 ops/sec ±0.77% (98 runs sampled)
parse block-ref-flat.dj x 29,263 ops/sec ±0.54% (97 runs sampled)
parse block-ref-nested.dj x 19,837 ops/sec ±0.57% (97 runs sampled)
parse inline-autolink.dj x 51,005 ops/sec ±0.33% (99 runs sampled)
parse inline-backticks.dj x 151,295 ops/sec ±0.38% (97 runs sampled)
parse inline-em-flat.dj x 46,415 ops/sec ±0.27% (102 runs sampled)
parse inline-em-nested.dj x 46,887 ops/sec ±0.26% (102 runs sampled)
parse inline-em-worst.dj x 57,678 ops/sec ±0.33% (101 runs sampled)
parse inline-escape.dj x 47,065 ops/sec ±0.35% (102 runs sampled)
parse inline-links-flat.dj x 26,569 ops/sec ±0.41% (101 runs sampled)
parse inline-links-nested.dj x 21,242 ops/sec ±0.33% (100 runs sampled)
parse lorem1.dj x 26,592 ops/sec ±0.36% (101 runs sampled)
parse readme.dj x 3,577 ops/sec ±0.56% (100 runs sampled)
renderHTML readme x 28,820 ops/sec ±0.05% (98 runs sampled)
jgm commented 1 year ago

After upgrading to node v16, times are slower. I note this here so we don't think changes in the code have caused the slower performance:

parse block-bq-flat.dj x 118,939 ops/sec ±0.77% (99 runs sampled)
parse block-bq-nested.dj x 56,804 ops/sec ±0.68% (100 runs sampled)
parse block-code.dj x 287,796 ops/sec ±0.58% (91 runs sampled)
parse block-fences.dj x 279,181 ops/sec ±0.58% (98 runs sampled)
parse block-heading.dj x 91,807 ops/sec ±0.92% (94 runs sampled)
parse block-hr.dj x 90,520 ops/sec ±0.11% (96 runs sampled)
parse block-list-flat.dj x 11,376 ops/sec ±2.63% (94 runs sampled)
parse block-list-nested.dj x 14,992 ops/sec ±0.84% (99 runs sampled)
parse block-ref-flat.dj x 26,437 ops/sec ±0.58% (97 runs sampled)
parse block-ref-nested.dj x 18,655 ops/sec ±0.51% (100 runs sampled)
parse inline-autolink.dj x 48,486 ops/sec ±0.59% (98 runs sampled)
parse inline-backticks.dj x 139,860 ops/sec ±0.54% (99 runs sampled)
parse inline-em-flat.dj x 51,949 ops/sec ±0.45% (97 runs sampled)
parse inline-em-nested.dj x 51,112 ops/sec ±0.51% (98 runs sampled)
parse inline-em-worst.dj x 51,228 ops/sec ±0.45% (96 runs sampled)
parse inline-escape.dj x 42,468 ops/sec ±0.48% (99 runs sampled)
parse inline-links-flat.dj x 24,437 ops/sec ±1.18% (99 runs sampled)
parse inline-links-nested.dj x 18,685 ops/sec ±0.49% (98 runs sampled)
parse lorem1.dj x 23,750 ops/sec ±0.48% (97 runs sampled)
parse readme.dj x 3,275 ops/sec ±0.94% (96 runs sampled)
renderHTML readme x 29,720 ops/sec ±0.07% (98 runs sampled)