kivikakk / comrak

CommonMark + GFM compatible Markdown parser and renderer

Comrak is much slower than CMark #25

Closed DemiMarie closed 1 week ago

DemiMarie commented 7 years ago

Running the spec tests with Comrak takes about 1.9 times as long as it does with CMark.

kivikakk commented 7 years ago

This is correct! It's unclear how to improve it currently; pull requests and analyses welcome.

DemiMarie commented 7 years ago

@kivikakk I created a flamegraph using perf and flamegraph.pl, and concluded that a massive 37.26% of the time is spent in the regex engine. That can probably be sped up substantially (together with a massive decrease in code size) by hand-coding the scanners.

I suspect that eliminating internal copies can more than make up for the rest.
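To illustrate what hand-coding a scanner means here, a minimal sketch of one block-level scanner as a direct byte loop rather than a regex (a hypothetical example following the CommonMark ATX-heading rule, not comrak's actual scanner code):

```rust
/// Scan for an ATX heading opener at the start of `line` (indentation
/// already stripped) and return the heading level if one is present.
/// Per CommonMark: 1-6 `#` characters followed by a space, tab, or
/// end of line. Hypothetical sketch, not comrak's real implementation.
fn scan_atx_heading_start(line: &[u8]) -> Option<usize> {
    let level = line.iter().take_while(|&&b| b == b'#').count();
    if level == 0 || level > 6 {
        return None;
    }
    match line.get(level) {
        None | Some(b' ') | Some(b'\t') => Some(level),
        _ => None,
    }
}
```

A loop like this does one pass over the bytes with no regex-engine dispatch, which is where the hoped-for speedup (and code-size reduction) would come from.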

kivikakk commented 7 years ago

@DemiMarie Oh, wonderful! Thank you so much; I haven't had the time recently to do this, so I appreciate your looking into it so much.

I'm away from the computer until next week, but when I get back I'll look into optimising the scanners per #16. After that, I'll see how much copy elimination I can do — I had a hack at this using tendril a while back, so I'll have to revisit that work.

I also wonder how much overhead there is in simple indirection through RefCell for all AST nodes.
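For context, the pattern in question looks roughly like this (a simplified sketch of arena-allocated nodes linked by shared references, not comrak's exact types): because links are shared references, all mutable state must sit behind RefCell, and every access pays a dynamic borrow check.

```rust
use std::cell::RefCell;

// Simplified sketch of the arena-node pattern (not comrak's exact types).
// Nodes are linked by shared references, so mutable state must live
// behind RefCell, and every access incurs a runtime borrow-flag check.
struct Node<'a> {
    data: RefCell<String>,
    parent: RefCell<Option<&'a Node<'a>>>,
}

impl<'a> Node<'a> {
    fn new(data: &str) -> Node<'a> {
        Node {
            data: RefCell::new(data.to_string()),
            parent: RefCell::new(None),
        }
    }
}

// Walking to the root performs a dynamic borrow at every hop.
fn root_data<'a, 'b>(mut node: &'b Node<'a>) -> String {
    while let Some(parent) = *node.parent.borrow() {
        node = parent;
    }
    node.data.borrow().clone()
}
```

Each `borrow()`/`borrow_mut()` is only a flag check, but on a hot path over every AST node it is the kind of overhead worth measuring.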

DemiMarie commented 7 years ago

@kivikakk It might be possible to reduce even more overhead by not bothering with tendril’s reference-counting, and instead just storing indexes into the original string. Rust’s lifetimes can guarantee that the string outlives the AST.

Another source of overhead is manipulating strings, rather than byte arrays.
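The index-into-the-source idea can be sketched like this (hypothetical types, not comrak's): each node stores a byte range rather than an owned or refcounted string, and the `'src` lifetime guarantees the input outlives the AST.

```rust
// Sketch of the suggestion: nodes store byte ranges into the original
// input instead of owned (or refcounted) strings. The 'src lifetime
// ties the AST to the source, so the borrow checker guarantees the
// source outlives the tree. Hypothetical types, not comrak's.
struct TextNode<'src> {
    source: &'src [u8],
    start: usize,
    end: usize,
}

impl<'src> TextNode<'src> {
    // Zero-copy access to the node's text as raw bytes.
    fn text(&self) -> &'src [u8] {
        &self.source[self.start..self.end]
    }
}
```

Note the payload here is `&[u8]` rather than `&str`, matching the second point: working on bytes avoids per-access UTF-8 bookkeeping.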

SSJohns commented 7 years ago

@DemiMarie Could that really be causing that much overhead? I started on a branch which tries to remove all strings and turn them into &str and use byte searching on them, but I wasn't sure how much it would actually improve performance.

DemiMarie commented 7 years ago

I am not sure (I don’t know how to find it – I think it would show up as uniformly slow code), but I know that pulldown-cmark and MD4C (the fastest CommonMark implementation, by far) use it. MD4C doesn’t even do UTF-8 validation – it promises that the output will be valid UTF-8 if the input is, and that invalid UTF-8 will not cause undefined behavior, but in general invalid UTF-8 input will result in invalid UTF-8 reaching the rendering callbacks.
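In practice, "byte searching" means scanning a `&[u8]` for the next byte of interest without decoding characters at all. A minimal sketch (hypothetical function, not from any of the crates mentioned):

```rust
// Sketch of byte-level scanning: find the next inline-special byte
// without decoding UTF-8. This is safe because UTF-8 continuation
// bytes (0x80..=0xBF) can never collide with these ASCII values,
// so a match can only land on a real character boundary.
fn next_special(buf: &[u8], from: usize) -> Option<usize> {
    buf[from..]
        .iter()
        .position(|&b| matches!(b, b'*' | b'_' | b'`' | b'[' | b'\\'))
        .map(|i| from + i)
}
```

This is the property MD4C exploits: as long as the scanner only compares against ASCII bytes, it never needs to validate or decode the input.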


kivikakk commented 6 years ago

Have pushed a couple of quick wins based on flamegraph analysis (85b35f748302db092872603dd5d3d92da714b15d, 979b0ba55b155d8c3ca94c4a113890e745c6e414). There's heaps more to be done, I'm sure.

(This chopped off ~10–15% on a large document's runtime in debug, though the difference is harder to measure after compiler optimizations.)

kivikakk commented 6 years ago

cmark's make bench now reports comrak taking ~1.59x as long as cmark (down from ~1.77x). I still get a rather higher number than you for the spec tests (2.41s for release-mode comrak vs. 0.85s for cmark).

kivikakk commented 6 years ago

I take that back — I was using a mode intended to make make bench fairer to comrak.

kivikakk commented 6 years ago

I tried replacing all String/&str with Vec<u8>/&[u8], and was able to get a noticeable increase in speed. The branch is extremely rough so I'll pretty it up soon.

Keats commented 6 years ago

On my machine for the latest versions:

test bench_comrak_article              ... bench:     205,675 ns/iter (+/- 28,488)
test bench_comrak_awesome_rust         ... bench:   2,904,542 ns/iter (+/- 329,912)
test bench_pulldown_cmark_article      ... bench:     148,004 ns/iter (+/- 17,467)
test bench_pulldown_cmark_awesome_rust ... bench:   1,806,455 ns/iter (+/- 209,965)

Bench source: https://github.com/Keats/markdown-benches-rs. Oddly, comparing with the benchmark in the README from 8 months ago, the numbers don't seem to have changed much. I'll open a PR to add the benchmark directly to comrak when I get some time.

kivikakk commented 6 years ago

We are definitely always going to be slower than pulldown as we construct an AST, so this is worth keeping in mind.
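Abstractly, the difference in shape is this (hypothetical types for illustration, not either crate's real API): a streaming parser yields a flat sequence of events and never materializes a tree, while an AST parser must allocate and link a node per element before the consumer sees anything.

```rust
// Hypothetical sketch of the two parser shapes (not either crate's API).

// Streaming: output is a flat event sequence the consumer drives;
// no tree is ever built.
#[derive(Debug, PartialEq)]
enum Event<'a> {
    Start(&'a str),
    Text(&'a str),
    End(&'a str),
}

// AST: every element becomes a heap node with child links up front.
// That allocation and linking is the inherent cost an AST-building
// parser pays that a streaming one does not.
struct AstNode<'a> {
    tag: &'a str,
    children: Vec<AstNode<'a>>,
}

// The same tiny document in each shape:
fn as_events() -> Vec<Event<'static>> {
    vec![Event::Start("p"), Event::Text("hi"), Event::End("p")]
}

fn as_ast() -> AstNode<'static> {
    AstNode {
        tag: "p",
        children: vec![AstNode { tag: "hi", children: vec![] }],
    }
}
```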

dragostis commented 6 years ago

I don't know how much of the pie pest eats in the benchmarks (since I'm dealing with a slow connection for a week or so), but I've recently added an extra optimization step for (!("a" | "b" | ... ) ~ any)* expressions which should make for a non-trivial improvement in performance. It's lurking on the v1.1 branch.

kivikakk commented 6 years ago

I gave it a try on a progit benchmark, but there wasn't a measurable improvement; I don't think we rely on the lexer enough to make a significant difference, alas. Thank you though!

brson commented 6 years ago

FWIW, one of my mandates is to focus on comrak performance. I have no idea what will come of it (a lot of the lowest-hanging fruit is in Reddit's downstream changes), but I would love to make comrak competitive with cmark. I'm hopeful we can come up with something clever to close the gap.

But yeah, benchmarks against cmark would be more comparable than benchmarks against pulldown.

ldm0 commented 4 years ago

I forked @Keats's benchmark repo and tested the latest versions of the two crates, with astonishing results (i7-7700HQ):

test bench_comrak_article              ... bench:     415,293 ns/iter (+/- 93,497)
test bench_comrak_awesome_rust         ... bench:   9,720,510 ns/iter (+/- 2,561,997)
test bench_pulldown_cmark_article      ... bench:      60,575 ns/iter (+/- 11,240)
test bench_pulldown_cmark_awesome_rust ... bench:     868,952 ns/iter (+/- 293,981)

It seems the performance has actually regressed a lot over the past two years. I think there are definitely many things we can do to improve it. Like many users of comrak I'd like to help, but I have no direction.

YJDoc2 commented 2 years ago

Hey, I'm not sure whether the speed issue is related to RefCells, but if it is, would it help to remove them? I was looking around the source, and as far as I understand, the main reason for using them is to store references to parent/child nodes in the tree. If that is the only reason, then we are still upholding Rust's rule of multiple immutable refs or a single mutable ref at any given time, just manually rather than at compile time.

If that is the case, would it be possible to eliminate RefCells altogether using generational-arena? Its insert method returns an index, which can be thought of as a UUID for the inserted element. The index itself is made up of Copy types such as usize and u64, and retrieving an inserted item goes through this index.

Using this, we could store node indices in the tree rather than references, and fetch node contents from the arena by index when needed. I tried a naive/brute-force implementation, which does not compile, as the current structure's lifetime bounds are not satisfied; that tells me adopting this would need some heavy restructuring of the tree. Another thing to note is that gen-arena needs &mut self in order to insert or get a mutable ref to inserted elements, so that would have to change too, as the typed-arena currently used can allocate with &self.

gen-arena is backed by a Vec, so I think access time should be O(1) and insert time amortized O(1), but if it layers something else over the Vec internally this might differ.

If RefCell access takes a large share of the runtime, then maybe this can help; if it is not a big issue, gen-arena might not do much, or might make things worse.

What do you think?
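The idea above can be sketched as follows (hypothetical types, not comrak's or generational-arena's API): node links become plain Copy indices into a Vec, so the links themselves need no RefCell, at the cost that all mutation now goes through &mut on the whole tree.

```rust
// Sketch of the index-based idea: links are plain Copy indices into a
// Vec, so no RefCell is needed on the links. Hypothetical types; the
// real generational-arena also carries a generation counter per index
// to catch slot reuse.
#[derive(Clone, Copy, PartialEq, Debug)]
struct NodeId(usize);

struct Node {
    parent: Option<NodeId>,
    data: String,
}

struct Tree {
    nodes: Vec<Node>,
}

impl Tree {
    fn new() -> Tree {
        Tree { nodes: Vec::new() }
    }

    // Nodes are never removed, only unlinked, so indices stay valid.
    // Note this needs &mut self, unlike typed-arena's &self alloc.
    fn push(&mut self, parent: Option<NodeId>, data: &str) -> NodeId {
        let id = NodeId(self.nodes.len());
        self.nodes.push(Node { parent, data: data.to_string() });
        id
    }

    // Each lookup is a Vec index: O(1), but still bounds-checked,
    // roughly analogous to RefCell's runtime flag check.
    fn parent(&self, id: NodeId) -> Option<NodeId> {
        self.nodes[id.0].parent
    }

    fn data(&self, id: NodeId) -> &str {
        &self.nodes[id.0].data
    }
}
```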

YJDoc2 commented 2 years ago

Hey, so after spending some time implementing the gen-arena tree the wrong way, and then some more time half-implementing it the right way, I'm not sure this will improve things much. The issue, I feel, is that we need references to parent/sibling nodes, which we currently get through RefCell, and that takes some time — but I think that is still O(1)? Using gen-arena and its Index instead of RefCell gives the same effect, except that instead of calling RefCell's get* methods, we call gen-arena's.

Conceptually, we could remove it altogether: store all our nodes in a Vec, never actually delete a node (just unlink it from the tree), and use positions in the Vec instead of RefCells for the exact same effect. But every Vec access would still involve bounds-check overhead, which should be roughly equivalent to RefCell's internal checks (?). In C this can be avoided, since we can store pointers directly and follow the chain of pointers (a->b->c.data), which is fast; in Rust, references like this will always involve some overhead (?) and so will always be slower in that respect.

Personally I have only seen ASTs constructed and traversed in the root->leaf direction, which works because each parent owns its children, but the code here accesses parent values in quite a few places, so we cannot do it that way. I'm not sure this analysis is correct, so it would be great to hear some opinions from someone with more insight. Thanks :)

kivikakk commented 1 week ago

Closing this as I think the basis of it is unlikely to change — we construct an AST in the same way as cmark, but we have the added weight of dynamic borrow checks which we cannot avoid.

Similarly, direct comparisons of Markdown-to-HTML conversion between Comrak and pulldown-cmark are not quite apples-to-apples, since pulldown-cmark's very design means it doesn't need to construct an AST, which is exactly where the "unavoidable" part of Comrak's overhead lies.

Performance-wise we should have experienced improvements since we were last measuring, particularly due to the adoption of re2c (context). Here's the same benchmark repository, but with current versions used (Comrak 0.24.1 and pulldown-cmark 0.11.0):

test bench_comrak_article              ... bench:      80,254 ns/iter (+/- 702)
test bench_comrak_awesome_rust         ... bench:     857,103 ns/iter (+/- 9,545)
test bench_pulldown_cmark_article      ... bench:      27,397 ns/iter (+/- 256)
test bench_pulldown_cmark_awesome_rust ... bench:     397,781 ns/iter (+/- 4,406)

Here's the bench run on the same machine with crate versions as of May 3, 2020 (Comrak 0.7.0 and pulldown-cmark 0.7.1):

test bench_comrak_article              ... bench:     148,573 ns/iter (+/- 1,156)
test bench_comrak_awesome_rust         ... bench:   4,101,839 ns/iter (+/- 41,441)
test bench_pulldown_cmark_article      ... bench:      24,294 ns/iter (+/- 230)
test bench_pulldown_cmark_awesome_rust ... bench:     329,409 ns/iter (+/- 1,885)

Removing parent pointers is an interesting thought, @YJDoc2, and might simplify some aspects of the tree maintenance and node size, but it would push work to the main code to maintain and juggle lists of ancestors instead.

It's not impossible, but I imagine it would complicate things considerably, and I think it's unlikely to make the difference we hope for, as it won't obviate the need for dynamic borrow checks and thus RefCell.