houseabsolute / Markdent

An event-based Markdown parser toolkit
http://metacpan.org/release/Markdent/
Other
12 stars 13 forks source link

Room for improvement with respect to speed #9

Closed frioux closed 7 years ago

frioux commented 7 years ago

I built a tool to extract the plaintext from blog posts for the purposes of spell checking, and eventually doing some basic grammar checks too.

Running it against my entire corpus (which I would like to do before pushing each post in the future) takes about 7 seconds right now. My initial find is that ::SpanParser introduces some inefficiencies (not obvious how they can be improved) as to the type constraints. See attached NYTProf report:

zfp-selection-201

I know you have a new type constraint lib; that is likely to shave off about half a second I would guess, but if we could either make SpanParser more efficient or bypass it somehow, I suspect that would be the best option.

What do you think?

frioux commented 7 years ago

Oh yeah, if you wanted to reproduce the above, just:

git clone git://github.com/frioux/blog
cd blog
perl -d:NYTProf bin/plaintext content/posts/*.md content/pages/*.md > /dev/null
autarch commented 7 years ago

The whole design of this module is pretty inefficient compared to many parsers, but I don't see a way to make it both highly efficient and extensible.

The new module you're thinking of is Params::ValidationCompiler. This greatly speeds up per-sub parameter checks compared to the alternatives. Specio is really not any faster than using MX::Types for most cases, but I think Markdent triggers a pathologically slow bit of MX::Types. Specio doesn't have this issue.

I'll switch over to Specio & PVC and we can see if that helps.

autarch commented 7 years ago

Just to clarify. I think this code can definitely be made faster, but it will never be super fast.

frioux commented 7 years ago

Yeah I hear you. My other idea was that the event stream could emit simple hashes, with an optional layer on top that turns those into objects, which I think could make it faster?

autarch commented 7 years ago

I just pushed a branch that switches to Specio & Params::ValidationCompiler. That should help at least a little. No doubt there are other small optimizations to be made too.

As far as switching the event stream to hashes, I would definitely expect that to be faster, but it'd require a huge amount of surgery. If you look at the internals you'll see that the parser itself uses the event objects (look for pending_events, for example).

autarch commented 7 years ago

The profile after the switch eliminates type-related code from the top time consumers, so that's a win. There's definitely more to be done here too. I think one big thing is making the SpanParser->_open_start_event_for_span method faster. There are some obvious wins there like not calling the same method twice. I suspect it may also be possible to simply not call it as much as well.

frioux commented 7 years ago

Yeah, shaves off a little over a second for me, which is nice! FWIW I think you're right re that method.

autarch commented 7 years ago

I pushed another branch called "faster". It makes some small changes to the span parser code that should shave another 10-20% off your parsing time. There are probably further improvements possible with bigger changes to the code, but this one was the simplest big win.

autarch commented 7 years ago

This was all released just now in v0.27

frioux commented 7 years ago

Cool, thanks! I'll close this for now since it doesn't sound like you are thrilled to do any surgery to use hashes (or even arrayrefs) in an underlying event stream ;)

autarch commented 7 years ago

Patches welcome. I suspect there's more smallish improvements to be made without major surgery, but I'm unlikely to rewrite the entire internals any time soon.

frioux commented 7 years ago

Yeah if I get annoyed enough I'll shoot some your way