Closed frioux closed 7 years ago
Oh yeah, if you wanted to reproduce the above, just:
git clone git://github.com/frioux/blog
cd blog
perl -d:NYTProf bin/plaintext content/posts/*.md content/pages/*.md > /dev/null
The whole design of this module is pretty inefficient compared to many parsers, but I don't see a way to make it both highly efficient and extensible.
The new module you're thinking of is Params::ValidationCompiler
. This greatly speeds up per-sub parameter checks compared to the alternatives. Specio is really not any faster than using MX::Types for most cases, but I think Markdent triggers a pathologically slow bit of MX::Types. Specio doesn't have this issue.
I'll switch over to Specio & PVC and we can see if that helps.
Just to clarify. I think this code can definitely be made faster, but it will never be super fast.
Yeah I hear you. My other idea was that the event stream could emit simple hashes, with an optional layer on top that turns those into objects, which I think could make it faster?
I just pushed a branch that switches to Specio & Params::ValidationCompiler. That should help at least a little. No doubt there are other small optimizations to be made too.
As far as switching the event stream to hashes, I would definitely expect that to be faster, but it'd require a huge amount of surgery. If you look at the internals you'll see that the parser itself uses the event objects (look for pending_events
, for example).
The profile after the switch eliminates type-related code from the top time consumers, so that's a win. There's definitely more to be done here too. I think one big thing is making the SpanParser->_open_start_event_for_span
method faster. There are some obvious wins there like not calling the same method twice. I suspect it may also be possible to simply not call it as much as well.
Yeah, shaves off a little over a second for me, which is nice! FWIW I think you're right re that method.
I pushed another branch called "faster". It makes some small changes to the span parser code that should shave another 10-20% off your parsing time. There are probably further improvements possible with bigger changes to the code, but this one was the simplest big win.
This was all released just now in v0.27
Cool, thanks! I'll close this for now since it doesn't sound like you are thrilled to do any surgery to use hashes (or even arrayrefs) in an underlying event stream ;)
Patches welcome. I suspect there's more smallish improvements to be made without major surgery, but I'm unlikely to rewrite the entire internals any time soon.
Yeah if I get annoyed enough I'll shoot some your way
I built a tool to extract the plaintext from blog posts for the purposes of spell checking, and eventually doing some basic grammar checks too.
Running it against my entire corpus (which I would like to do before pushing each post in the future) takes about 7 seconds right now. My initial find is that ::SpanParser introduces some inefficiencies (not obvious how they can be improved) as to the type constraints. See attached NYTProf report:
I know you have a new type constraint lib; that is likely to shave off about half a second I would guess, but if we could either make SpanParser more efficient or bypass it somehow, I suspect that would be the best option.
What do you think?