kschiess / parslet

A small PEG based parser library. See the Hacking page in the Wiki as well.
kschiess.github.com/parslet
MIT License

any.repeat is slow -- what about a `finished` atom? #146

Closed: sheldon-b closed this issue 8 years ago

sheldon-b commented 8 years ago

I'm using parslet to detect snippets of a scripting language embedded in a larger Ruby DSL source file. These DSL source files are often large, consisting of thousands of lines. A typical structure might be something like this:

<Ruby DSL code>
<scripting language block>
<scripting language block>
<scripting language block>
<More Ruby DSL code>

Parslet is used to detect and extract the snippets of scripting language from the larger DSL source file. The DSL is then parsed in Ruby, and the extracted scripting snippets are parsed using a parslet parser.

I detect the beginning of each scripting block by scanning for a particular pattern, and then use a parslet parser to detect the end of the scripting block. The parser used for this is essentially:

require 'parslet'

class SingleBlockParser < Parslet::Parser
  root :embedded_block
  rule(:embedded_block) {
    script_block.as(:BLOCK) >> # Definition left out
    str("\n").as(:END_OF_BLOCK) >>
    any.repeat # Skips the rest of the file. This is really slow!
  }
end

This works well, but the any.repeat pattern is very slow to parse. As you can see, any.repeat is only there to skip the remainder of the file. This became an issue when some files took up to 90 s to parse. I monkey-patched in a Finished atom, which consumes the remainder of the input and always succeeds. It cut the parse time for the same file to 13 s.

class SingleBlockParser < Parslet::Parser
  root :embedded_block
  rule(:embedded_block) {
    script_block.as(:BLOCK) >> # Definition left out
    str("\n").as(:END_OF_BLOCK) >>
    finished # Monkey-patched atom, not part of parslet. Much faster
  }
end

It seems like this could be a useful case for other people as well. What do you think?

kschiess commented 8 years ago

What is your proposition?

sheldon-b commented 8 years ago

My proposal is to add a Finished atom which consumes the remaining input and always succeeds. See PR #148.

sheldon-b commented 8 years ago

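Specifying prefix: true in the call to Parser#parse achieves the same effect: the parse succeeds even though the rest of the input is not consumed, so neither any.repeat nor a Finished atom is needed.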