kschiess / parslet

A small PEG based parser library. See the Hacking page in the Wiki as well.
kschiess.github.com/parslet
MIT License
809 stars 95 forks source link

Bug with | or incorrect rules? #142

Closed kernow closed 9 years ago

kernow commented 9 years ago

I'm not sure if this is a bug or that the rules are incorrect.

class MyParser < Parslet::Parser
  rule(:block_tag) { str('xx') | str('xxx') | str('aa') | str('aaa') }
  rule(:open) { str('[') >> block_tag >> str(']') }
  root(:open)
end

parser = MyParser.new

parser.block_tag.parse('xx') # works fine
parser.block_tag.parse('xxx') # works fine
parser.block_tag.parse('aa') # works fine
parser.block_tag.parse('aaa') # works fine

parser.open.parse('[xx]') # works fine
parser.open.parse('[xxx]') # fails to match
parser.open.parse('[aa]') # works fine
parser.open.parse('[aaa]') # fails to match

class MyOtherParser < Parslet::Parser
  rule(:block_tag) { str('xxx') | str('xx') | str('aaa') | str('aa') }
  rule(:open) { str('[') >> block_tag >> str(']') }
  root(:open)
end

parser = MyOtherParser.new

parser.block_tag.parse('xx') # works fine
parser.block_tag.parse('xxx') # works fine
parser.block_tag.parse('aa') # works fine
parser.block_tag.parse('aaa') # works fine

parser.open.parse('[xx]') # works fine
parser.open.parse('[xxx]') # works fine
parser.open.parse('[aa]') # works fine
parser.open.parse('[aaa]') # works fine

If xx comes before xxx in the block_tag rule then [xxx] fails to match in the open rules, where as if it comes after then [xxx] is matched.

floere commented 9 years ago

Your example is a bit confusing as it says "Fails to match" for both [xxx] cases. So the second set of cases all match?

kernow commented 9 years ago

Sorry, I cut and paste too much and didn't check it properly. I've edited the original comment now. In the first example xxx and aaa fail to match, in the second example everything matches.

floere commented 9 years ago

What was your expectation?

Oh, note that you can use #cause on the ParseFailed to get more info:

begin
  parser.open.parse('[xxx]') # fails to match
rescue Parslet::ParseFailed => pf
  require 'pp'
  pp pf.cause
end
kernow commented 9 years ago

I would expect all of the examples to match no matter what order the strings are defined within the :block_tag rule. I've used the cause but it's not making any sense to me as it's saying it's found " ," which isn't in any of the strings.

#<Parslet::Cause:0x007fcea3a21970
 @children=
  [#<Parslet::Cause:0x007fcea3a21a88
    @children=nil,
    @message=["Expected \"]\", but got ", "x"@3],
    @pos=#<Parslet::Position:0x007fcea3a21b78 @bytepos=3, @string="[xxx]">,
    @source=
     #<Parslet::Source:0x007fcea3a22d70
      @line_cache=#<Parslet::Source::LineCache:0x007fcea3a22ca8 @line_ends=[]>,
      @re_cache={1=>/(.|$){1}/m, 2=>/(.|$){2}/m},
      @str=#<StringScanner 0/5 @ "[xxx]">>>],
 @message="Failed to match sequence ('[' BLOCK_TAG ']')",
 @pos=#<Parslet::Position:0x007fcea3a21998 @bytepos=3, @string="[xxx]">,
 @source=
  #<Parslet::Source:0x007fcea3a22d70
   @line_cache=#<Parslet::Source::LineCache:0x007fcea3a22ca8 @line_ends=[]>,
   @re_cache={1=>/(.|$){1}/m, 2=>/(.|$){2}/m},
   @str=#<StringScanner 0/5 @ "[xxx]">>>
kernow commented 9 years ago

Note that the issue only occurs when one sting is a substring of the other, e.g. 'xx' is a substring of 'xxx' and when the smaller string is defined in the rule before the larger, e.g. 'xx' before 'xxx'. When distinct strings are used e.g. aa, bb, cc, dd, ee etc. the issue is not present.

floere commented 9 years ago

In a PEG, order matters. The choice operator | is a prioritized choice operator, which means that it will consume the first successful match and move on. That is why there is an x remaining where it expects a ].

This gives you more power over which parse tree is selected, but you need to be more careful about the prioritisation. In your case, the more specific xxx should be evaluated before the xx since the xx is contained in the xxx.

Here (http://www.brynosaurus.com/pub/lang/peg.pdf) is a nice paper on PEGs.

I hope that helps :)

kernow commented 9 years ago

Thanks for the explanations @floere, that makes sense. What would be the best way to support the list of block_tags defined in any order? These rules are defined by users in my use case and so could be listed in any order. I guess ordering the list by length with the longest first before applying them to the rule might work?

floere commented 9 years ago

If those rules are always of the class str('…'), with the users' strings inserted in place of the , then ordering these str rules by decreasing length would work.

floere commented 9 years ago

@kernow My pleasure 😊