peggyjs / peggy

Peggy: Parser generator for JavaScript
https://peggyjs.org/
MIT License
914 stars 65 forks source link

Possibly broken branching when expressions are repetitions #545

Closed devmattrick closed 2 weeks ago

devmattrick commented 2 weeks ago

I definitely could be missing something here, but I'm having a bit of trouble creating a "fallback" rule for when parsing of one rule fails. For example:

Expr = A / Anything
A = "a"+
Anything = .*

For an input like aaaab, I'd expect this grammar to first test the A rule, which would fail due to the b being present in the input. It should then try the Anything rule which should be able to parse this input successfully (this can be confirmed by removing the A branch).

However, it seems that it fails at the first sign of trouble when parsing the A branch:

Error: Expected "a" or end of input but "b" found.
 --> input:1:5
  |
1 | aaaab
  |     ^

This behavior seems odd to me since the documentation for branches states:

Try to match the first expression, if it does not succeed, try the second one, etc. Return the match result of the first successfully matched expression. If no expression matches, consider the match failed.

It does seem to work with a simpler grammar like:

Expr = A / Anything
A = "a"
Anything = .

I tried to work around this issue by doing the following:

Expr = (!A Anything) / A
A = "a"+
Anything = .*

But I still get the same error, I think possibly because the A rule is being processed for the negation and causes the parsing to fail early still?

devmattrick commented 2 weeks ago

I did some digging into the Peggy source code and I believe I may have found the root cause for this. For rules that are "optional" in some way (i.e. A? or A*), they are treated as "always matches" which causes the parser generator to "ignore" any following branches since they are always expected to match.

I'm not sure if this is intended behavior -- it seems a bit counterintuitive to me. I believe changing these rules to be "sometimes match" rules fixes the issue I was experiencing. Is this an acceptable solution? If so, I'm happy to make a PR for this!

Mingun commented 2 weeks ago

This is intended. Your A rule is matched, but implicit EOF rule after it is not, that is why you get error. In other words, you should read your grammar as:

Expr = (A / Anything) EOF
A = "a"+
Anything = .*
EOF = !.
devmattrick commented 2 weeks ago

Ah, that makes sense. The A rule is matched successfully which makes it not follow the Anything branch so the only valid characters here are a or an EOF.

I guess in this case the fix that would give me my intended behavior is asserting in the A rule that the as reach to the EOF so it will not match for as mixed with other characters:

Expr = A / Anything
A = "a"+ EOF
Anything = .+ EOF
EOF = !.

Thanks for the help -- sorry for the noise!