Infinite left-recursion when seed parent might be empty

CptWesley commented 1 year ago

Hi. While working on a C# implementation of this algorithm, I ran into an interesting case which caused the parser to run into an infinite loop. To verify that it wasn't just a minor error of mine I also attempted to recreate the same grammar in the reference implementation and ran into the same problem.

The specific case:

List<Rule> rules = Arrays.asList(
  new Rule("A",
    new First(
      new CharSet('a'),
      new Nothing()
    )
  ),
  new Rule("CHAIN",
    new Seq(
      new RuleRef("CHAIN"),
      new RuleRef("A")
    )
  )
);

var grammar = new Grammar(rules);
var mt = grammar.parse("a");

System.out.println("done");

If I'm not mistaken, this runs into an infinite loop because the CHAIN rule may match an empty string, but is also its own (indirect) seed parent, causing it to always get added to the priority queue (which will therefore never be empty). This gives a debug trace of:

...
Matched: CHAIN <- CHAIN A : 0+1
    Following seed parent clause: CHAIN <- CHAIN A
Matched: CHAIN <- CHAIN A : 0+1
    Following seed parent clause: CHAIN <- CHAIN A
Matched: CHAIN <- CHAIN A : 0+1
    Following seed parent clause: CHAIN <- CHAIN A
Matched: CHAIN <- CHAIN A : 0+1
    Following seed parent clause: CHAIN <- CHAIN A
...

Of course, this grammar can be easily rewritten in order to not be left-recursive, but from what I understand the algorithm is supposed to be able to handle these cases. I was wondering if you had any suggestions on how the implementation could be modified in order to be able to handle these cases? My initial thoughts are to simply check if we already attempted to match a clause at a specific position in the input and if so, do not add the clause to the priority queue. However, I'm not sure if this would have other unforeseen side effects on the workings of the algorithm.

CptWesley commented 1 year ago

After some tinkering, using my suggested fix breaks some of the unit tests, so that is not a proper solution for this problem.

lukehutch commented 1 year ago

Hi @CptWesley,

Thanks for finding and reporting this case. It has been a long time since I looked at this algorithm, but off the top of my head, you should be able to detect this case and not add the parent rule if the rule already matched the empty string, and the rule is its own parent.

However, things get trickier when there is an indirect cycle that consumes zero characters for each loop around the cycle (i.e. if a rule is only its own indirect parent).

I suspect the more general fix is to not move upwards to parent phrases if the match at a given start position is exactly the same length as a previous match in that position. Actually I thought that's what the logic already did (since the matching logic requires that matches get at least one character longer for each loop around a grammar cycle), so the implementation may just be buggy.

Let me know if that helps at all...

CptWesley commented 1 year ago

@lukehutch Thank you for your response. When fiddling with things I came to a similar solution as you proposed, but I wasn't sure if that did not create new degenerate cases. In the current reference implementation the requirement of matches growing in size before looping is not enforced. Inside the addMatch the potentially zero-character matching seed parents always get added to the queue, regardless whether or not the clause even matched at all, so any cycle of potentially zero-character matching clauses can loop around the cycle without consuming more characters.

I had a bit of an issue understanding the intuition of simply always adding seed parents that might be empty to the queue. I believe it's to ensure that other matches that are of size 0 also get considered along the line, but I found it hard to think of any scenario where this would be necessary. Removing this second condition also still allows all unit tests to pass.

exaexa commented 1 year ago

Hi @CptWesley ! I think your grammar rule chain should never match anything -- its parsetree is forced to be infinite by the definition, which cannot ever happen. I tried the grammar with my Julia implementation (see https://github.com/LCSB-BioCore/PikaParser.jl) and it doesn't go into infinite loop, so I assume that one of the epsilon-match-avoiding-rules might have come to play. -- Most probably, it might be the one with "match that goes against the topo-order of the rules must be strictly bigger than submatches".

Reference code:

import PikaParser as P

rules = Dict(
    :A => P.first(P.token('a'), P.epsilon),
    :chain => P.seq(:chain, :A),
)

g = P.make_grammar([:chain, :A], P.flatten(rules, Char))

input = "aaaa"

p = P.parse(g, input)

Hope this helps!

CptWesley commented 1 year ago

@exaexa The definition is only infinite for recursive decent parsering due to the left recursion. Which is the problem that this algorithm/paper tries to solve. Perhaps this specific case is not one of the cases that is supposed to be covered (it's been a while since I last looked into this). But this grammar works fine for the family of left-recursion growing parsers (https://dl.acm.org/doi/10.1145/1328408.1328424). Although the definition of "fine" might differ on who you'd ask, since from my understanding these tricks to support left-recursion are usually not fully defined and often not complete (https://www.jstage.jst.go.jp/article/ipsjjip/29/0/29_174/_article).

exaexa commented 1 year ago

@CptWesley

But this grammar works fine for the family of left-recursion growing parser [ref]

My main question was basically about if you sure that you have the correct grammar there-- yours is basically:

A ::= 'a' | ε
chain ::= <chain> <A>

There, the chain recursion does not allow any termination ever -- the rule chain will always spawn one more chain, thus any parse tree that contains chain is forced to be infinite, and I'm not sure if we want to consider an infinite parsetree as a "match" (by all definitions that I ever saw, these are considered unmatchable).

Thus your grammar would be perfectly equivalent to just:

A ::= 'a' | ε

...which corresponds to what I'm getting (in my test I get 4 independent matches of a).

The closest to your grammar that the paper reports is IMO this one:

expr ::= <expr> "-" <num> / <num>

Anyway, your grammar can be converted to a one where chain can be actually matched:

A ::= 'a' | ε
chain ::= <chain> <A> | <A>

...and there I'm actually getting the infinite recursion too. Time to debug! :]

CptWesley commented 1 year ago

@exaexa You're right. I didn't properly inspect my grammar, I assumed it was written as the last snippet you posted. However, I would expect it should be able to detect this and terminate, rather than loop forever, so I think your Julia implementation does better in that regard 😅.

exaexa commented 1 year ago

haha yeah I put some time into making sure the "don't match self again" invariant holds, yet alas in this case it was obviously insufficient. I still think this is fixable though, the loop in the algorithm is detectable and is visibly invalid as it is trying to force new repeated matches of stuff that is already in the match table. The main logic problem is with the epsilon-shortcutting logic there, roughly:

matching the 'a' in A seeds matching of chain
chain fails to find matches for the 1st rule (seq <chain> <A>) normally, but succeeds because <chain> is capable of matching ε by the "can match ε" logic
matching the epsilon seeds chain
the rematch-avoiding logic allows the seeding to continue because it still allows the clauses that can match epsilon, which includes chain (see https://github.com/LCSB-BioCore/PikaParser.jl/blob/e0515317819c6d8ac4b59fb80f268e1142d22152/src/parse.jl#L51 )

I'll be back with updates.

exaexa commented 1 year ago

OK so the actual inconsistency in this grammar

A ::= 'a' | ε
chain ::= <chain> <A> | <A>

seems to be as follows:

chain can match ε
pikaparsers depend on the property that anything in first clauses is NOT able to match ε, except for the last rule. If that is the case, the first clause may never reach the other subclauses after the possible ε-match, because it is greedy and it will always try to succeed first. AFAIK, this is actually a general property of all PEGs.
the first rule in the first-type clause of chain could match ε via rule1, expanding as chain1(chain2(A(ε)),A(ε)), thus the second rule (recursion termination) won't ever be applied.

The few things we thus found so far:

my checking of first-type clauses obv sucks and I need to throw an error reliably if this is detected (@CptWesley so thanks a lot for bringing this up, issues will be fixed!)
if you avoid the possible ε-match in the first chain rule, it works, grammar as below (copying from julia #etoolazy):
```
rules = Dict(
:A => P.token('a'),
:chain => P.first(P.seq(:chain, :A), :A),
)
```

it even works with indirect recursion:

rules = Dict(
:A => P.token('a'),
:X => P.seq(:chain),
:chain => P.first(P.seq(:X, :A), :A),
)

another bug possibly found: I noticed that the thing works (and doesn't crash) if I remove the check for seeding ε-matching clauses at https://github.com/LCSB-BioCore/PikaParser.jl/blob/e0515317819c6d8ac4b59fb80f268e1142d22152/src/parse.jl#L51 . This corresponds to the pikaparser paper of @lukehutch Listing 11, line 28. Surprisingly, everything works for me even if the extra ε-match check is removed -- basically after erasing the whole || ... part of that condition, and all tests stay OK an the example from here doesn't loop infinitely. I think the extra check here shouldn't be needed because ε-matches can moreless "seed themselves" wherever needed, right? @lukehutch any extra info on this would be very welcome. I think this is closely related to the section 6.1 in the paper which even seems to recommend avoiding ε-matches for the seq-type clauses. As the other option, maybe the extra check should be needed when an ε-match could be used to seed other clauses, but at that point it should not check the ε-matching-capability of the "parent" rule but of itself, which is subsumed by the fact that the table would get updated before. (TL;DR version: is that extra rule really needed?)

In perspective, I think this here might actually give a nice defining property of pikaparsers -- problematic left recursion is skipped on the basis of the no-ε-on-the-left rule, reflecting upon the overall nature of PEGs.

lukehutch / pikaparser

Infinite left-recursion when seed parent might be empty #35