Difference in matching of recursive "First"/"Ordered Choice" compared to top-down parsers

lukehutch / pikaparser

The Pika Parser reference implementation

MIT License

141 stars 12 forks source link

Difference in matching of recursive "First"/"Ordered Choice" compared to top-down parsers #32

Closed maxfischer2781 closed 2 years ago

maxfischer2781 commented 3 years ago

Hello, I have re-implemented the Pika algorithm in Python (https://github.com/maxfischer2781/bootpeg) but stumbled over a case where the Pika algorithm seems to behave notably different than a left-recursive Packrat. Not sure whether that's more appropriate for the paper or reference implementation, but it seems relevant for your work.

The context are left-recursive rules for left-associative binary operations. Consider for example Python's +/- grammar (translated to PEG from source grammar):

sum <- sum '+' term / sum '-' term / term

Notably, since this grammar is used for applying separate actions for + and - there are two almost identical choices of entire binary operation rules. For reference, the rule used in the paper/reference implementation internalises the choice to a single rule, avoiding the situation:

E0 <- (E0 ('+'/'-') E1) / E1;

The issue can be traced well with the above sum rule and an input such as this:

a + b - c

The important part is that the left-most operator term corresponds to the left-most ("lowest ClauseIdx") grammar term.

The desired parse tree, as matched by a left-recursive Packrat parser, looks like this:

          (sum) '-' (term)
            /           \
    (sum) '+' (term)    'c'
      /         \
  (term)        'b'
    /
  'a'

Basically, the parser matches term='a' allowing it to match sum=(term='a') '+' term='b' and finally sum=(sum='a + b') '-' term='c'. While the "- sum" rule has higher ClauseIdx than the "+ sum" rule, it is a valid match because it contains the lower ClauseIdx rule.

The Pika algorithm seems incapable of matching like this: Since a "- sum" match has higher ClauseIdx, it does not replace the previous "+ sum" match. Match:isBetterThan only allows to select by length or ClauseIdx, but not by nesting.

Without having completely worked out the logic or tested it yet, it seems to me there are two potential approaches:

Introduce a Generation in addition to the ClauseIdx. A match replaces another iff it has better Generation, same Generation and better ClauseIdx, or same Generation and same ClauseIdx and better length.
Introdcue a sub-clause check. A match replaces another iff it is a parent match of the other, or has better ClauseIdx, or same ClauseIdx and better length.

Since this only applies to left-recursive choices, it can probably be optimised when preparing a parser: There are "recursive choice", "choice" and "plain" clauses; only recursive choices need to be tracked for nesting.

lukehutch commented 3 years ago

Hi, Thanks for the feedback. I definitely ran into this issue with earlier versions of the algorithm, when I didn't have all the logic figured out, but with the algorithm described in the paper, and the reference implementation in git, your example is correctly parsed. (I extended the grammar slightly to support a series of statements, since this is the form used in one of the unit tests.)

It's possible that the pika parser algorithm does break in some cases, but it seems to handle this case just fine. Please let me know if you come up with another example that breaks the reference parser in addition to your reimplementation.

Source code:

    public static void main(String[] args) {
        final var grammar = MetaGrammar.parse("Program <- Statement+;\n" //
                + "Statement <- var:[a-z]+ '=' E ';';\n" //
                + "E[4] <- '(' E ')';\n" //
                + "E[3] <- num:[0-9]+ / sym:[a-z]+;\n" //
                + "E[2] <- arith:(op:'-' E);\n" //
                + "E[1,L] <- arith:(E op:('*' / '/') E);\n" //
                + "E[0,L] <- arith:(E op:('+' / '-') E);");
        final var memoTable = grammar.parse("a+b-c");
        ParserInfo.printParseResult("Program", memoTable, new String[] {"Statement"}, false);
    }

Result:

Clauses:
 25 : Program <- Statement+
 24 : Statement <- var:[a-z]+ '=' E ';'
 23 : E[0] <- arith:(E[0] op:('+' / '-') E[1]) / E[1]
 22 : E[0] op:('+' / '-') E[1]
 21 : E[1] <- arith:(E[1] op:('*' / '/') E[2]) / E[2]
 20 : E[1] op:('*' / '/') E[2]
 19 : E[2] <- arith:(op:'-' (E[2] / E[3])) / E[3]
 18 : op:'-' (E[2] / E[3])
 17 : E[2] / E[3]
 16 : E[3] <- (num:[0-9]+ / sym:[a-z]+) / E[4]
 15 : E[4] <- '(' E[0] ')'
 14 : num:[0-9]+ / sym:[a-z]+
 13 : [0-9]+
 12 : '*' / '/'
 11 : '+' / '-'
 10 : [a-z]+
  9 : ';'
  8 : ')'
  7 : '('
  6 : [0-9]
  5 : '/'
  4 : '*'
  3 : '-'
  2 : '+'
  1 : '='
  0 : [a-z]

Memo Table:
 25 : Program <- Statement+                            #=======-
 24 : Statement <- var:[a-z]+ '=' E ';'                #=======-
 23 : E[0] <- arith:(E[0] op:('+' / '-') E[1]) / E[1]  #-#====--
 22 : E[0] op:('+' / '-') E[1]                         --#====--
 21 : E[1] <- arith:(E[1] op:('*' / '/') E[2]) / E[2]  #-#-##=--
 20 : E[1] op:('*' / '/') E[2]                         ---------
 19 : E[2] <- arith:(op:'-' (E[2] / E[3])) / E[3]      #-#-##=--
 18 : op:'-' (E[2] / E[3])                             -----#=--
 17 : E[2] / E[3]                                      #-#-##=--
 16 : E[3] <- (num:[0-9]+ / sym:[a-z]+) / E[4]         #-#-#-#--
 15 : E[4] <- '(' E[0] ')'                             ---------
 14 : num:[0-9]+ / sym:[a-z]+                          #-#-#-#--
 13 : [0-9]+                                           ---------
 12 : '*' / '/'                                        ---------
 11 : '+' / '-'                                        ---#-#---
 10 : [a-z]+                                           #-#-#-#--
  9 : [terminal] ';'                                   -------#-
  8 : [terminal] ')'                                   ---------
  7 : [terminal] '('                                   ---------
  6 : [terminal] [0-9]                                 ---------
  5 : [terminal] '/'                                   ---------
  4 : [terminal] '*'                                   ---------
  3 : [terminal] '-'                                   -----#---
  2 : [terminal] '+'                                   ---#-----
  1 : [terminal] '='                                   -#-------
  0 : [terminal] [a-z]                                 #-#-#-#--
                                                       01234567
                                                       x=a+b-c;

Match tree for rule Program:
                                                       ┌───────────────┐  
                           25 : Program <- Statement+  │x = a + b - c ;│  
                                                       ├───────────────┤  
               24 : Statement <- var:[a-z]+ '=' E ';'  │x = a + b - c ;│  
                                                       │   ┌─────────┐ │  
 23 : E[0] <- arith:(E[0] op:('+' / '-') E[1]) / E[1]  │   │a + b - c│ │  
                                                       │   ├─────────┤ │  
                        22 : E[0] op:('+' / '-') E[1]  │   │a + b - c│ │  
                                                       │   ├─────┐   │ │  
 23 : E[0] <- arith:(E[0] op:('+' / '-') E[1]) / E[1]  │   │a + b│   │ │  
                                                       │   ├─────┤   │ │  
                        22 : E[0] op:('+' / '-') E[1]  │   │a + b│   │ │  
                                                       │   ├─┐   │   │ │  
 23 : E[0] <- arith:(E[0] op:('+' / '-') E[1]) / E[1]  │   │a│   │   │ │  
                                                       │   ├─┤ ┌─┤ ┌─┤ │  
 21 : E[1] <- arith:(E[1] op:('*' / '/') E[2]) / E[2]  │   │a│ │b│ │c│ │  
                                                       │   ├─┤ ├─┤ ├─┤ │  
     19 : E[2] <- arith:(op:'-' (E[2] / E[3])) / E[3]  │   │a│ │b│ │c│ │  
                                                       │   ├─┤ ├─┤ ├─┤ │  
        16 : E[3] <- (num:[0-9]+ / sym:[a-z]+) / E[4]  │   │a│ │b│ │c│ │  
                                                       │   ├─┤ ├─┤ ├─┤ │  
                         14 : num:[0-9]+ / sym:[a-z]+  │   │a│ │b│ │c│ │  
                                                       │   │ ├─┤ ├─┤ │ │  
                                       11 : '+' / '-'  │   │ │+│ │-│ │ │  
                                                       ├─┐ ├─┤ ├─┤ ├─┤ │  
                                          10 : [a-z]+  │x│ │a│ │b│ │c│ │  
                                                       │ │ │ │ │ │ │ ├─┤  
                                   9 : [terminal] ';'  │ │ │ │ │ │ │ │;│  
                                                       │ │ │ │ │ ├─┤ │ │  
                                   3 : [terminal] '-'  │ │ │ │ │ │-│ │ │  
                                                       │ │ │ ├─┤ │ │ │ │  
                                   2 : [terminal] '+'  │ │ │ │+│ │ │ │ │  
                                                       │ ├─┤ │ │ │ │ │ │  
                                   1 : [terminal] '='  │ │=│ │ │ │ │ │ │  
                                                       ├─┤ ├─┤ ├─┤ ├─┤ │  
                                 0 : [terminal] [a-z]  │x│ │a│ │b│ │c│ │  
                                                        0 1 2 3 4 5 6 7 
                                                        x = a + b - c ; 

====================================

Matches for Program <- Statement+ :
└─Program <- Statement+ : 0+8 : "x=a+b-c;"
  └─Statement <- var:[a-z]+ '=' E ';' : 0+8 : "x=a+b-c;"
    ├─var:[a-z]+ : 0+1 : "x"
    │ └─[a-z] : 0+1 : "x"
    ├─'=' : 1+1 : "="
    ├─E[0] <- arith:(E[0] op:('+' / '-') E[1]) / E[1] : 2+5 : "a+b-c"
    │ └─arith:(E[0] op:('+' / '-') E[1]) : 2+5 : "a+b-c"
    │   ├─E[0] <- arith:(E[0] op:('+' / '-') E[1]) / E[1] : 2+3 : "a+b"
    │   │ └─arith:(E[0] op:('+' / '-') E[1]) : 2+3 : "a+b"
    │   │   ├─E[0] <- arith:(E[0] op:('+' / '-') E[1]) / E[1] : 2+1 : "a"
    │   │   │ └─E[1] <- arith:(E[1] op:('*' / '/') E[2]) / E[2] : 2+1 : "a"
    │   │   │   └─E[2] <- arith:(op:'-' (E[2] / E[3])) / E[3] : 2+1 : "a"
    │   │   │     └─E[3] <- (num:[0-9]+ / sym:[a-z]+) / E[4] : 2+1 : "a"
    │   │   │       └─num:[0-9]+ / sym:[a-z]+ : 2+1 : "a"
    │   │   │         └─sym:[a-z]+ : 2+1 : "a"
    │   │   │           └─[a-z] : 2+1 : "a"
    │   │   ├─op:('+' / '-') : 3+1 : "+"
    │   │   │ └─'+' : 3+1 : "+"
    │   │   └─E[1] <- arith:(E[1] op:('*' / '/') E[2]) / E[2] : 4+1 : "b"
    │   │     └─E[2] <- arith:(op:'-' (E[2] / E[3])) / E[3] : 4+1 : "b"
    │   │       └─E[3] <- (num:[0-9]+ / sym:[a-z]+) / E[4] : 4+1 : "b"
    │   │         └─num:[0-9]+ / sym:[a-z]+ : 4+1 : "b"
    │   │           └─sym:[a-z]+ : 4+1 : "b"
    │   │             └─[a-z] : 4+1 : "b"
    │   ├─op:('+' / '-') : 5+1 : "-"
    │   │ └─'-' : 5+1 : "-"
    │   └─E[1] <- arith:(E[1] op:('*' / '/') E[2]) / E[2] : 6+1 : "c"
    │     └─E[2] <- arith:(op:'-' (E[2] / E[3])) / E[3] : 6+1 : "c"
    │       └─E[3] <- (num:[0-9]+ / sym:[a-z]+) / E[4] : 6+1 : "c"
    │         └─num:[0-9]+ / sym:[a-z]+ : 6+1 : "c"
    │           └─sym:[a-z]+ : 6+1 : "c"
    │             └─[a-z] : 6+1 : "c"
    └─';' : 7+1 : ";"

====================================

Matches for Statement <- var:[a-z]+ '=' E ';' :
└─Statement <- var:[a-z]+ '=' E ';' : 0+8 : "x=a+b-c;"
  ├─var:[a-z]+ : 0+1 : "x"
  │ └─[a-z] : 0+1 : "x"
  ├─'=' : 1+1 : "="
  ├─E[0] <- arith:(E[0] op:('+' / '-') E[1]) / E[1] : 2+5 : "a+b-c"
  │ └─arith:(E[0] op:('+' / '-') E[1]) : 2+5 : "a+b-c"
  │   ├─E[0] <- arith:(E[0] op:('+' / '-') E[1]) / E[1] : 2+3 : "a+b"
  │   │ └─arith:(E[0] op:('+' / '-') E[1]) : 2+3 : "a+b"
  │   │   ├─E[0] <- arith:(E[0] op:('+' / '-') E[1]) / E[1] : 2+1 : "a"
  │   │   │ └─E[1] <- arith:(E[1] op:('*' / '/') E[2]) / E[2] : 2+1 : "a"
  │   │   │   └─E[2] <- arith:(op:'-' (E[2] / E[3])) / E[3] : 2+1 : "a"
  │   │   │     └─E[3] <- (num:[0-9]+ / sym:[a-z]+) / E[4] : 2+1 : "a"
  │   │   │       └─num:[0-9]+ / sym:[a-z]+ : 2+1 : "a"
  │   │   │         └─sym:[a-z]+ : 2+1 : "a"
  │   │   │           └─[a-z] : 2+1 : "a"
  │   │   ├─op:('+' / '-') : 3+1 : "+"
  │   │   │ └─'+' : 3+1 : "+"
  │   │   └─E[1] <- arith:(E[1] op:('*' / '/') E[2]) / E[2] : 4+1 : "b"
  │   │     └─E[2] <- arith:(op:'-' (E[2] / E[3])) / E[3] : 4+1 : "b"
  │   │       └─E[3] <- (num:[0-9]+ / sym:[a-z]+) / E[4] : 4+1 : "b"
  │   │         └─num:[0-9]+ / sym:[a-z]+ : 4+1 : "b"
  │   │           └─sym:[a-z]+ : 4+1 : "b"
  │   │             └─[a-z] : 4+1 : "b"
  │   ├─op:('+' / '-') : 5+1 : "-"
  │   │ └─'-' : 5+1 : "-"
  │   └─E[1] <- arith:(E[1] op:('*' / '/') E[2]) / E[2] : 6+1 : "c"
  │     └─E[2] <- arith:(op:'-' (E[2] / E[3])) / E[3] : 6+1 : "c"
  │       └─E[3] <- (num:[0-9]+ / sym:[a-z]+) / E[4] : 6+1 : "c"
  │         └─num:[0-9]+ / sym:[a-z]+ : 6+1 : "c"
  │           └─sym:[a-z]+ : 6+1 : "c"
  │             └─[a-z] : 6+1 : "c"
  └─';' : 7+1 : ";"

====================================

Matches for E[0] <- arith:(E[0] op:('+' / '-') E[1]) / E[1] :
└─E[0] <- arith:(E[0] op:('+' / '-') E[1]) / E[1] : 0+1 : "x"
  └─E[1] <- arith:(E[1] op:('*' / '/') E[2]) / E[2] : 0+1 : "x"
    └─E[2] <- arith:(op:'-' (E[2] / E[3])) / E[3] : 0+1 : "x"
      └─E[3] <- (num:[0-9]+ / sym:[a-z]+) / E[4] : 0+1 : "x"
        └─num:[0-9]+ / sym:[a-z]+ : 0+1 : "x"
          └─sym:[a-z]+ : 0+1 : "x"
            └─[a-z] : 0+1 : "x"

└─E[0] <- arith:(E[0] op:('+' / '-') E[1]) / E[1] : 2+5 : "a+b-c"
  └─arith:(E[0] op:('+' / '-') E[1]) : 2+5 : "a+b-c"
    ├─E[0] <- arith:(E[0] op:('+' / '-') E[1]) / E[1] : 2+3 : "a+b"
    │ └─arith:(E[0] op:('+' / '-') E[1]) : 2+3 : "a+b"
    │   ├─E[0] <- arith:(E[0] op:('+' / '-') E[1]) / E[1] : 2+1 : "a"
    │   │ └─E[1] <- arith:(E[1] op:('*' / '/') E[2]) / E[2] : 2+1 : "a"
    │   │   └─E[2] <- arith:(op:'-' (E[2] / E[3])) / E[3] : 2+1 : "a"
    │   │     └─E[3] <- (num:[0-9]+ / sym:[a-z]+) / E[4] : 2+1 : "a"
    │   │       └─num:[0-9]+ / sym:[a-z]+ : 2+1 : "a"
    │   │         └─sym:[a-z]+ : 2+1 : "a"
    │   │           └─[a-z] : 2+1 : "a"
    │   ├─op:('+' / '-') : 3+1 : "+"
    │   │ └─'+' : 3+1 : "+"
    │   └─E[1] <- arith:(E[1] op:('*' / '/') E[2]) / E[2] : 4+1 : "b"
    │     └─E[2] <- arith:(op:'-' (E[2] / E[3])) / E[3] : 4+1 : "b"
    │       └─E[3] <- (num:[0-9]+ / sym:[a-z]+) / E[4] : 4+1 : "b"
    │         └─num:[0-9]+ / sym:[a-z]+ : 4+1 : "b"
    │           └─sym:[a-z]+ : 4+1 : "b"
    │             └─[a-z] : 4+1 : "b"
    ├─op:('+' / '-') : 5+1 : "-"
    │ └─'-' : 5+1 : "-"
    └─E[1] <- arith:(E[1] op:('*' / '/') E[2]) / E[2] : 6+1 : "c"
      └─E[2] <- arith:(op:'-' (E[2] / E[3])) / E[3] : 6+1 : "c"
        └─E[3] <- (num:[0-9]+ / sym:[a-z]+) / E[4] : 6+1 : "c"
          └─num:[0-9]+ / sym:[a-z]+ : 6+1 : "c"
            └─sym:[a-z]+ : 6+1 : "c"
              └─[a-z] : 6+1 : "c"

====================================

Matches for E[0] op:('+' / '-') E[1] :
└─E[0] op:('+' / '-') E[1] : 2+5 : "a+b-c"
  ├─E[0] <- arith:(E[0] op:('+' / '-') E[1]) / E[1] : 2+3 : "a+b"
  │ └─arith:(E[0] op:('+' / '-') E[1]) : 2+3 : "a+b"
  │   ├─E[0] <- arith:(E[0] op:('+' / '-') E[1]) / E[1] : 2+1 : "a"
  │   │ └─E[1] <- arith:(E[1] op:('*' / '/') E[2]) / E[2] : 2+1 : "a"
  │   │   └─E[2] <- arith:(op:'-' (E[2] / E[3])) / E[3] : 2+1 : "a"
  │   │     └─E[3] <- (num:[0-9]+ / sym:[a-z]+) / E[4] : 2+1 : "a"
  │   │       └─num:[0-9]+ / sym:[a-z]+ : 2+1 : "a"
  │   │         └─sym:[a-z]+ : 2+1 : "a"
  │   │           └─[a-z] : 2+1 : "a"
  │   ├─op:('+' / '-') : 3+1 : "+"
  │   │ └─'+' : 3+1 : "+"
  │   └─E[1] <- arith:(E[1] op:('*' / '/') E[2]) / E[2] : 4+1 : "b"
  │     └─E[2] <- arith:(op:'-' (E[2] / E[3])) / E[3] : 4+1 : "b"
  │       └─E[3] <- (num:[0-9]+ / sym:[a-z]+) / E[4] : 4+1 : "b"
  │         └─num:[0-9]+ / sym:[a-z]+ : 4+1 : "b"
  │           └─sym:[a-z]+ : 4+1 : "b"
  │             └─[a-z] : 4+1 : "b"
  ├─op:('+' / '-') : 5+1 : "-"
  │ └─'-' : 5+1 : "-"
  └─E[1] <- arith:(E[1] op:('*' / '/') E[2]) / E[2] : 6+1 : "c"
    └─E[2] <- arith:(op:'-' (E[2] / E[3])) / E[3] : 6+1 : "c"
      └─E[3] <- (num:[0-9]+ / sym:[a-z]+) / E[4] : 6+1 : "c"
        └─num:[0-9]+ / sym:[a-z]+ : 6+1 : "c"
          └─sym:[a-z]+ : 6+1 : "c"
            └─[a-z] : 6+1 : "c"

====================================

Matches for E[1] <- arith:(E[1] op:('*' / '/') E[2]) / E[2] :
└─E[1] <- arith:(E[1] op:('*' / '/') E[2]) / E[2] : 0+1 : "x"
  └─E[2] <- arith:(op:'-' (E[2] / E[3])) / E[3] : 0+1 : "x"
    └─E[3] <- (num:[0-9]+ / sym:[a-z]+) / E[4] : 0+1 : "x"
      └─num:[0-9]+ / sym:[a-z]+ : 0+1 : "x"
        └─sym:[a-z]+ : 0+1 : "x"
          └─[a-z] : 0+1 : "x"

└─E[1] <- arith:(E[1] op:('*' / '/') E[2]) / E[2] : 2+1 : "a"
  └─E[2] <- arith:(op:'-' (E[2] / E[3])) / E[3] : 2+1 : "a"
    └─E[3] <- (num:[0-9]+ / sym:[a-z]+) / E[4] : 2+1 : "a"
      └─num:[0-9]+ / sym:[a-z]+ : 2+1 : "a"
        └─sym:[a-z]+ : 2+1 : "a"
          └─[a-z] : 2+1 : "a"

└─E[1] <- arith:(E[1] op:('*' / '/') E[2]) / E[2] : 4+1 : "b"
  └─E[2] <- arith:(op:'-' (E[2] / E[3])) / E[3] : 4+1 : "b"
    └─E[3] <- (num:[0-9]+ / sym:[a-z]+) / E[4] : 4+1 : "b"
      └─num:[0-9]+ / sym:[a-z]+ : 4+1 : "b"
        └─sym:[a-z]+ : 4+1 : "b"
          └─[a-z] : 4+1 : "b"

└─E[1] <- arith:(E[1] op:('*' / '/') E[2]) / E[2] : 5+2 : "-c"
  └─E[2] <- arith:(op:'-' (E[2] / E[3])) / E[3] : 5+2 : "-c"
    └─arith:(op:'-' (E[2] / E[3])) : 5+2 : "-c"
      ├─op:'-' : 5+1 : "-"
      └─E[2] / E[3] : 6+1 : "c"
        └─E[2] <- arith:(op:'-' (E[2] / E[3])) / E[3] : 6+1 : "c"
          └─E[3] <- (num:[0-9]+ / sym:[a-z]+) / E[4] : 6+1 : "c"
            └─num:[0-9]+ / sym:[a-z]+ : 6+1 : "c"
              └─sym:[a-z]+ : 6+1 : "c"
                └─[a-z] : 6+1 : "c"

====================================

No matches for E[1] op:('*' / '/') E[2]

====================================

Matches for E[2] <- arith:(op:'-' (E[2] / E[3])) / E[3] :
└─E[2] <- arith:(op:'-' (E[2] / E[3])) / E[3] : 0+1 : "x"
  └─E[3] <- (num:[0-9]+ / sym:[a-z]+) / E[4] : 0+1 : "x"
    └─num:[0-9]+ / sym:[a-z]+ : 0+1 : "x"
      └─sym:[a-z]+ : 0+1 : "x"
        └─[a-z] : 0+1 : "x"

└─E[2] <- arith:(op:'-' (E[2] / E[3])) / E[3] : 2+1 : "a"
  └─E[3] <- (num:[0-9]+ / sym:[a-z]+) / E[4] : 2+1 : "a"
    └─num:[0-9]+ / sym:[a-z]+ : 2+1 : "a"
      └─sym:[a-z]+ : 2+1 : "a"
        └─[a-z] : 2+1 : "a"

└─E[2] <- arith:(op:'-' (E[2] / E[3])) / E[3] : 4+1 : "b"
  └─E[3] <- (num:[0-9]+ / sym:[a-z]+) / E[4] : 4+1 : "b"
    └─num:[0-9]+ / sym:[a-z]+ : 4+1 : "b"
      └─sym:[a-z]+ : 4+1 : "b"
        └─[a-z] : 4+1 : "b"

└─E[2] <- arith:(op:'-' (E[2] / E[3])) / E[3] : 5+2 : "-c"
  └─arith:(op:'-' (E[2] / E[3])) : 5+2 : "-c"
    ├─op:'-' : 5+1 : "-"
    └─E[2] / E[3] : 6+1 : "c"
      └─E[2] <- arith:(op:'-' (E[2] / E[3])) / E[3] : 6+1 : "c"
        └─E[3] <- (num:[0-9]+ / sym:[a-z]+) / E[4] : 6+1 : "c"
          └─num:[0-9]+ / sym:[a-z]+ : 6+1 : "c"
            └─sym:[a-z]+ : 6+1 : "c"
              └─[a-z] : 6+1 : "c"

====================================

Matches for op:'-' (E[2] / E[3]) :
└─op:'-' (E[2] / E[3]) : 5+2 : "-c"
  ├─op:'-' : 5+1 : "-"
  └─E[2] / E[3] : 6+1 : "c"
    └─E[2] <- arith:(op:'-' (E[2] / E[3])) / E[3] : 6+1 : "c"
      └─E[3] <- (num:[0-9]+ / sym:[a-z]+) / E[4] : 6+1 : "c"
        └─num:[0-9]+ / sym:[a-z]+ : 6+1 : "c"
          └─sym:[a-z]+ : 6+1 : "c"
            └─[a-z] : 6+1 : "c"

====================================

Matches for E[2] / E[3] :
└─E[2] / E[3] : 0+1 : "x"
  └─E[2] <- arith:(op:'-' (E[2] / E[3])) / E[3] : 0+1 : "x"
    └─E[3] <- (num:[0-9]+ / sym:[a-z]+) / E[4] : 0+1 : "x"
      └─num:[0-9]+ / sym:[a-z]+ : 0+1 : "x"
        └─sym:[a-z]+ : 0+1 : "x"
          └─[a-z] : 0+1 : "x"

└─E[2] / E[3] : 2+1 : "a"
  └─E[2] <- arith:(op:'-' (E[2] / E[3])) / E[3] : 2+1 : "a"
    └─E[3] <- (num:[0-9]+ / sym:[a-z]+) / E[4] : 2+1 : "a"
      └─num:[0-9]+ / sym:[a-z]+ : 2+1 : "a"
        └─sym:[a-z]+ : 2+1 : "a"
          └─[a-z] : 2+1 : "a"

└─E[2] / E[3] : 4+1 : "b"
  └─E[2] <- arith:(op:'-' (E[2] / E[3])) / E[3] : 4+1 : "b"
    └─E[3] <- (num:[0-9]+ / sym:[a-z]+) / E[4] : 4+1 : "b"
      └─num:[0-9]+ / sym:[a-z]+ : 4+1 : "b"
        └─sym:[a-z]+ : 4+1 : "b"
          └─[a-z] : 4+1 : "b"

└─E[2] / E[3] : 5+2 : "-c"
  └─E[2] <- arith:(op:'-' (E[2] / E[3])) / E[3] : 5+2 : "-c"
    └─arith:(op:'-' (E[2] / E[3])) : 5+2 : "-c"
      ├─op:'-' : 5+1 : "-"
      └─E[2] / E[3] : 6+1 : "c"
        └─E[2] <- arith:(op:'-' (E[2] / E[3])) / E[3] : 6+1 : "c"
          └─E[3] <- (num:[0-9]+ / sym:[a-z]+) / E[4] : 6+1 : "c"
            └─num:[0-9]+ / sym:[a-z]+ : 6+1 : "c"
              └─sym:[a-z]+ : 6+1 : "c"
                └─[a-z] : 6+1 : "c"

====================================

Matches for E[3] <- (num:[0-9]+ / sym:[a-z]+) / E[4] :
└─E[3] <- (num:[0-9]+ / sym:[a-z]+) / E[4] : 0+1 : "x"
  └─num:[0-9]+ / sym:[a-z]+ : 0+1 : "x"
    └─sym:[a-z]+ : 0+1 : "x"
      └─[a-z] : 0+1 : "x"

└─E[3] <- (num:[0-9]+ / sym:[a-z]+) / E[4] : 2+1 : "a"
  └─num:[0-9]+ / sym:[a-z]+ : 2+1 : "a"
    └─sym:[a-z]+ : 2+1 : "a"
      └─[a-z] : 2+1 : "a"

└─E[3] <- (num:[0-9]+ / sym:[a-z]+) / E[4] : 4+1 : "b"
  └─num:[0-9]+ / sym:[a-z]+ : 4+1 : "b"
    └─sym:[a-z]+ : 4+1 : "b"
      └─[a-z] : 4+1 : "b"

└─E[3] <- (num:[0-9]+ / sym:[a-z]+) / E[4] : 6+1 : "c"
  └─num:[0-9]+ / sym:[a-z]+ : 6+1 : "c"
    └─sym:[a-z]+ : 6+1 : "c"
      └─[a-z] : 6+1 : "c"

====================================

No matches for E[4] <- '(' E[0] ')'

====================================

Matches for num:[0-9]+ / sym:[a-z]+ :
└─num:[0-9]+ / sym:[a-z]+ : 0+1 : "x"
  └─sym:[a-z]+ : 0+1 : "x"
    └─[a-z] : 0+1 : "x"

└─num:[0-9]+ / sym:[a-z]+ : 2+1 : "a"
  └─sym:[a-z]+ : 2+1 : "a"
    └─[a-z] : 2+1 : "a"

└─num:[0-9]+ / sym:[a-z]+ : 4+1 : "b"
  └─sym:[a-z]+ : 4+1 : "b"
    └─[a-z] : 4+1 : "b"

└─num:[0-9]+ / sym:[a-z]+ : 6+1 : "c"
  └─sym:[a-z]+ : 6+1 : "c"
    └─[a-z] : 6+1 : "c"

====================================

No matches for [0-9]+

====================================

No matches for '*' / '/'

====================================

Matches for '+' / '-' :
└─'+' / '-' : 3+1 : "+"
  └─'+' : 3+1 : "+"

└─'+' / '-' : 5+1 : "-"
  └─'-' : 5+1 : "-"

====================================

Matches for [a-z]+ :
└─[a-z]+ : 0+1 : "x"
  └─[a-z] : 0+1 : "x"

└─[a-z]+ : 2+1 : "a"
  └─[a-z] : 2+1 : "a"

└─[a-z]+ : 4+1 : "b"
  └─[a-z] : 4+1 : "b"

└─[a-z]+ : 6+1 : "c"
  └─[a-z] : 6+1 : "c"

====================================

Matches for ';' :
└─';' : 7+1 : ";"

====================================

No matches for ')'

====================================

No matches for '('

====================================

No matches for [0-9]

====================================

No matches for '/'

====================================

No matches for '*'

====================================

Matches for '-' :
└─'-' : 5+1 : "-"

====================================

Matches for '+' :
└─'+' : 3+1 : "+"

====================================

Matches for '=' :
└─'=' : 1+1 : "="

====================================

Matches for [a-z] :
└─[a-z] : 0+1 : "x"

└─[a-z] : 2+1 : "a"

└─[a-z] : 4+1 : "b"

└─[a-z] : 6+1 : "c"

====================================

AST for rule "Program":

└─Program : 0+8 : "x=a+b-c;"
  ├─var : 0+1 : "x"
  └─arith : 2+5 : "a+b-c"
    ├─arith : 2+3 : "a+b"
    │ ├─sym : 2+1 : "a"
    │ ├─op : 3+1 : "+"
    │ └─sym : 4+1 : "b"
    ├─op : 5+1 : "-"
    └─sym : 6+1 : "c"

Num match objects created: 55
Num match objects memoized:  55

lukehutch commented 3 years ago

And just for fun, let's reverse the associativity from left-associative to right-associative:

                + "E[1,R] <- arith:(E op:('*' / '/') E);\n" //
                + "E[0,R] <- arith:(E op:('+' / '-') E);");

Result (trimmed):

Match tree for rule Program:
                                                       ┌───────────────┐  
                           25 : Program <- Statement+  │x = a + b - c ;│  
                                                       ├───────────────┤  
               24 : Statement <- var:[a-z]+ '=' E ';'  │x = a + b - c ;│  
                                                       │   ┌─────────┐ │  
 23 : E[0] <- arith:(E[1] op:('+' / '-') E[0]) / E[1]  │   │a + b - c│ │  
                                                       │   ├─────────┤ │  
                        22 : E[1] op:('+' / '-') E[0]  │   │a + b - c│ │  
                                                       │   │   ┌─────┤ │  
 23 : E[0] <- arith:(E[1] op:('+' / '-') E[0]) / E[1]  │   │   │b - c│ │  
                                                       │   │   ├─────┤ │  
                        22 : E[1] op:('+' / '-') E[0]  │   │   │b - c│ │  
                                                       │   │   │   ┌─┤ │  
 23 : E[0] <- arith:(E[1] op:('+' / '-') E[0]) / E[1]  │   │   │   │c│ │  
                                                       │   │ ┌─┤ ┌─┤ │ │  
                                       21 : '+' / '-'  │   │ │+│ │-│ │ │  
                                                       │   ├─┤ ├─┤ ├─┤ │  
 20 : E[1] <- arith:(E[2] op:('*' / '/') E[1]) / E[2]  │   │a│ │b│ │c│ │  
                                                       │   ├─┤ ├─┤ ├─┤ │  
     17 : E[2] <- arith:(op:'-' (E[2] / E[3])) / E[3]  │   │a│ │b│ │c│ │  
                                                       │   ├─┤ ├─┤ ├─┤ │  
        14 : E[3] <- (num:[0-9]+ / sym:[a-z]+) / E[4]  │   │a│ │b│ │c│ │  
                                                       │   ├─┤ ├─┤ ├─┤ │  
                         12 : num:[0-9]+ / sym:[a-z]+  │   │a│ │b│ │c│ │  
                                                       ├─┐ ├─┤ ├─┤ ├─┤ │  
                                          10 : [a-z]+  │x│ │a│ │b│ │c│ │  
                                                       │ │ │ │ │ │ │ ├─┤  
                                   9 : [terminal] ';'  │ │ │ │ │ │ │ │;│  
                                                       │ │ │ ├─┤ │ │ │ │  
                                   8 : [terminal] '+'  │ │ │ │+│ │ │ │ │  
                                                       │ │ │ │ │ ├─┤ │ │  
                                   2 : [terminal] '-'  │ │ │ │ │ │-│ │ │  
                                                       │ ├─┤ │ │ │ │ │ │  
                                   1 : [terminal] '='  │ │=│ │ │ │ │ │ │  
                                                       ├─┤ ├─┤ ├─┤ ├─┤ │  
                                 0 : [terminal] [a-z]  │x│ │a│ │b│ │c│ │  
                                                        0 1 2 3 4 5 6 7 
                                                        x = a + b - c ; 

AST for rule "Program":

└─Program : 0+8 : "x=a+b-c;"
  ├─var : 0+1 : "x"
  └─arith : 2+5 : "a+b-c"
    ├─sym : 2+1 : "a"
    ├─op : 3+1 : "+"
    └─arith : 4+3 : "b-c"
      ├─sym : 4+1 : "b"
      ├─op : 5+1 : "-"
      └─sym : 6+1 : "c"

maxfischer2781 commented 3 years ago

Sorry for the lack of reproducible example, had to brush up my Java skills first...

An MRE for the problem is this program using the reference implementation:

import pikaparser.grammar.MetaGrammar;
import pikaparser.memotable.MemoTable;
import pikaparser.parser.utils.ParserInfo;

public class MainClass {
    public static void main(String[] args) {
        final var grammar = MetaGrammar.parse("Program <- Statement+;\n" //
                + "Statement <- var:[a-z]+ '=' Sum ';';\n" //
                + "Sum <- add:(Sum '+' Term) / sub:(Sum '-' Term) / term:Term;\n" //
                + "Term <- num:[0-9]+ / sym:[a-z]+;\n");
        final var memoTable = grammar.parse("x=a+b-c;");
        ParserInfo.printParseResult("Program", memoTable, new String[] {"Statement"}, false);
    }
}

That's Python's PEG rule for +/-, so known to work for a left-recursive Packrat. I've only translated from PEP-617 PEG syntax to standard PEG syntax and used add: names instead of {_Py_BinOp(..., ADD, ...)} actions. (I haven't been able to rewrite it into the automatic precedence/associativity form.)

Pika does not parse a+b-c, i.e. when lower choice index term comes first. It succeeds to parse a-b+c (lower choice index last), as well as a+b+c and a-b-c (same choice index).


Clauses:
 13 : Program <- Statement+
 12 : Statement <- var:[a-z]+ '=' Sum ';'
 11 : Sum <- add:(Sum '+' Term) / sub:(Sum '-' Term) / term:Term
 10 : Sum '-' Term
  9 : Sum '+' Term
  8 : Term <- num:[0-9]+ / sym:[a-z]+
  7 : [0-9]+
  6 : [a-z]+
  5 : ';'
  4 : '-'
  3 : [0-9]
  2 : '+'
  1 : '='
  0 : [a-z]

Memo Table:
 13 : Program <- Statement+                                       ---------
 12 : Statement <- var:[a-z]+ '=' Sum ';'                         ---------
 11 : Sum <- add:(Sum '+' Term) / sub:(Sum '-' Term) / term:Term  #-#==-#--
 10 : Sum '-' Term                                                --#====--
  9 : Sum '+' Term                                                --#==----
  8 : Term <- num:[0-9]+ / sym:[a-z]+                             #-#-#-#--
  7 : [0-9]+                                                      ---------
  6 : [a-z]+                                                      #-#-#-#--
  5 : [terminal] ';'                                              -------#-
  4 : [terminal] '-'                                              -----#---
  3 : [terminal] [0-9]                                            ---------
  2 : [terminal] '+'                                              ---#-----
  1 : [terminal] '='                                              -#-------
  0 : [terminal] [a-z]                                            #-#-#-#--
                                                                  01234567
                                                                  x=a+b-c;

Match tree for rule Program:
                                                                   ░░░┌─────┐░░░░░
 11 : Sum <- add:(Sum '+' Term) / sub:(Sum '-' Term) / term:Term   ░░░│a + b│░░░░░
                                                                   ░░░├─────┤░░░░░
                                                9 : Sum '+' Term   ░░░│a + b│░░░░░
                                                                  ┌─┐░├─┐   │░┌─┐░
 11 : Sum <- add:(Sum '+' Term) / sub:(Sum '-' Term) / term:Term  │x│░│a│   │░│c│░
                                                                  ├─┤░├─┤ ┌─┤░├─┤░
                             8 : Term <- num:[0-9]+ / sym:[a-z]+  │x│░│a│ │b│░│c│░
                                                                  ├─┤░├─┤ ├─┤░├─┤░
                                                      6 : [a-z]+  │x│░│a│ │b│░│c│░
                                                                  │ │░│ │ │ │░│ ├─┐
                                              5 : [terminal] ';'  │ │░│ │ │ │░│ │;│
                                                                  │ │░│ │ │ ├─┤ │ │
                                              4 : [terminal] '-'  │ │░│ │ │ │-│ │ │
                                                                  │ │░│ ├─┤ │ │ │ │
                                              2 : [terminal] '+'  │ │░│ │+│ │ │ │ │
                                                                  │ ├─┤ │ │ │ │ │ │
                                              1 : [terminal] '='  │ │=│ │ │ │ │ │ │
                                                                  ├─┤ ├─┤ ├─┤ ├─┤ │
                                            0 : [terminal] [a-z]  │x│ │a│ │b│ │c│ │
                                                                   0 1 2 3 4 5 6 7
                                                                   x = a + b - c ;

====================================

No matches for Program <- Statement+

====================================

No matches for Statement <- var:[a-z]+ '=' Sum ';'

====================================

Matches for Sum <- add:(Sum '+' Term) / sub:(Sum '-' Term) / term:Term :
└─Sum <- add:(Sum '+' Term) / sub:(Sum '-' Term) / term:Term : 0+1 : "x"
  └─Term <- term:(num:[0-9]+ / sym:[a-z]+) : 0+1 : "x"
    └─sym:[a-z]+ : 0+1 : "x"
      └─[a-z] : 0+1 : "x"

└─Sum <- add:(Sum '+' Term) / sub:(Sum '-' Term) / term:Term : 2+3 : "a+b"
  └─add:(Sum '+' Term) : 2+3 : "a+b"
    ├─Sum <- add:(Sum '+' Term) / sub:(Sum '-' Term) / term:Term : 2+1 : "a"
    │ └─Term <- term:(num:[0-9]+ / sym:[a-z]+) : 2+1 : "a"
    │   └─sym:[a-z]+ : 2+1 : "a"
    │     └─[a-z] : 2+1 : "a"
    ├─'+' : 3+1 : "+"
    └─Term <- num:[0-9]+ / sym:[a-z]+ : 4+1 : "b"
      └─sym:[a-z]+ : 4+1 : "b"
        └─[a-z] : 4+1 : "b"

====================================

Matches for Sum '-' Term :
└─Sum '-' Term : 2+5 : "a+b-c"
  ├─Sum <- add:(Sum '+' Term) / sub:(Sum '-' Term) / term:Term : 2+3 : "a+b"
  │ └─add:(Sum '+' Term) : 2+3 : "a+b"
  │   ├─Sum <- add:(Sum '+' Term) / sub:(Sum '-' Term) / term:Term : 2+1 : "a"
  │   │ └─Term <- term:(num:[0-9]+ / sym:[a-z]+) : 2+1 : "a"
  │   │   └─sym:[a-z]+ : 2+1 : "a"
  │   │     └─[a-z] : 2+1 : "a"
  │   ├─'+' : 3+1 : "+"
  │   └─Term <- num:[0-9]+ / sym:[a-z]+ : 4+1 : "b"
  │     └─sym:[a-z]+ : 4+1 : "b"
  │       └─[a-z] : 4+1 : "b"
  ├─'-' : 5+1 : "-"
  └─Term <- num:[0-9]+ / sym:[a-z]+ : 6+1 : "c"
    └─sym:[a-z]+ : 6+1 : "c"
      └─[a-z] : 6+1 : "c"

====================================

Matches for Sum '+' Term :
└─Sum '+' Term : 2+3 : "a+b"
  ├─Sum <- add:(Sum '+' Term) / sub:(Sum '-' Term) / term:Term : 2+1 : "a"
  │ └─Term <- term:(num:[0-9]+ / sym:[a-z]+) : 2+1 : "a"
  │   └─sym:[a-z]+ : 2+1 : "a"
  │     └─[a-z] : 2+1 : "a"
  ├─'+' : 3+1 : "+"
  └─Term <- num:[0-9]+ / sym:[a-z]+ : 4+1 : "b"
    └─sym:[a-z]+ : 4+1 : "b"
      └─[a-z] : 4+1 : "b"

====================================

Matches for Term <- num:[0-9]+ / sym:[a-z]+ :
└─Term <- num:[0-9]+ / sym:[a-z]+ : 0+1 : "x"
  └─sym:[a-z]+ : 0+1 : "x"
    └─[a-z] : 0+1 : "x"

└─Term <- num:[0-9]+ / sym:[a-z]+ : 2+1 : "a"
  └─sym:[a-z]+ : 2+1 : "a"
    └─[a-z] : 2+1 : "a"

└─Term <- num:[0-9]+ / sym:[a-z]+ : 4+1 : "b"
  └─sym:[a-z]+ : 4+1 : "b"
    └─[a-z] : 4+1 : "b"

└─Term <- num:[0-9]+ / sym:[a-z]+ : 6+1 : "c"
  └─sym:[a-z]+ : 6+1 : "c"
    └─[a-z] : 6+1 : "c"

====================================

No matches for [0-9]+

====================================

Matches for [a-z]+ :
└─[a-z]+ : 0+1 : "x"
  └─[a-z] : 0+1 : "x"

└─[a-z]+ : 2+1 : "a"
  └─[a-z] : 2+1 : "a"

└─[a-z]+ : 4+1 : "b"
  └─[a-z] : 4+1 : "b"

└─[a-z]+ : 6+1 : "c"
  └─[a-z] : 6+1 : "c"

====================================

Matches for ';' :
└─';' : 7+1 : ";"

====================================

Matches for '-' :
└─'-' : 5+1 : "-"

====================================

No matches for [0-9]

====================================

Matches for '+' :
└─'+' : 3+1 : "+"

====================================

Matches for '=' :
└─'=' : 1+1 : "="

====================================

Matches for [a-z] :
└─[a-z] : 0+1 : "x"

└─[a-z] : 2+1 : "a"

└─[a-z] : 4+1 : "b"

└─[a-z] : 6+1 : "c"

====================================

AST for rule "Program":

SYNTAX ERRORS:

0+8 : x=a+b-c;

Num match objects created: 26
Num match objects memoized:  25

lukehutch commented 3 years ago

Thanks for the reproducer. I see the same result as you.

Weirdly though, the correct expression is present in the memo table!

Matches for Sum '-' Term :
└─Sum '-' Term : 2+5 : "a+b-c"
  ├─Sum <- add:(Sum '+' Term) / sub:(Sum '-' Term) / term:Term : 2+3 : "a+b"
  │ └─add:(Sum '+' Term) : 2+3 : "a+b"
  │   ├─Sum <- add:(Sum '+' Term) / sub:(Sum '-' Term) / term:Term : 2+1 : "a"
  │   │ └─Term <- term:(num:[0-9]+ / sym:[a-z]+) : 2+1 : "a"
  │   │   └─sym:[a-z]+ : 2+1 : "a"
  │   │     └─[a-z] : 2+1 : "a"
  │   ├─'+' : 3+1 : "+"
  │   └─Term <- num:[0-9]+ / sym:[a-z]+ : 4+1 : "b"
  │     └─sym:[a-z]+ : 4+1 : "b"
  │       └─[a-z] : 4+1 : "b"
  ├─'-' : 5+1 : "-"
  └─Term <- num:[0-9]+ / sym:[a-z]+ : 6+1 : "c"
    └─sym:[a-z]+ : 6+1 : "c"
      └─[a-z] : 6+1 : "c"

So something is going wrong at the last step. Maybe you already thought it through, but I'm going to have to think about what's going on here...

Ironically I just invented a new parsing algorithm over the last few days that is dramatically simpler than the pika parser, and faster, and works top-down, but also handles left recursion properly (or should do...). I have a basic prototype working, but haven't tested it much yet, and haven't written it up yet. I'll test this example with it though to make sure it works.

lukehutch commented 3 years ago

Also I noticed that a-b+c parses correctly... very weird...

lukehutch commented 3 years ago

OK, I understand what's going on here... this is probably what you already explained, but:

The function used to only update memo table entries if they improve on the previous match for a given clause at a given input position is too simplistic. Currently it is (from the class Match):

    /**
     * Compare this {@link Match} to another {@link Match} of the same {@link Clause} type and start position.
     * 
     * @return true if this {@link Match} is a better match than the other {@link Match}.
     */
    public boolean isBetterThan(Match other) {
        if (other == this) {
            return false;
        }
        // An earlier subclause match in a First clause is better than a later subclause match
        // A longer match (i.e. a match that spans more characters in the input) is better than a shorter match
        return (memoKey.clause instanceof First // 
                && this.firstMatchingSubClauseIdx < other.firstMatchingSubClauseIdx) //
                || this.len > other.len;
    }

In this case, Sum '+' Term has firstMatchingSubClauseIdx = 0, and Sum '-' Term has firstMatchingSubClauseIdx = 1. So even though

└─Sum '-' Term : 2+5 : "a+b-c"
  ├─Sum <- add:(Sum '+' Term) / sub:(Sum '-' Term) / term:Term : 2+3 : "a+b"
  │ └─add:(Sum '+' Term) : 2+3 : "a+b"
  │   ├─Sum <- add:(Sum '+' Term) / sub:(Sum '-' Term) / term:Term : 2+1 : "a"
  │   │ └─Term <- term:(num:[0-9]+ / sym:[a-z]+) : 2+1 : "a"
  │   │   └─sym:[a-z]+ : 2+1 : "a"
  │   │     └─[a-z] : 2+1 : "a"
  │   ├─'+' : 3+1 : "+"
  │   └─Term <- num:[0-9]+ / sym:[a-z]+ : 4+1 : "b"
  │     └─sym:[a-z]+ : 4+1 : "b"
  │       └─[a-z] : 4+1 : "b"
  ├─'-' : 5+1 : "-"
  └─Term <- num:[0-9]+ / sym:[a-z]+ : 6+1 : "c"
    └─sym:[a-z]+ : 6+1 : "c"
      └─[a-z] : 6+1 : "c"

fully contains the subtree of

└─Sum '+' Term : 2+3 : "a+b"
  ├─Sum <- add:(Sum '+' Term) / sub:(Sum '-' Term) / term:Term : 2+1 : "a"
  │ └─Term <- term:(num:[0-9]+ / sym:[a-z]+) : 2+1 : "a"
  │   └─sym:[a-z]+ : 2+1 : "a"
  │     └─[a-z] : 2+1 : "a"
  ├─'+' : 3+1 : "+"
  └─Term <- num:[0-9]+ / sym:[a-z]+ : 4+1 : "b"
    └─sym:[a-z]+ : 4+1 : "b"
      └─[a-z] : 4+1 : "b"

once the parent clause (Sum '+' Term) / (Sum '-' Term) / Term is parsed, the shorter (Sum '+' Term) will always overwrite the longer and deeper tree (Sum '+' Term).

The solution here would be to add an earlier check in the thisMatch.isBetterThan(otherMatch) function that will return false if thisMatch is a sub-tree of otherMatch.

Searching otherMatch for the root node of thisMatch would make memoization take O(d^2) in the depth of the parse tree, however! So that's not a great solution.

Another solution might be to add a depth field to Match instances, which starts at 0 for leaves, and adds 1 as each level is added to the match tree, bottom-up. But at a node with more than one child, the depth would be the max of all the child depths, plus 1. Then in isBetterThan, a deeper tree is determined to be a better match than a shallower tree. This should work, since we're only talking about the depth from the leaf for a specific clause at a specific character position, so if the depth increases, then necessarily the parsing must have gone around an additional loop in the grammar.

At least I think that's the correct solution. What do you think?

lukehutch commented 3 years ago

Sorry for the step-by-step updates on my thoughts...

Actually now that I rethink this, I think that the pika parser may be doing exactly what it is supposed to do with this grammar, depending upon how you interpret left recursion...

Consider the rule Sum <- (Sum '+' Term) / (Sum '-' Term) / Term. Remember that PEG is defined as working left-to-right, consuming as much input as possible, greedily. So the steps that should be followed with left recursion handling that respects these semantics:

(Sum '+' Term) will recurse down to Sum, and hit an infinite loop, so the nested Sum match will initially fail to match, causing (Sum '+' Term) to fail. Same for (Sum '-' Term).
This leaves only the third option, Term to match for Sum at the deepest level.
Returning up one level of recursion from this bottom match, we now have Sum <- Term, so (Sum '+' Term) effectively matches (Term '+' Term). This consumes the input a+b.
The (Sum '-' Term) clause is never triggered, because the optional rule (First in the terminology I use in my paper) has already greedily consumed as much input as it can consume. So the -c part is never matched.

For the semantics that you describe the Python parser implementing, it has to override the left-to-right greedy semantics of PEG to favor maximal depth expansion of the parse tree over the strict parsing order. (Actually the new parsing algorithm I came up with over the last few days does this too, but not by design...)

I see the maximal expansion semantics of the Python parser as more useful, but the strict ordering of the pika parser as more correct. But maybe my assumptions about correctness are wrong.

Part of the problem is that the behavior under left recursion is incompletely specified, so you have to make some decisions about precisely what semantics you are going to implement to handle left recursion. Do you know of a mathematical justification that would rule that the maximum depth expansion semantics of the Python parser is somehow more correct than the strict semantics of the pika parser?

lukehutch commented 3 years ago

So, for reference, the maximum expansion semantics of the Python parser (and my new unreleased parser) is something close to the following, for max expression depth of 3:

Program <- Statement+;
Statement <- var:[a-z]+ '=' Sum3 ';';
Sum3 <- add:(Sum2 '+' Term) / sub:(Sum2 '-' Term) / term:Term;
Sum2 <- add:(Sum1 '+' Term) / sub:(Sum1 '-' Term) / term:Term;
Sum1 <- term:Term;
Term <- num:[0-9]+ / sym:[a-z]+;

... and this grammar parses inputs correctly, even with the pika parser.

(Note that Sum1 <- term:Term; is the "bottom clause" used to terminate recursion, by incorporating only the failover term, / term:Term that is present at the end of the original grammar rule, Sum <- add:(Sum '+' Term) / sub:(Sum '-' Term) / term:Term.)

But this is not the only way to interpret the left recursion pattern in your grammar. The other way is to only expand the grammar as far as needed to greedily satisfy the immediate option. In this case, given the combination of this particular grammar and input string, this becomes equivalent to:

Program <- Statement+;
Statement <- var:[a-z]+ '=' Sum3 ';';
Sum3 <- add:(Sum2Plus '+' Term) / sub:(Sum2Minus '-' Term) / term:Term;
Sum2Plus <- term:Term;
Sum2Minus <- add:(Sum1Minus '+' Term) / sub:(Sum1Minus '-' Term) / term:Term;
Sum1Minus <- term:Term;
Term <- num:[0-9]+ / sym:[a-z]+;

This is what the pika parser is doing, more or less. The above grammar can't parse a+b-c.

maxfischer2781 commented 3 years ago

Part of the problem is that the behavior under left recursion is incompletely specified

That's probably a key point here: I mistakenly assumed the Python PEG was the left-recursive PEG behaviour, but you are right that it's just a left-recursive PEG behaviour.

In a way I am fine to say "that's just according to Pika's rules" – what I like about your Pika algorithm is that it is a very straightforward application of actual rules, not just behaviour. As you mention, left-recursive top-down parsing sometimes tends to do useful things "not by design". ;) (FWIW, it took me a lot longer to figure out why the Python parser actually "worked" than why the Pika parser "failed".)

Now, as you say the "unreleased parser" behaviour certainly seems useful. So it would be nice to formalise it the way Pika is (could be?) formalised. I'll just use the terms "Pika" for the current Pika rules, and "unPEG" for your unreleased parser / Python's parser rules.

Casually speaking, I think what the difference boils down to is whether one sees parsing as a sequence or tree of operations. (Casually, as the former is probably a tree of sequences or somesuch and we're likely somewhere between the two. There be handwaving now.)
If we look at parsing as a sequence, then once we reach the end/start we are done. That matches the idea of Pika having a 2D match table, and likewise each First rule having a well-defined order.
If we look at parsing as a tree, then we can have separate parent layers of the same rule. That matches the idea of unPEG adding/expanding a single Sum to nested Sum1, Sum2, … each with their own start and end.

Let me go out on a limb and speculate that Pika already has a bit of the latter (there be less handwaving now). One can unify matching of "recursive sequence" and "recursive choice" into a single strategy.

If we have some well-behaved left-recursive expanding choice like this

A <- E:(A T:'a') / T:'a'

and match an input like aaa then we expect E to help expand the rule to match the entire input. With Pika, we achieve that by:

E recursing into A and thus itself, and
A storing a new match if is longer.

However, we can fully reformulate what Pika does as a tree:

E=(E=(T0 T1) T2)
   |          \
   E=(T0 T1)  'a'
   |      \
   T0      'a'
   |
   a'

This is actually what the algorithm does internally, with the subtrees existing both spuriously and as part of the final match. Now, the longest-match rule means Pika only compares the flattened trees. However, logically we could just as well say that Pika works via subtree containment: the match A->(T0 T1) is replaced by A->((T0 T1) T2) because the latter contains the former. Since matches span a range, a match containing another is automatically at least as long[*].

So my hunch is the "longest match" rule is actually just an optimisation/special case of "sub-tree match", namely for the case of recursive sequence. That could justify using "sub-tree match" also for recursive choices.

[*] This is still a weak spot. Pika requires strict longest match, with an earlier match winning a tie. Containment would require soft longest matches, with a later match winning a tie. So containment could always expand some match M to M ε. I'm not sure whether that is relevant, harmful, or perhaps even desirable.[**]

[**] Since a recursive rule R <- R ε / B (using some base rule B to get started) would expand infinitely, I have a hunch this is not desirable. At least not if the expansion is just by ε.

maxfischer2781 commented 3 years ago

(I might infodump here over the next few days as I'm working through understanding this. I'll try to keep each topic together, but let me know if it is too much rambling or too little fleshed-out.)

Do you know of a mathematical justification that would rule that the maximum depth expansion semantics of the Python parser is somehow more correct than the strict semantics of the pika parser?

Mathematically, it boils down the whether the binary operators Seq and First are distributive or only right-distributive over First.

In a purely right-recursive case, PEG, Pika and PythonPEG agree that the following are the same language:

RS <- (F0 RS) / (F1 RS)
RS <- (F0 / F1) RS

This means Seq is right-distributive over First.

What they disagree is the left-recursive case, in that PEG disallows, Pika distinguishes and PythonPEG equates the following:

LS <- (LS F0) / (LS F1)
LS <- LS (F0 / F1)

This means Seq isn't/is left-distributive over First for Pika/PythonPEG.

Taken at face value, this is just a choice of how the operations should behave – both are justifiable and neither is actually "correct". As a meatbag, having left- and right-behaviour equal is nice/preferable, but there are other cases where the two differ (being the entire point why left-recursion is hard).

I haven't dug much deeper whether there are some nice properties that could come from either choice. Neither Seq nor First are commutative, which excludes similarity to a lot of useful categories. The only other math'y relations seem to be that both Seq and First are associative ((a b) c = a (b c) / (a / b) / c = a / (b / c)) and Seq has the neutral element ε/"empty string" with a ε = ε a = a.

lukehutch commented 3 years ago

In a way I am fine to say "that's just according to Pika's rules" – what I like about your Pika algorithm is that it is a very straightforward application of actual rules, not just behaviour. As you mention, left-recursive top-down parsing sometimes tends to do useful things "not by design". ;)

Yes, although even simple rules can be wrong :-) (Or at least sub-optimal.) I do really want to get to the bottom of this to see if there's a way to improve the pika parser -- I don't want to claim that its way is the best way. And as you pointed out, the "unPEG" behavior is certainly more useful, which is a valuable thing.

re. your point on "soft containment": correct, the pika parser does require strictly longer matches -- that is the only way to guarantee the algorithm will terminate (because it has to make monotonic progress in each step). But this is not a problem in practice, because this rule is only applied when updating a memo entry, where the memo entries are keyed with the pair (clause, position). In other words, if a memo entry is being updated, and the "is the new match longer than the old match" test is being applied, the restriction being imposed is that while progressing through one complete cycle of the grammar, at least one character is consumed. Clearly if you can progress through a complete cycle and at least one character is not consumed, you have hit an infinite cycle. However, if a parent clause (of different clause type) has the same length as a child clause, it's no problem for the parent clause and the child clause to have the same length, because the clause types are different.

Re. expanding M to M ε: I think you would have to consider those two rules to be exactly equivalent in a strong sense -- in the same sort of sense that 0.999999... and 1.0 are actually mathematically exactly equal. A well-developed parser should just swallow ε terms silently if they are subclauses of a Seq clause -- among many other simplifications and rule patterns that should be disallowed. For example, packrat parsers are claimed to take linear time, but actually a rule R <- ε+ will cause a packrat parser to get stuck in an infinite loop! (In this case, the problem is that ε does not allow the parser to make monotonic progress, so it stays where it is, and tries again, up to an infinite number of times.) The pattern A <- X / Y / ε / Z can be simplified to A <- X / Y / ε, since ε always matches, so it can also be rewritten as A <- (X / Y)?, etc., and A <- X / Y / (Z?) / W can be simplified to A <- X / Y / (Z?), since Z? always matches (it is equivalent to Z / ε). There are all sorts of corner cases like this, and many of them revolve around matching the empty string.

Now given these observations about making monotonic progress and ε, re. your distinction between the left-recursive and right-recursive cases when discussing the distributive property: it's important to realize that recursion is only right-recursion if the parser consumed at least one character before recursing. Left recursion isn't necessarily just recursion down the leftmost clause of each rule! This is something I only realized when creating the pika parser, and without accounting for this, the parser doesn't work. You'll notice that the pika parser goes out of its way to figure out which clauses may consume zero characters while still matching. For example, the following grammar is actually left-recursive!

A <- (B? A) / 'a'
B <- 'b'

The reason is that if the input doesn't start with at least one b, then the parser doesn't consume at least one character before recursing from (clause = A, position = 0) right back to (clause = A, position = 0). If you consider the (clause, position) tuple to be the only actual state of the parser (all other values being returned by recursion, i.e. ignoring the memo table as state), then clearly in a pure functional implementation, where the only "state" that can be considered is the parameter values at entry to each function, then when the parser repeats any complete state, it is stuck in infinite recursion.

That's not particularly important for the issue of how to expand left recursion, but it is an important distinction to make when thinking about which grammars require handling of left recursion.

I think you hit the nail on the head with your observation that Seq isn't left-distributive over First for the pika parser.

This was borne out when I tried to implement the fix I suggested a few comments ago, of changing isBetterThan to check the depth to most distant leaf node before checking the first matching subclause index for First clauses, and before checking the length of match. The attempted fix didn't even work, because it's applied only when attempting to write a new memo to the memo table (the new match is compared to the old match) -- and by that point, the First clause in your grammar has already selected the first + subclause (the shorter match) in preference to the second - subclause (the longer match) -- i.e. the logic inside First itself doesn't look at depth to leaves (and changing the logic of First to prefer deeper subtrees to shallower subtrees, before finding the first matching subclause, would fundamentally change the semantics of the First operator, which is a no-no).

So I don't think this behavior is changeable. I tried hard to prove that right-to-left, bottom-up population of the dynamic programming table, as done for the pika parser, could be made exactly semantically equivalent to recursive descent parsing, but I think there's only one possible distributivity option for left recursion when working bottom-up. Maybe I'm wrong about that, but I don't see how to implement different behavior.

maxfischer2781 commented 3 years ago

Thanks for the extensive writeup – you've already cleared up a lot for me here. I'm currently trying to formulate a proper rule how First would be a consistent rule with left-recursive depth matching, but aren't done with that yet.

Though perhaps some food for thought re. changing First semantics being a no-no:
Consider that PEG is not left-recursive – or perhaps to put it in better (your ;)) words it must make monotonic progress before recursion.[*] So the PEG ordered choice is defined only for the simple case that we always consume at least one character. The important part is that PEG doesn't actually bother with the idea of "multiple matches at the same position" – whatever Pika or another left-recursive PEG does in the left-recursive case cannot conflict with standard PEG (though of course nasal demons would be a bit farfetched).
What PEG defines for ordered choice is that it matches the first matched branch in top-down order. One should ponder a) in what order a left-recursive top-down would "see" an ordered choice b) whether revisiting an ordered choice makes it "the same". Since the original PEG is defined as a "matching machine", it has no concept of "rule identity" – so treating a re-visited rule as identical to its previous iteration does not backed by PEG.[**]

More generally, it might be useful to disentangle that "multiple matches at the same position" conflates two things:

Left-recursion means a new match depends on a previous match. The old match is not discarded, it is integrated into the new match. The new match requires the old match.
Bottom-up (Pika) means a new match supersedes a previous match. The old match is discarded entirely. The new match does not require the old match; the old match may or may not exist, even if it would have matched locally.

Notably, the latter is more of an (important) implementation detail of Pika; it exists to "recover" the top-down behaviour from a bottom-up behaviour. The "lower First ClauseIdx" rule seems to originate from this category of recovering standard PEG behaviour, not necessarily of defining recursive PEG behaviour.

[*] The implication of "left-recursion" being a dual of "right-recursion" has tripped me up several times. As you mention, "left" and "right" aren't actually important for the distinction – it's about stationary recursion versus progressing recursion.

[**] This by itself isn't useful to say what is "correct" though – PEG being oblivious to re-visiting a rule is one reason why it's not left-recursive. It only says that one can fill in that blank, not how one should do so.

lukehutch commented 3 years ago

[*] The implication of "left-recursion" being a dual of "right-recursion" has tripped me up several times. As you mention, "left" and "right" aren't actually important for the distinction – it's about stationary recursion versus progressing recursion.

Yes! You put this better than I could have. There is actually no such thing as "right recursion" -- there is only "left recursion" (recursion that will enter an infinite loop if a loop is traversed in the grammar, i.e. recursion where no forward progress is made) and "everything else", i.e. "non-left-recursion" (recursion where an infinite loop is impossible, because the parser moved forward at least one character).

(I used to think in terms of left and right recursion too, and had the same yearning for left/right symmetry that you have... but actually this conversation is helping me to see that it is probably worth writing a paper on this one point alone, because it's a fundamentally interesting and very deep insight that I think is simply not known... and I never would have realized this if I hadn't written the pika parser, which deeply depends upon this distinction of forward progress for the parser to even work.)

Here's another absolute paradigm shift that I had that actually led to the development of the pika parser: in actuality the following is not true:

PEG defines for ordered choice is that it matches the first matched branch in top-down order.

The truth is that matching always happens in bottom-up order, since whether or not a given clause C matches is only determined once all requisite subclauses of C have matched (depending on the clause type of C). Restated, it is right before the exit of a recursive frame the clause C is determined to match or not match, depending on whether or not the requisite subclause(s) of C matched. Therefore, matching is actually a bottom-up process!

The following is also not quite true:

Bottom-up (Pika) means a new match supersedes a previous match. The old match is discarded entirely. The new match does not require the old match; the old match may or may not exist, even if it would have matched locally.

Remember that memo table entries are keyed by the "parsing state" of (clause, position) (given the model of "parsing state" that I gave in my previous comment, where if a state is repeated, the parser must have entered a cycle) -- so an old match in a given memo table entry is only overwritten by a better match if that means that a better match is available for the same parsing state. Since the pika parsing algorithm requires monotonic progress to overwrite a match with a new match[], when a memo entry is overwritten, this generally means that one complete loop around a cycle in the grammar has been achieved, which means that the old match is actually a subtree of the new match*. So the old match is not discarded, it is actually implicitly incorporated into the new match as a subtree.

[] There is a caveat: the First clause types prioritize lower subclause match indices over higher subclause match indices, ignoring the length of match if the index of the first matching subclause decreased. So monotonic progress does not have to be made in length for First clause types, but rather monotonic progress must be made towards earlier and earlier subclauses matching. Obviously this will also cause the parsing loop to terminate at some point, since there are a finite and fixed number of subclauses of a First clause. But if an earlier subclause of a First clause matches in the new match compared to the old match in the memo table, that must* mean that a descendant clause of the subclause now achieves a longer match, as a result of looping around a cycle in the grammar while some clause in the cycle consumed at least one character, causing the descendant clause's memo table match to be monotonically updated. So at least a descendant clause must incorporate its old match as a subtree of the new match before an ancestral First clause can match an earlier subclause than its own match in the memo table. So First can discard an old match in the memo table for a new match, but only if a descendant clause doesn't discard but rather implicitly reuses (by incorporation as a subtree) its older memo table match. I hope that's not too confusing!

lukehutch commented 3 years ago

Good news -- the new top-down parser (the "squirrel parser") works. I have a full implementation here. I will be writing this up as a paper, so please don't steal the idea ( :-) ), but I thought you'd like to take a look at it!

https://github.com/lukehutch/squirrelparser

You can run an example here:

https://github.com/lukehutch/squirrelparser/blob/main/src/main/java/squirrelparser/main/Main.java

The main parsing algorithm, which enables handling of left recursion, is in Rule.match:

https://github.com/lukehutch/squirrelparser/blob/main/src/main/java/squirrelparser/grammar/Rule.java

The key ideas are:

Memo lookup should always be disabled when no forward progress has been made, i.e. the memo table is not consulted when trying to match a subclause, if the clause and its subclause have the same start position. (However, every time a match attempt is made, whether or not memo lookup is disabled, the resulting match is still written to the memo table (to avoid duplicating work in future, in order to retains time-linearity.)
- The purpose of this is to ensure that earlier/lower/shorter matches don't prevent the recursive matching of later/higher/longer matches in the same position.
- There is one exception: when a cycle is closed (or in other words, when the same (clause, position) parsing state is reached twice in the current path through the grammar), then memo lookup is enabled for the nested instance of (clause, position) -- and if there is no memo table entry for (clause, position), then NO_MATCH is returned (this is the bottom or "innermost" instance of (clause, position)).
Then, if (clause, position) is reached twice in the current path through the grammar, the higher (outermost) of the two (clause,position) instances is marked for looping. Once recursion exits back up to that point, the parser starts looping, repeatedly trying to match the same clause at the same position, until matches no longer improve (which works pretty much the same as isBetterThan in the pika parser). This grows the left recursive part of the parse tree downwards from the current point, expanding one cycle around the grammar for each pass around the loop, and incorporating the parse tree fragment (stored as a match in the memo table) from the previous pass around the grammar cycle as a subtree of the current pass around the grammar cycle.

Hopefully that makes sense... it's a lot to wrap one's mind around, but in my testing, it matches both left-associative (left-recursive) and right-associative (non-left-recursive) rules.

Warth et al. (2008) starts with the same basic idea of growing left-recursive seeds downwards, but ends up significantly more complicated than the squirrel parser: http://web.cs.ucla.edu/~todd/research/pepm08.pdf

Here's an example of the squirrel parser in action:

Left-associative grammar:

Program <- Statement+;
Statement <- var:[a-z]+ '=' E[0] ';';
E[4] <- '(' E[0] ')';
E[3] <- (num:[0-9]+ / sym:[a-z]+) / E[4];
E[2] <- arith:(op:'-' (E[2] / E[3])) / E[3];
E[1] <- arith:(E[1] op:('*' / '/') E[2]) / E[2];
E[0] <- arith:(E[0] op:('+' / '-') E[1]) / E[1];

AST:

└─<root> : 0+23 : "discriminant=b*b-4*a*c;"
  ├─var : 0+12 : "discriminant"
  └─arith : 13+9 : "b*b-4*a*c"
    ├─arith : 13+3 : "b*b"
    │ ├─sym : 13+1 : "b"
    │ ├─op : 14+1 : "*"
    │ └─sym : 15+1 : "b"
    ├─op : 16+1 : "-"
    └─arith : 17+5 : "4*a*c"
      ├─arith : 17+3 : "4*a"
      │ ├─num : 17+1 : "4"
      │ ├─op : 18+1 : "*"
      │ └─sym : 19+1 : "a"
      ├─op : 20+1 : "*"
      └─sym : 21+1 : "c"

Right-associative grammar:

Program <- Statement+;
Statement <- var:[a-z]+ '=' E[0] ';';
E[4] <- '(' E[0] ')';
E[3] <- (num:[0-9]+ / sym:[a-z]+) / E[4];
E[2] <- arith:(op:'-' (E[2] / E[3])) / E[3];
E[1] <- arith:(E[2] op:('*' / '/') E[1]) / E[2];
E[0] <- arith:(E[1] op:('+' / '-') E[0]) / E[1];

AST:

└─<root> : 0+23 : "discriminant=b*b-4*a*c;"
  ├─var : 0+12 : "discriminant"
  └─arith : 13+9 : "b*b-4*a*c"
    ├─arith : 13+3 : "b*b"
    │ ├─sym : 13+1 : "b"
    │ ├─op : 14+1 : "*"
    │ └─sym : 15+1 : "b"
    ├─op : 16+1 : "-"
    └─arith : 17+5 : "4*a*c"
      ├─num : 17+1 : "4"
      ├─op : 18+1 : "*"
      └─arith : 19+3 : "a*c"
        ├─sym : 19+1 : "a"
        ├─op : 20+1 : "*"
        └─sym : 21+1 : "c"

lukehutch commented 3 years ago

To close the loop on your original example, the new parser (the "squirrel parser") can parse your example correctly either way:

Code:

    private static void tryParsing(Grammar grammar, String input) {
        var parser = new Parser(grammar, input);
        var match = parser.parse();
        if (match == Match.NO_MATCH) {
            throw new IllegalArgumentException();
        }
        var ast = new ASTNode(match, parser.input);
        System.out.println(ast.toStringWholeTree());
    }

    public static void main(String[] args) {
        final var grammar = MetaGrammar.parse("Program <- Statement+;\n" //
                + "Statement <- var:[a-z]+ '=' Sum ';';\n" //
                + "Sum <- add:(Sum '+' Term) / sub:(Sum '-' Term) / term:Term;\n" //
                + "Term <- num:[0-9]+ / sym:[a-z]+;\n");
        tryParsing(grammar, "x=a+b-c;");
        tryParsing(grammar, "x=a-b+c;");
    }

Output:

└─<root> : 0+8 : "x=a+b-c;"
  ├─var : 0+1 : "x"
  └─sub : 2+5 : "a+b-c"
    ├─add : 2+3 : "a+b"
    │ ├─term : 2+1 : "a"
    │ │ └─sym : 2+1 : "a"
    │ └─sym : 4+1 : "b"
    └─sym : 6+1 : "c"

└─<root> : 0+8 : "x=a-b+c;"
  ├─var : 0+1 : "x"
  └─add : 2+5 : "a-b+c"
    ├─sub : 2+3 : "a-b"
    │ ├─term : 2+1 : "a"
    │ │ └─sym : 2+1 : "a"
    │ └─sym : 4+1 : "b"
    └─sym : 6+1 : "c"

lukehutch commented 3 years ago

Also FYI I measured the squirrel parser as 2-32x faster than the pika parser for the same input, depending on the grammar complexity (32x is for the Java 1.6 PEG grammar).

lukehutch commented 3 years ago

All that said, I'd still like to close the chapter on how to classify different types of left recursion handling, so if you have further thoughts on that, given the clarifications that I offered before introducing the squirrel parser, I would still value your perspectives. It's very possible that pika parser's left recursion handling is somehow suboptimal, but cannot be fixed due to working fully bottom-up.

lukehutch commented 3 years ago

Sorry, continuing my thoughts here on different ways to interpret left recursion...

Grammar (a) below is ambiguous, so it's particularly interesting for interpreting left recursion behavior of a parser. Grammar (b) is explicitly left-associative.

Grammar (a):

E <- sum:(E op:'+' E) / N;
N <- num:[0-9]+;

Given the input 0+1+2+3, both the squirrel parser and the pika parser produce this, which is a right-associative result by default! I was highly surprised about this. Thinking about it, it makes sense, but it's not what I expected.

└─<root> : 0+7 : "0+1+2+3"
  └─sum : 0+7 : "0+1+2+3"
    ├─num : 0+1 : "0"
    ├─op : 1+1 : "+"
    └─sum : 2+5 : "1+2+3"
      ├─num : 2+1 : "1"
      ├─op : 3+1 : "+"
      └─sum : 4+3 : "2+3"
        ├─num : 4+1 : "2"
        ├─op : 5+1 : "+"
        └─num : 6+1 : "3"

Grammar (b):

E <- sum:(E op:'+' N) / N;
N <- num:[0-9]+;

Both the squirrel parser and the pika parser produce this:

└─<root> : 0+7 : "0+1+2+3"
  └─sum : 0+7 : "0+1+2+3"
    ├─sum : 0+5 : "0+1+2"
    │ ├─sum : 0+3 : "0+1"
    │ │ ├─num : 0+1 : "0"
    │ │ ├─op : 1+1 : "+"
    │ │ └─num : 2+1 : "1"
    │ ├─op : 3+1 : "+"
    │ └─num : 4+1 : "2"
    ├─op : 5+1 : "+"
    └─num : 6+1 : "3"

I would be curious to hear how your parser (and the other Python parser you have referenced) behaves on these two examples.

maxfischer2781 commented 3 years ago

All that said, I'd still like to close the chapter on how to classify different types of left recursion handling

Same here, just need to set aside some time for it. Haven't forgotten about this and loving to see your thoughts. Will take a bit to put my thoughts into words.

lukehutch commented 3 years ago

More thoughts... take your time to digest and respond :-)

So I mentioned before that recursive descent parsing actually matches grammar clauses bottom-up, not top-down. The reason for that is that the PEG semantics define a postorder traversal of the grammar graph. (A clause only matches if all its requisite subclauses match.)

Let's define an "omniscient parser" to be a parser that knows exactly which subclauses will match, and how many characters each subclause will consume, before recursing to the subclauses. An omniscient parser also knows how deep each parse subtree should be (i.e. when to stop recursing), including for left-recursive paths through the grammar. An omniscient parser can build the parse tree using preorder traversal of the grammar. Therefore, an omniscient parser can build the parse tree top-down, rather than bottom-up. (In some sense this is a time-reversal of standard recursive descent parsing.)

Now as far as I can determine through thought experiments, only an omniscient parser could parse Grammar (a) from my previous comment in left-associative form. In other words, the parser would have to be both omniscient, and nondeterministic, in the sense that it would see all possible parses of a given clause before recursing to subclauses, then it could choose one of them.

For any non-omniscient parser, the only valid option for an ambiguous grammar like this is to expand it into right-associative form, since even when applying PEG rules greedily, top-down, you end up building the parse tree bottom-up (via postorder traversal), therefore the lefthand subclause match of (E ('+'/'-') E) is always going to consume only one E match at a time at the start position, whereas the righthand subclause will consume the rest of the expression. (The lefthand subclause consumes at least one character if it matches, so the righthand subclause is free to begin an entirely new recursive match process to the right of the start position, leading to right associativity.)

So my conclusion is that ambiguity cannot in general be resolved in left-recursive forms by a non-omniscient parser, at least without rewriting the grammar. (But maybe it's possible to change the behavior of the parser to simulate the grammar being rewritten into explicitly left-associative Grammar (b) form, or something.)

maxfischer2781 commented 3 years ago

I've only looked at the pieces of the squirrel responsible for selecting matches so far, so I might have missed something. But it seems to me the difference is here in Match.isBetterThan:

        // Compare first matching subclause index (for First)
        if (this.firstMatchingSubClauseIdx < other.firstMatchingSubClauseIdx) {
            return true;
        }

where a leading choice immediately replaces a trailing choice but a trailing choice can be better by a following condition. That's in itself not a shock, since Pika has practically the exact same rule (though less sophisticated).

Now, here is the catch: Squirrel goes through the choices in order whereas Pika goes through the choices in reverse order. So Squirrel asks "does this leading choice replace this later choice" whereas Pika asks "does this later choice replace this leading choice".

This means Squirrel will much more often (always?) select matches by subclause while Pika selects matches by clause index.

Say we have choice <- choice a / choice b / base, then

pika compares "is the clause index for choice a lower than choice b" -> yes, it's better.
squirrel compares "is the clause index for choice b lower than choice a " -> no, check the subclause matches.

So while the rules for replacing matches look the same, swapping the operands significantly changes their meaning.

maxfischer2781 commented 3 years ago

I did some tests with slightly modifying Pika, and it's remarkably robust against removing the this.firstMatchingSubClauseIdx < other.firstMatchingSubClauseIdx filter. Passed all my unittests, with one exception – the one that failed before and triggered this ticket. (FWIW, that also means I need more tests. 😅)

My hunch is that it boils down to a well-behaved grammar having (mostly? only?)

exclusive choices like "a" / "b" / "c" for which only one clause matches, or
ordered choices like "abc" / "ab" / "a" for which the longest clause matches.

In these cases, just the "longest match" is sufficient to select the correct subclause. In my quick tests, "longest or equal match" seems sufficient to get the desired behaviour of First (but it fails for other clauses, e.g. Seq expansion by ε).

Of course, a parser should also "support" broken grammars correctly. Plus, it doesn't actually suffice to completely get the desired behaviour of expanding left-recursion.
But it's interesting to see that fundamental ordering of First clauses is already provided by Pika out of the box, and perhaps does not need strict enforcement.

lukehutch commented 3 years ago

On your last comment: I arrived at a similar conclusion: that the longest match of a First subclause at a given starting point is almost always going to be the intended match of that subclause, i.e. I agree that it is remarkable that the correct ordering semantics should pretty much put the length of possible matches (across all reasonable inputs) in decreasing order of length.

In an earlier draft of the paper, in addition to First I actually added a non-PEG operator, Longest, which did exactly this -- it tried matching all subclauses, then returned the longest matching subclause. That way, if you couldn't figure out what order to put the subclauses of a First clause in to avoid spurious matches (which is a tricky situation that actually comes up quite often in PEG grammar design), then you can just be lazy and use Longest. I actually got pushback from a couple of people who read the paper -- they asked why the heck I was adding a semantically dubious operator to PEG, and changing the semantics of the universe of recognizable languages. I agreed that this operator muddied the waters -- in part because it was very difficult to define in practical terms what the operator did (since it depended on interactions between structures in the grammar and patterns in the input), so I took it out of the final draft of the paper.

There are cases where accepting the longest matching subclause is not going to be the right thing to do, in the same sense that a greedy algorithm only gives the optimal solution for a certain set of problems. In this case, the greedy choice of selecting longest subclause match (over the first matching subclause) could make the wrong decision in cases where an ancestral clause of the First clause in question can't match what it's supposed to match if the First clause greedily gobbled up more of the input than it was supposed to have.

I had some examples of this, but I forget what they were. I can probably try to construct one here:

Grammar:
    A <- "a " B "monkeyapples";
    B <- "million " / "million monkey";

Input:
    a million monkeyapples

Following the correct First semantics will match the whole input; instead choosing longest subclause match over first matching subclause would cause the parser to try to match "a million monkeymonkeyapples", and it will fail to match the input.

But as I said above, I agree that almost always, the longest subclause match will match the intention of the use of ordered choice in the grammar.

I'll revisit your second-to-last comment a bit later when I have more time to think about it... you may be onto something there.

lukehutch commented 3 years ago

where a leading choice immediately replaces a trailing choice but a trailing choice can be better by a following condition. That's in itself not a shock, since Pika has practically the exact same rule (though less sophisticated).

Correct -- but this is exactly how the First operator is defined, so no, it shouldn't be a shock. The First operator doesn't have any match selection logic other than this when applied top-down. The semantics when matching bottom-up should be exactly the same as the semantics when recursing top-down.

Now, here is the catch: Squirrel goes through the choices in order whereas Pika goes through the choices in reverse order. [...] So while the rules for replacing matches look the same, swapping the operands significantly changes their meaning.

It might seem that way, but that's not true. Pika works right-to-left then bottom-to-top through the input, but it still applies each rule left-to-right from the current start point. This is the only way that later match references can actually refer to entries in the memo table that have already been memoized (and the left-to-right matching order within a single PEG rule is both mandated by the definition of PEG rules, and it is also -- almost counterintuitively -- what required the DP formulation of packrat parsing in the pika parser to have to work right to left).

So Squirrel asks "does this leading choice replace this later choice" whereas Pika asks "does this later choice replace this leading choice".

No, for both parsers, the choice as to whether or not to replace an earlier choice with a later choice only applies to a specific memo table entry, and for both parsers, the memo table is keyed with (clause, position). So the goal for both parsers (with the isBetterThan method) is to try to only replace lower memo entries in the final parse tree with higher memo entries in the same parse tree. The "only" way for this to happen for any given (clause, position) key is if the parser looped around a cycle in the grammar. And if the parser looped around a cycle in the grammar, the parser must have consumed at least one character during the course of that loop, in order to not get stuck in an infinite loop.

This is why a longer match almost always beats a shorter match. But I put "only" in quotes, because there is one subtle caveat to this: if a First clause is involved in a cycle, and if that grammar cycle is "unrolled" by matching against the input, then it's possible that a lower match of the First clause in the parse tree is supposed to match an earlier subclause of the First clause, but the higher match of the First clause in the parse tree is supposed to match a later subclause. In this case, the pika parser would not parse the input as intended, because it will not see the higher match as a better match than the lower match.

Let me see if I can create a pathological grammar and input example that illustrates this difference...

Grammar:
    A <- B / 'x';
    B <- (A 'y') / (A 'x');

Input:
    xxyx

The pika parser is not able to complete the last step of replacing the toplevel match of B = (A 'y') with B = (A 'x'), because (A 'y') has a lower subclause index than (A 'x') in the First clause of B. So it parses xxy, but leaves the final x dangling:

                                  ┌─────┐░   
                5 : A <- B / 'x'  │x x y│░   
                                  ├─────┤░   
      4 : B <- (A 'y') / (A 'x')  │x x y│░   
                                  ├─────┤░   
                       2 : A 'y'  │x x y│░   
                                  ├───┐ │░   
                5 : A <- B / 'x'  │x x│ │░   
                                  ├───┤ │░   
      4 : B <- (A 'y') / (A 'x')  │x x│ │░   
                                  ├───┤ │░   
                       3 : A 'x'  │x x│ │░   
                                  ├─┐ │ ├─┐  
                5 : A <- B / 'x'  │x│ │ │x│  
                                  ├─┼─┤ ├─┤  
              1 : [terminal] 'x'  │x│x│ │x│  
                                  │ │ ├─┤ │  
              0 : [terminal] 'y'  │ │ │y│ │  
                                   0 1 2 3 
                                   x x y x

Actually the squirrel parser handled this situation "correctly" (was able to parse the whole input), but that was because of a bug in the isBetterThan check, which used to read

        if (this.firstMatchingSubClauseIdx < other.firstMatchingSubClauseIdx) {
            return true;
        }

but should have read

        if (this.firstMatchingSubClauseIdx < other.firstMatchingSubClauseIdx) {
            return true;
        } else if (this.firstMatchingSubClauseIdx > other.firstMatchingSubClauseIdx) {
            return false;
        }

so that it only fell through to checking the length of subclauses if the firstMatchingSubClauseIdx was equal for both this and other.

So I "fixed this" in squirrel parser (and broke its ability to parse this grammar).

I'm curious if an argument can be made though that a "correct" left-recursion-capable parser should be able to handle this grammar. Can the Python parser you have referred to parse this example OK?

lukehutch commented 3 years ago

I think I understand the issue with First... and this is really subtle...

When matching a rule like B <- (A 'y') / (A 'x'), you have to make sure you're referring to the same A. The way that the pika parser works, the first A is the longest A that was able to be matched so far that was also followed by a 'y' (at the top of the match tree I showed in the previous comment, this A has matched xx). But the second A is the longest A that was able to be matched so far that was also followed by an 'x' (at the top of the match tree, this A is a later match of the rule, and it has matched xxy). So this is not an apples-to-apples comparison.

If we're updating a memo entry at (A, pos) in the memo table from an older match to a newer match, then we must have looped at least once around a cycle in the grammar. Each loop around the cycle must consume at least one more character than the previous loop around the cycle. But a First clause does not itself consume any characters per se -- only Seq or OneOrMore/ZeroOrMore can do that, in combination with one or more terminals.

If we assume that the clauses in the cycle other than the First clause have consumed at least one character, then the match of the First clause in this cycle has to be higher in the parse tree than the match from the same First clause that was memoized in the previous loop around the cycle. So we should simply accept the current First clause match and completely ignore which subclause index matches (i.e. not compare the first matching subclause index to the index for the current memo entry). In other words, for all PEG operators except First, a longer match beats a shorter match, but for First, the current match should always beat an older match. Then the First clause is relying on other clause types to make progress while in a cycle (growing the parse tree), so that it knows that the latest match of the First clause must always be the correct match.

However, this would allow us to create a grammar that contains a cycle directly between two First clauses that causes the pika parser to get stuck in an infinite loop, e.g.:

A <- B / 'x';
B <- A / 'x';

The current implementation of isBetterThan enforces monotonic progress: the matching subclause index must get strictly lower, or the length has to get strictly longer than the old length, for the memo entries to be updated. If First simply always matched, no matter what, then this cycle will not terminate given the simple input x.

So we still need to check firstMatchingSubClauseIdx in order to give monotonic convergence, but it could be checked if the length of the match has not increased. In other words, reverse the order of the match length check and the firstMatchingSubClauseIdx check.

    if (this.len > other.len) {
        return true;
    } else if (this.firstMatchingSubClauseIdx < other.firstMatchingSubClauseIdx) {
        return true;
    }
    return false;

But that would seem to change the semantics of the First operator, as we discussed earlier.

I can think of another scenario where this might not be the right thing to do, but it's a bit more esoteric, since it involves the NotFollowedBy operator...

Grammar:
    A <- (B !'y') / 'x';
    B <- A 'x';

Input:
    xxxy

This grammar fragment (starting with rule A) should only ever match the first 'x', because the expansion of B will keep matching x characters until it hits y, at which point (B !'y') will fail to match. So the grammar should fail over to the second subclause of the First clause of A. At the point where (B !'y') fails to match, the length of the matching subclause of A drops from 2 (xx, from (B !'y')) to 1 (x, from / 'x'), while firstMatchingSubClauseIdx increases from 0 to 1.

So length should not be the primary factor in deciding whether to replace a memo in the memo table either... (I'm still not sure what the right answer is yet...)

maxfischer2781 commented 3 years ago

So length should not be the primary factor in deciding whether to replace a memo in the memo table either... (I'm still not sure what the right answer is yet...)

Fully agree with that. And obviously I have been too eager in spinning theories, so I'll take a step back and focus on asking questions if you don't mind.

Thanks especially for clearing up my confusion about right-to-left ordering.

Can the Python parser you have referred to parse this example OK?

I'm afraid the Python parser is literally Python's parser (well, CPython's), and it's not easy to instrument for anything but the language itself since there are some layers of the interpreter setup seeping in. (The high-level description is in PEP 617 if you are interested.)

I'll try to dig up whether this can be tested well.

If we assume that the clauses in the cycle other than the First clause have consumed at least one character, then the match of the First clause in this cycle has to be higher in the parse tree than the match from the same First clause that was memoized in the previous loop around the cycle. So we should simply accept the current First clause match and completely ignore which subclause index matches (i.e. not compare the first matching subclause index to the index for the current memo entry). In other words, for all PEG operators except First, a longer match beats a shorter match, but for First, the current match should always beat an older match. Then the First clause is relying on other clause types to make progress while in a cycle (growing the parse tree), so that it knows that the latest match of the First clause must always be the correct match.

This got me wondering: Is it even possible for First to match a "worse" subclause after matching a "better" subclause? As per my understanding, First matching is a loop over its subclauses with the first memoized subclause winning.
Say we have choice <- A / B, where both A and B can match but B can also match again via recursion. Once B matches it activates choice, which checks A (no match) then B (match) causing the match choice = B. Then, A matches and activates choices, which checks A (match) and stops, causing the match choice = A. Now, once B matches again it activates choice, which checks A (match from the previous cycle) and stops, causing the match choice = A again.

Basically, since choice does not "know" in which cycle it can only look at the latest match of each subclause, not at the current match. But the latest match might actually be from a previous cycle. Is my interpretation correct? Would it be sufficient if First gets informed which subclause triggered it?[*]

[*] I've been pondering this only from a complexity standpoint so far. It seemed like a waste to have First potentially check all its subclauses to find which one triggered it.

However, this would allow us to create a grammar that contains a cycle directly between two First clauses that causes the pika parser to get stuck in an infinite loop,

This is something I didn't find specified in the paper, so I guessed: in my Pika parser, references themselves are clauses which merely refer to the clauses of the corresponding rule as subclauses. So given a rule A <- B / 'x';, the B is a clause that adds a level of indirection to the content of the rule B. Even if the rule B is a First and thus uses index/always matching, the reference B still enforces monotonic progress.

Am I correct to assume that's not the case in the reference Pika, i.e. references are directly resolved to their rule's clauses?

lukehutch commented 3 years ago

This got me wondering: Is it even possible for First to match a "worse" subclause after matching a "better" subclause? As per my understanding, First matching is a loop over its subclauses with the first memoized subclause winning.

Yes, it's possible.. this is what this grammar example was supposed to show, in my previous comment:

Grammar:
    A <- (B !'y') / 'x';
    B <- A 'x';

Input:
    xxxy

The First's first subclause will match x, then xx, then it will fail to match xxx (because it is followed by y), so it will fail over to matching the 'x' subclause, which will match just the first x of the input. So it goes from matching xx to matching x.

And actually as I have continued to think about this example, I found an apparent contradiction in what the "optimal" behavior of a PEG parser should be, and I don't know how to resolve this...

In the following grammar, the left-recursive part (rules B and C) should try to match as many x characters as possible (i.e. xxx), growing the tree downwards. Then A can match the final y:

Grammar:
    A <- B 'y';
    B <- C / 'x';
    C <- B 'x';

Input:
    xxxy

However if B and C were to try to match even one extra x (i.e. xxxx) before the y, the rule B would go from matching to no longer matching, and then A could no longer match B 'y'. Therefore, when expanding a left-recursive tree, you want to only keep expanding the tree as long as the match keeps improving (i.e., in the absence of First, you keep growing the parse tree downwards, once per loop around the left recursive cycle, consuming at least one character for each loop around the cycle). As soon as an attempt to match one more time around the loop results in a match that is no better than the previous match (i.e. no more progress is made), or a non-match (which in this case would be trying to match xxxx), then you stop trying to grow the left-recursive tree, and discard the result of the last match attempt.

But now consider this example:

Grammar:
    A <- B "xxy";
    B <- C !'y' / 'x';
    C <- B 'x';

Input:
    xxxy

This is a combination of the two previous grammars. What is happening here is B and C try to keep matching more and more x characters until one is followed by a y character, at which point the last match fails, and the First clause ends up selecting the second subclause, / 'x'. This grammar should match the input as A = ((B = 'x')"xxy").

But this match is only possible if the clause C !'y' is allowed to go from matching xx to no longer matching xxx (because xxx is followed by y). This contradicts the logic found above, that we have to keep expanding left-recursive matches as long as the matches keep improving, until the match no longer improves or we get a mismatch -- and then we discard the last match result, so that we only keep the longest or best possible match. If we do this here, then the C !'y' clause will be left matching xx, and the / 'x' clause will never be triggered, so rule A will never match.

Does that make sense? The screwy part of these rules is the NotFollowedBy clause, !'y', since that can cause a match to suddenly become a non-match as the match length gets longer, but it's probably possible to construct a situation like this without that clause type.

I'm really puzzled by this... I have no idea how to reconcile these two things. Allow matches to become non-matches only if they are descendant clauses of First clauses? That's starting to get really complicated..

Say we have choice <- A / B, where both A and B can match but B can also match again via recursion. Once B matches it activates choice, which checks A (no match) then B (match) causing the match choice = B. Then, A matches and activates choices, which checks A (match) and stops, causing the match choice = A. Now, once B matches again it activates choice, which checks A (match from the previous cycle) and stops, causing the match choice = A again.

Basically, since choice does not "know" in which cycle it can only look at the latest match of each subclause, not at the current match. But the latest match might actually be from a previous cycle. Is my interpretation correct? Would it be sufficient if First gets informed which subclause triggered it?[*]

[*] I've been pondering this only from a complexity standpoint so far. It seemed like a waste to have First potentially check all its subclauses to find which one triggered it.

Well this was sort of my operating assumption too, which is why I recorded firstMatchingSubclauseIdx inside each Match node. This would just be set to zero for non-First clause matches, but would indicate which subclause matched for First clause types.

But you are right that by looking in the memo table, the current clause being parsed can only know what the latest match is of a given rule. And for the example I gave with grammar A <- B / 'x'; B <- (A 'y') / (A 'x'); and input xxyx, the problem was exactly what you describe: the (A 'y') match came from a previous loop around the grammar cycle, and the (A 'x') match came from the most recent loop around the grammar cycle (so the A reference was for a lower or shorter match, and a higher or longer match, respectively) -- but the First clause for B had no way to tell that. Consequently it just selected the first match, which is (A 'y'), and so it never selected the correct match, (A 'z').

Actually this is another example of what I was talking about, which a memo table entry going from matching to non-matching. Technically for the most recent match of A (which was xxy, (A 'y') should have gone from matching to no longer matching, since xxyy does not match the input. So this is an example of where a subclause of a First clause should allow memo table entries to go from matching to no longer matching. (But this doesn't work for clauses that are not subclauses or descendant clauses of non-First clauses, for the reason I gave earlier, when I discussed the first half of the contradiction.)

This is something I didn't find specified in the paper, so I guessed: in my Pika parser, references themselves are clauses which merely refer to the clauses of the corresponding rule as subclauses. So given a rule A <- B / 'x';, the B is a clause that adds a level of indirection to the content of the rule B. Even if the rule B is a First and thus uses index/always matching, the reference B still enforces monotonic progress.

Am I correct to assume that's not the case in the reference Pika, i.e. references are directly resolved to their rule's clauses?

In the pika parser, yes, I replaced RuleRef nodes in the grammar with direct references to the top clause of the referenced rule. In the squirrel parser, I decided to only memoize rules, not every individual clause within the clause tree of a rule, so instead for each RuleRef clause, before parsing I cached a direct reference to the referenced rule's top clause in the RuleRef node (so that the top caluse doesn't have to be looked up by name every time that rule is referenced).

But referencing another rule only enforces monotonic progress if that other rule consumes at least one character.

lukehutch commented 3 years ago

(The high-level description is in PEP 617 if you are interested.)

Interesting -- they added the following extension to PEG, maybe because of some of the First weirdness that we have discussed in this thread:

~: Commit to the current alternative, even if it fails to parse.

rule_name: '(' ~ some_rule ')' | some_alt

In this example, if a left parenthesis is parsed, then the other alternative won’t be considered, even if some_rule or ‘)’ fail to be parsed.

I finally figured out the correct logic for First, after a lot of convoluted thought!

If we disallow old matches in the memo table being replaced with new mismatches (i.e. where the mismatch would be higher in the parse tree than the match, after an extra loop around a grammar cycle), then:

If the firstMatchingSubclauseIdx of a First clause match decreases, then the length of the match must have increased compared to a previous loop around the grammar cycle (since at the earlier point, there wasn't enough input matched to match the earlier subclause of the First clause). So a decrease in firstMatchingSubclauseIdx implies that the length of the match must have increased.
If there's a new match, and firstMatchingSubclauseIdx stays the same, then length of the match of the First clause must have increased, because we require each loop around the cycle to consume at least one character in order to guarantee termination (via monotonic progress).
If firstMatchingSubclauseIdx increases, then the new matching subclause could be of any length, shorter or longer, and it may not even evaluate subclauses in the same grammar cycle.

Case 3 is the problematic case, since we can't only accept longer new matches as better than shorter old matches. However, since the new match can be any length in just case 3 out of these three cases, and since in the other two cases, whether or not the new match of the First clause is better than the old match in the memo table for the First clause can be determined just by looking at the match length, we can create the following condition:

newMatch.isBetterThan(oldMatch) = newMatch.firstMatchingSubclauseIdx > oldMatch.firstMatchingSubclauseIdx
        || newMatch.len > oldMatch.len;

This condition has to be monotonic, since if case 3 is triggered and firstMatchingSubclauseIdx increases, then in future after one or more further loops around the grammar cycle, if firstMatchingSubclauseIdx stays the same or decreases, case 2 or 1 will apply. Therefore, the grammar cycle is guaranteed to terminate.

Interestingly the first condition, newMatch.firstMatchingSubclauseIdx > oldMatch.firstMatchingSubclauseIdx, is the opposite of what I was testing before based on an overly-simplistic understanding of First semantics, i.e. newMatch.firstMatchingSubclauseIdx < oldMatch.firstMatchingSubclauseIdx.

I would be curious to know how this change works with your unit tests.

lukehutch commented 3 years ago

Note that even with the change suggested in the previous comment, the pika parser (at least as I implemented) will not be able to match some grammars, and this is a result of the level of granularity of memoization that I chose, in combination with the decision that matches should never become mismatches in the memo table (so that expanding left recursion never "overshoots"). For the pika parser, every Clause node anywhere in the grammar is memoized. (In the squirrel parser, I chose only to memoize matches of Rule instances, not Clause instances.)

So given this grammar and input:

Grammar:
    A <- B / 'x';
    B <- (A 'y') / (A 'x');

Input:
    xxyx

the pika parser matches only xxy (because the first subclause of B's First clause, (A 'y') needs to go from matching to no longer matching in the memo table for B to fail over to the second subclause to consume the very last character of the input, and the rule is that matches don't become non-matches in the memo table). So actually as I mentioned previously, the (A 'y') in the memo table uses an A match from lower down in the parse tree than the A in the final (A 'x') that we want to match.

However in the squirrel parser, I decided to memoize only rules, not clauses. So (A 'y') and (A 'x') within the same rule always refer to the same A, and the squirrel parser can parse this grammar. This coincides much more closely with human intuition (not to mention that reducing the amount of memoization speeds up parsing).

I think I now have a good enough understanding off all of the weirdness between left recursion, PEG's greedy behavior, top-down vs. bottom-up, guaranteeing monotonic progress, and interactions between First and other clause types, to write this up into a new paper. I'll keep you posted if/when I get this done.

lukehutch commented 3 years ago

Dangit, I'm wrong again... I found a simple example that my suggested new isBetterThan doesn't work for. The following grammar causes the parser to get into an infinite loop.

Grammar:
    A <- (A 'y') / 'x';

Input:
    xy

A will not match initially, then (expanding the grammar cycle upwards) will match x (using the second subclause of the First clause), then it will match xy (using the first subclause of the First clause), then it will try to match xyy and fail, so it will fall over to the second subclause of the First clause, try to match x, and succeed, because of the addition to isBetterThan of newMatch.firstMatchingSubclauseIdx > oldMatch.firstMatchingSubclauseIdx.

I think the correct logic for First should be: "follow the semantics of First (select the first matching subclause), but don't update a memo in the memo table unless the new match is longer than the old match".

This is exactly what you tried. It reduces to simply:

newMatch.isBetterThan(oldMatch) = newMatch.len > oldMatch.len;

If a memo table entry never goes from matching to mismatching, and it's only ever updated if the match gets longer, then the firstMatchingSubclauseIdx of a match of a First clause can never increase.

This actually matches the intended semantics of the recursive descent behavior of the First operator when there is no left-recursive cycle involved: the first matching subclause consumes as many characters as it can, but no more than that, and you never fail over to the next subclause because you attempted to match more characters than an earlier subclause can handle.

So in conclusion, I think you were right to suggest simply using newMatch.len > oldMatch.len as the memo replacement rule.

maxfischer2781 commented 3 years ago

Using just newMatch.isBetterThan(oldMatch) = newMatch.len > oldMatch.len; seems tempting but also tricky. Are you sure this works miss-behaved grammars? Looking at your own example from above:

Grammar:
    A <- "a " B "monkeyapples";
    B <- "million " / "million monkey";

The B clause is not well-behaved under PEG – the second option should never be matched. Wouldn't just the "new match is longer" be stuck on the second (wrong) option since the first option cannot match longer?

Looking at what we've discussed here, I feel like we are circling around trying to solve two problems at once:

First matching again at the same position due to iterating its options, and
First matching again at the same position due to recursion.

And there is some strain from that – e.g. "simply using newMatch.len > oldMatch.len" solves the recursion, not the iterating. Do you think it's possible to separate the two? I think your cases 1-3 go into that direction.

I still haven't quite worked out whether the ordering of Pika's matches is reliable enough to say newMatch.firstMatchingSubclauseIdx < oldMatch.firstMatchingSubclauseIdx must imply a circle and newMatch.len > oldMatch.len; must apply then. My show-stopper is that the iteration is stuck on an earlier memoised match of a lower-index subclause, never seeing a new higher-index subclause (thus getting stuck). Do you think telling clauses which sublcause triggered them would be sufficient in this case?

lukehutch commented 3 years ago

I'm starting to think it's not possible to reconcile the two desired behaviors of First. Given a left-recursive rule E, either you match as many nested instances of E as possible, but no more than that (i.e. you stop once the match can't be improved, or changes from a match to a non-match, but you keep the best match, and discard the result of the last match attempt), or you allow one mismatch beyond the best match (i.e. you stop once the match can't be improved, and you keep the result of the last match attempt, which may be a mismatch, even though the previous attempt produced a match). Some grammar examples I gave require the first behavior to parse, and some grammar examples I gave require the second behavior. You can't pursue both behavioral options to see which option is the optimal option for a given grammar, because that would make the parser semantics nondeterministic, and that may lead to exponential degradation of parsing performance on a deterministic Turing machine.

There is one principle that is emerging as most important in my mind, which is that PEG is relentlessly greedy, and with the exception of First subclauses, only ever considers one parsing option in a given start position. If we think about how a recursive descent PEG parser works in the non-left-recursive case, and try to make the left-recursive behavior mirror the behavior of the non-left-recursive case, we'll hopefully produce a parser that doesn't cause any surprises for the user. And because the PEG parser is greedy, but will only result in nodes in the parse tree that matched, not mismatched, I think only the first of the two behaviors I described in the previous paragraph match the standard recursive descent behavior. That means there will be inputs that cannot be parsed given a certain grammar (but that is true of any parser type).

So the monkeyapples example you cited is a good example to look at. It's not left-recursive at all, so a standard PEG parser can parse it. And any standard recursive descent parser should correctly parse the input a million monkeyapples, because the first subclause of B will parse, so it will never look at the second (wrong, and longer) option.

I just tested this grammar and it also parses fine with both the pika parser and the newer squirrel parser, using the logic newMatch.isBetterThan(oldMatch) = newMatch.len > oldMatch.len;. I agree that that seems strange, since the second subclause of the First clause, / "million monkey", is longer, so it seems like that should be selected rather than the first subclause. However, isBetterThan is only called when deciding whether or not to update an entry in the memotable for a specific clause at a specific start position, when there is already an older match in the memo table (implying that we looped around a cycle in the grammar at least once). But newMatch is a new match determined by applying the logic of the First clause, which doesn't select the longest clause, it still selects the first matching clause, in left-to-right subclause order. So if the second subclause is never selected because the first subclause matches, then the second subclause (the longer match) will never even be considered. That's what happens here, because the first subclause is actually a prefix of the second subclause. The isBetterThan test is always applied after the regular First matching logic.

Let me know what you think, given that clarification.

lukehutch commented 3 years ago

One more thought, I wanted to explicitly connect the first half and the second half of my last comment, since it seems like I'm talking about two separate topics...

isBetterThan only returns true if there was already a match in the memo table (and the new match beats the old match), which only happens if you have looped around a left-recursive cycle at least once.
Consider the case where the parser has looped around a left-recursive grammar cycle at least once, matching each time. The First clause matches with subclause i_prev the previous time around the loop, and matches with subclause i_curr this time around the loop. Then:
1. If i_curr < i_prev, then subclause i_curr must have required a longer match length than was available previously, in order to satisfy all its match conditions. So newMatch.len > oldMatch.len must be true, and isBetterThan will return true.
2. If i_curr == i_prev, it's correct to check if newMatch.len > oldMatch.len.
3. It's impossible to reach the situation i_curr > i_prev unless it's possible for the matching subclause with index i_prev to match in the previous cycle, and then no longer match in the current cycle.

So this is where we get back to the first two paragraphs of my previous comment. If we disallow an entry in the memo table to go from indicating a match in the previous loop around a grammar cycle to no longer matching in the current loop around the grammar cycle (in order to try to replicate the behavior of standard recursive descent parsing in the left recursive case), then we don't have to care about whether the length increases or decreases in case (iii), because case (iii) is impossible.

Therefore, as long as a match in the memo table can't be replaced with a non-match, then just checking if newMatch.len > oldMatch.len is sufficient to handle matches of all clause types, including First.

lukehutch commented 3 years ago

So with this new isBetterThan criterion, the squirrel parser seems to handle any sort of left-recursive grammar that I can throw at it. (The pika parser still has some issues with left-recursive matches of First in some grammars, because I'm memoizing at the level of individual clauses, as opposed to at the coarser level of rules rather than clauses, as with the squirrel parser -- I discussed this in an earlier comment.)

The parsing logic in the squirrel parser keeps getting clearer and simpler too... the latest result should hopefully be easier to understand than the original version I linked to. (I incorporated isBetterThan directly into the match method in this version, because isBetterThan is now such a simple test.) It's also blazingly fast, much faster than the pika parser, and is linear in performance for any grammar or input that I have thrown at it. (It doesn't have any error recovery yet, so the pika parser is still superior if you need optimal error recovery.)

https://github.com/lukehutch/squirrelparser/blob/main/src/main/java/squirrelparser/parser/MemoEntry.java

maxfischer2781 commented 3 years ago

Therefore, as long as a match in the memo table can't be replaced with a non-match, then just checking if newMatch.len > oldMatch.len is sufficient to handle matches of all clause types, including First.

This seems indeed to be the case. Interestingly enough, it seems to also match the rule from Medeiros et al.
"Fortunately, it is sufficient to increase the bound until the size of the matched prefix stops increasing."

(The pika parser still has some issues with left-recursive matches of First in some grammars, because I'm memoizing at the level of individual clauses, as opposed to at the coarser level of rules rather than clauses, as with the squirrel parser -- I discussed this in an earlier comment.)

This is something I had not understood earlier, so it's good to see you pick it up again.

What I find interesting is that for the cases where rule and clause memoization are equivalent, Pika seems to naturally have the "bounded left recursion" behaviour proposed by Medeiros. I assume that with just rule memoization, it might also have the "bounded left recursion" behaviour for choices/First. Interestingly enough, Pika even has the same right-associativity bias (e.g. in E <- E '+' E / num) of Medeiros/Warth even though it matches right-to-left.

Do you think it would be possible to have a "rule memozation Pika"? I assume it would be possible to match the clauses top-down, but the rules bottom-up, would you agree? Even if that would be slower than pure top-down, Pika's simplicity is a very attractive feature.

lukehutch commented 3 years ago

Therefore, as long as a match in the memo table can't be replaced with a non-match, then just checking if newMatch.len > oldMatch.len is sufficient to handle matches of all clause types, including First.

This seems indeed to be the case. Interestingly enough, it seems to also match the rule from Medeiros et al. "Fortunately, it is sufficient to increase the bound until the size of the matched prefix stops increasing."

Hmm, I changed my mind again -- in spite of my 3-part breakdown of cases, this is incorrect for the case of First clauses! And I bet that Medeiros et al.'s parser (and Warth's parser, for that matter) can't parse some of the grammars I posted earlier in this thread.

It really is valid for a First clause to fail over to a shorter matching subclause. I finally figured out how to reconcile this with the newMatch.len > oldMatch.len condition, however.

If we don't ensure that newMatch.len > oldMatch.len for all clause types except First, then the upwards expansion of the parse tree for a left recursive cycle can become an infinite loop. In fact it's quite easy to construct a grammar that triggers an infinite loop (i.e. an infinitely deep parse tree), if First is allowed to select whichever subclause currently matches.

However, a couple of the grammars I gave earlier in the thread produces the wrong parses in both the pika parser and the newer squirrel parser. I fixed one of them in the squirrel parser, but the pika parser is broken, and I previously attributed this to the pika parser memoizing at the level of individual clauses. (To answer your question, yes, you could memoize just at the level of rules rather than clauses in the pika parser, with very simple code changes, and I will probably make that change in the pika parser in future, since in my testing it doubled parsing speed of the new squirrel parser, and it may more than double speed in the pika parser, since hashtable lookup is slower than recursing through a few non-memoized grammar clauses.)

But then I realized that granularity of memoization is not the problem with the pika parser per se. The problem is that a First clause needs the latest match of each subclause, not the best match (i.e. not whatever was memoized, given the newMatch.len > oldMatch.len rule). So the way to solve this is to store in a memo entry not just the best match, but also the latest match. (Both these matches will be the same if the latest match was the best so far.) Then any First clauses will look at the latest match, whereas other clause types will look at the best match. The reason this is needed is that the latest match can go from matching to no longer matching as you continue to expand a left recursive cycle upwards, but the best match cannot, because of the newMatch.len > oldMatch.len rule. (Incidentally probably the NotFollowedBy clause type also needs to look at the latest match, not the best match, because that is the other clause type that actually depends upon subclauses not matching to function correctly.)

Then the second point I realized is that the subclauses of a First clause must each individually satisfy the monotonic progress rule, i.e. newMatch.len > oldMatch.len, for their best match (but not for their latest match). If each subclause follows that rule, then it doesn't matter which subclause you switch to, or whether that match gets longer or shorter, the upwards expansion will still monotonically terminate. So I think the logic for First would be "check the latest match of each subclause in term, then when a matching subclause is found, use the best match (not the latest match) for that same clause as the match for the whole First clause".

The third point I realized is that to have both the best match and the latest match accessible to a First clause, you need to memoize all the subclauses of a First clause, even if it's in the middle of a tree of clauses inside a rule, when you're only memoizing rules.

There are a couple of special cases -- what if you have a First clause as a subclause of another First clause? And what if a subclause of a First clause is a RuleRef, e.g. A <- B / C? (In this case, what if B is a First clause, and what if it's not?) I'm trying to figure out the simplest correct way to handle this, implementationally.

What I find interesting is that for the cases where rule and clause memoization are equivalent, Pika seems to naturally have the "bounded left recursion" behaviour proposed by Medeiros. I assume that with just rule memoization, it might also have the "bounded left recursion" behaviour for choices/First.

Correct, the Pika parser does something very similar to Medeiros' proposed behavior. And I was looking at the issue incorrectly by blaming this on memoization alone. The real issue is that First clauses need access to the latest value of a match attempt (even if that's a mismatch), not just the best value of a match.

Interestingly enough, Pika even has the same right-associativity bias (e.g. in E <- E '+' E / num) of Medeiros/Warth even though it matches right-to-left.

Yes -- so does the squirrel parser. It's super weird, but if you work through an example like 1+2+3+4, it actually makes a lot of sense.

I thought it would make sense to try to make grammars of ambiguous associativity like the example you gave act as left-recursive by default. And actually my parser was then no longer able to parse in a precedence-observing way, using a precedence-climbing grammar. For example, with a grammar supporting the standard arithmetic operators, the modified parser (which prioritized collecting terms to the left before expanding matches to the right) was not able to parse 1+2+3*4, because it would grab 1+2+3 and then not know what to do with *4. There's probably a way to fix this (I would imagine by combining an algorithm like the operator-precedence parser into the left recursion expansion loop) -- but I don't think it's worth it.

Do you think it would be possible to have a "rule memozation Pika"? I assume it would be possible to match the clauses top-down, but the rules bottom-up, would you agree?

Yes, once I figure out the simplest possible way to incorporate this new First logic (and the new memoization logic, whereby two matches are stored per memo entry, not one), then I'll try to backport the ideas onto the pika parser. That might be a ways off though, since I'm more focused on the new squirrel parser right now. (The ideas are directly applicable to the pika parser though, since the squirrel parser and the pika parser are in some ways mirror images of each other, and they share a ton of code similarities.)

Even if that would be slower than pure top-down, Pika's simplicity is a very attractive feature.

I'm glad you like the pika parser's simplicity -- and it seems simple in the paper (maybe it was even simple to re-implement), but there was quite a lot of complexity in the setup that I didn't cover in the paper (such as the topological ordering of clauses, resolving RuleRefs to direct references to clauses, preserving AST labels on subclauses, etc.) -- this stuff made the final reference implementation a bit of a mess. The new squirrel parser is much cleaner and simpler in this respect, and it's dramatically faster. The only reason the pika parser still shines is that it has arguably optimal error recovery. (I still haven't figured out error recovery for the new squirrel parser, but it almost certainly won't be as good or as robust as the pika parser, since that's as good as it gets.)

lukehutch commented 3 years ago

I thought of a more concrete way to think about left recursion, and now I'm starting to think that the problem really is underspecified.

I wrote a function:

static String unrollGrammar(String ruleName, String ruleClauses, int n) {
    var buf = new StringBuilder();
    for (int i = n - 1; i >= 0; i--) {
        buf.append(ruleName + i + "<-"
                + ruleClauses.replace(ruleName, i == 0 ? "'-'" : ruleName + (i - 1))
                + ";\n");
    }
    return buf.toString();
}

The idea is to take a grammar rule like A <- (A 'x' !'y') / 'x', and unroll it the specified number of times. At the bottom level, I just replaced A with something that won't match the input, in this case '-'.

Critically, this removes left recursion -- it just expands left recursion into standard recursion to a fixed depth. That means that standard recursive descent parsing will work without any special left recursive cycle handling.

Now observe what happens if I try to parse xxxxy, with the unrolling depth ranging from 1 to 10:

var input = "xxxxy";
for (int i = 1; i < 10; i++) {
    var ug = unrollGrammar("A", "a:((A 'x' !'y') / 'x')", i);
    System.out.println(ug);
    var grammar = MetaGrammar.parse(ug);
    tryParsing(grammar, input);
}

A0<-a:(('-' 'x' !'y') / 'x');

└─a : 0+1 : "x"

A1<-a:((A0 'x' !'y') / 'x');
A0<-a:(('-' 'x' !'y') / 'x');

└─a : 0+2 : "xx"
  └─a : 0+1 : "x"

A2<-a:((A1 'x' !'y') / 'x');
A1<-a:((A0 'x' !'y') / 'x');
A0<-a:(('-' 'x' !'y') / 'x');

└─a : 0+3 : "xxx"
  └─a : 0+2 : "xx"
    └─a : 0+1 : "x"

A3<-a:((A2 'x' !'y') / 'x');
A2<-a:((A1 'x' !'y') / 'x');
A1<-a:((A0 'x' !'y') / 'x');
A0<-a:(('-' 'x' !'y') / 'x');

└─a : 0+1 : "x"

A4<-a:((A3 'x' !'y') / 'x');
A3<-a:((A2 'x' !'y') / 'x');
A2<-a:((A1 'x' !'y') / 'x');
A1<-a:((A0 'x' !'y') / 'x');
A0<-a:(('-' 'x' !'y') / 'x');

└─a : 0+2 : "xx"
  └─a : 0+1 : "x"

A5<-a:((A4 'x' !'y') / 'x');
A4<-a:((A3 'x' !'y') / 'x');
A3<-a:((A2 'x' !'y') / 'x');
A2<-a:((A1 'x' !'y') / 'x');
A1<-a:((A0 'x' !'y') / 'x');
A0<-a:(('-' 'x' !'y') / 'x');

└─a : 0+3 : "xxx"
  └─a : 0+2 : "xx"
    └─a : 0+1 : "x"

A6<-a:((A5 'x' !'y') / 'x');
A5<-a:((A4 'x' !'y') / 'x');
A4<-a:((A3 'x' !'y') / 'x');
A3<-a:((A2 'x' !'y') / 'x');
A2<-a:((A1 'x' !'y') / 'x');
A1<-a:((A0 'x' !'y') / 'x');
A0<-a:(('-' 'x' !'y') / 'x');

└─a : 0+1 : "x"

A7<-a:((A6 'x' !'y') / 'x');
A6<-a:((A5 'x' !'y') / 'x');
A5<-a:((A4 'x' !'y') / 'x');
A4<-a:((A3 'x' !'y') / 'x');
A3<-a:((A2 'x' !'y') / 'x');
A2<-a:((A1 'x' !'y') / 'x');
A1<-a:((A0 'x' !'y') / 'x');
A0<-a:(('-' 'x' !'y') / 'x');

└─a : 0+2 : "xx"
  └─a : 0+1 : "x"

A8<-a:((A7 'x' !'y') / 'x');
A7<-a:((A6 'x' !'y') / 'x');
A6<-a:((A5 'x' !'y') / 'x');
A5<-a:((A4 'x' !'y') / 'x');
A4<-a:((A3 'x' !'y') / 'x');
A3<-a:((A2 'x' !'y') / 'x');
A2<-a:((A1 'x' !'y') / 'x');
A1<-a:((A0 'x' !'y') / 'x');
A0<-a:(('-' 'x' !'y') / 'x');

└─a : 0+3 : "xxx"
  └─a : 0+2 : "xx"
    └─a : 0+1 : "x"

The parser will match x, then xx, then xxx, then will fail to match xxxx (because it's followed by a y), so it fails over to matching x again. Then if the recursion depth is even deeper than that, the entire process repeats again.

So let's say we define the effect of handling left recursion as the result of computing the rule:

A <- A∞ / ... / A5 / A4 / A3 / A2 / A1;

then we get an undefined result, because in fact even standard recursive descent parsing ends up looping around the same series of matches again and again.

So what's the correct match for this example, x or xxx? Arguably the correct match is not xx. But should the cycle stop one short of the mismatch (i.e. at xxx, right before the first time a match is repeated), or should it stop the first time there is a mismatch, i.e. with the result x?

I'm really stumped here :-)

maxfischer2781 commented 3 years ago

What you describe seems similar to what Medeiros et al. call "Bounded Recursion".

# left-recursive PEG definition
E <- E '+' num / num
# bounded recursion PEG expansion
E0 <- fail
E1 <- E0 '+' num / num
E2 <- E1 '+' num / num
E3 <- E2 '+' num / num
....

They expand until a match no longer increases (sounds familiar?) or fails, then "backtrack" to the longest successful match. (So if E3>=E4 or E4~>fail, then E=E3).

In your example, that would match xxx.

lukehutch commented 3 years ago

What you describe seems similar to what Medeiros et al. call "Bounded Recursion".

Yes, it's related to that, and it influenced my thinking. But I had a realization tonight that increasing length is not the correct objective to ensure monotonic progress. This is at least in part because once you start adding First into the mix, sometimes you want to backtrack to the longest successful match, and sometimes you actually want to accept the mismatch (after the longest match) as the "correct interpretation" of the left recursive clause -- otherwise you won't get the expected behavior of First (because First actually relies on some subclauses being able to switch from matching to no longer matching when left recursive cycles are expanded).

I wrote up my idea for a better system than expanding until the match length no longer increases, but I didn't click "Comment" yet, because I'm testing the idea out in practice, and it needs a little more refinement. Will add the comment soon.

lukehutch commented 3 years ago

(The gist of the new idea though is that it's not about increasing match length monotonically, it's about ensuring that the same match length is not encountered more than once for the same clause at the same start position, since this implies that the parser got into an infinite loop when expanding the parse tree upwards from the fixed point or seed.)

lukehutch commented 3 years ago

I discovered a strange new grammar example. This grammar is supposed to match an x, followed by zero or more ? characters, followed by a y:

Grammar:
    A <- B 'y';
    B <- ((B / "x?") '?') / 'x';

Inputs:
    xy
    x?y
    x??y
    x???y

There are two possible options for how a parser could interpret these rules:

The B rule keeps expanding the left-recursive rule upwards from the seed match of "x?", implementing the equivalent of the regexp x[?]+, until all ? characters have been consumed, then tries to match (B '?') one more time, and fails, so it fails over to / 'x'. The rule B never goes from matching to mismatching, but the match does get shorter due to First in the last step. Therefore, of the inputs, only xy can be matched.
Same as the previous possibility, but at the last stage, the final failover step is ignored, because it would shorten the match length of B -- so the longest previous match is used. Therefore, all of the options will match.

A truly greedy parser would pick option 1 (maybe this is closer to PEG semantics?), but a "more useful but maybe sometimes broken" parser would pick option 2.

In your opinion, should this grammar match the inputs or not?

lukehutch commented 3 years ago

I ran a bunch of experiments with a parser modified to require unique match lengths for each step of the expansion of a left recursive cycle, rather than requiring monotonically increasing lengths.

It turns out this new heuristic almost always gives exactly the same result as the original heuristic of requiring monotonically increasing lengths. But for cases where there was a mismatch (e.g. for the grammar in the previous comment), I couldn't say that the behavior of the new heuristic (which was the "fully greedy" behavior, number 1) consistently matched my intuition of how the parser should behave -- even though the new heuristic produced much better-defined semantics for First clauses in a left recursive cycle.

It's easier to explain and understand the monotonic length increase heuristic. But this heuristic should come with some big caveats:

A subclause of a First clause cannot match in one loop around a left recursive grammar cycle, and then not match in the next loop around the cycle.
Consequently, a First clause cannot match via subclause number i in one loop around a left recursive grammar cycle, and then match via a later subclause numbered j > i in a later loop around the cycle.
This means the index of the matching subclause of a First clause can only monotonically decrease as the match length increases with each subsequent loop around a left recursive cycle.

So I guess it's probably best to go back to requiring increasing length of match, with the above caveats.

The simplest "band aid" fix for the pika parser would be to memoize whole rules, rather than individual clauses. I don't know if I will have time to do that anytime soon, because I'm hard at work on the new parser. But you would need the "seedParentClauses" to point to rules, not clauses, and you would need to modify the code to only do a memo table lookup from a RuleRef rather than from each clause. (In the current reference implementation of the pika parser, RuleRef instances are turned into direct references, and I would probably reverse this decision to make it easy to find where recursion refers to a new rule, rather than a clause.)

maxfischer2781 commented 3 years ago

In your opinion, should this grammar match the inputs or not?

Yes, in the same way that the same grammar substituting B <- B "?" / "x" would match – even if in the end we fail matching "B plus a "?"", we fall back to the previous longer match.

I've pondered precisely that over the last few days and think my feeling comes down to actually sticking with PEG's greediness. Ultimately, expanding the recursion to the longest match at the expense of earlier matches feels in line with gobbling up the input. I admit though that this seems again to be a judgement call, shifting "should we be always greedy?" to "what does it mean to be greedy?".

The simplest "band aid" fix for the pika parser would be to memoize whole rules, rather than individual clauses. I don't know if I will have time to do that anytime soon, because I'm hard at work on the new parser.

I have a rough idea how to approach that. Let's hope I can find the time to test it; will let you know if I do.

lukehutch commented 3 years ago

I've pondered precisely that over the last few days and think my feeling comes down to actually sticking with PEG's greediness. Ultimately, expanding the recursion to the longest match at the expense of earlier matches feels in line with gobbling up the input.

I agree with this up to the very last step. In the very last step, should the greediness of a parser consume enough matches that it causes a mismatch? That's what this example was about:

Grammar:
    A <- (B !'y') / 'x';
    B <- A 'x';

Input:
    xxxy

In a "fully greedy" parser, the cycle between A and B would consume all three x characters. Then (B !'y') will fail to match, and the final result of matching A against the input will fail over to matching just a single x character.

In a "fully greedy except for the very last step" parser, (B !'y') will match xx, but will never match xxx because that would cause (B !'y') to go from matching two characters to not matching. Therefore A will never fail over to matching just the first character. Therefore the final result will be xx, not x.

This was probably a better example to use to ask the question of which type of greediness seems most appropriate... let me know what you think.

I guess one way to think about it is to replace the left recursion with OneOrMore. It's semantically equivalent (I think)...

    A <- ('x'+ !'y') / 'x';

This rule would follow the "fully greedy" behavior.

But then this comes back to the question of granularity of memoization. If we memoize not just rules but every clause in the grammar, then for the original grammar, the subclause (B !'y') will be memoized separately, and that subclause will never go from matching xx to not matching xxx, so the failover will never happen, and we get the second "fully greedy except for the last step" behavior.

I have a gut feeling though that even if memoization happens at the level of rules, not clauses, there may be a way to construct a larger-scale grammar structure out of rules that exhibits this same problematic overly-greedy memoization behavior (or rather, the behavior that memoized matches are never allowed to become mismatches, which may be needed to match some pattern involving First). But I haven't been able to come up with a good example of this yet.

maxfischer2781 commented 3 years ago

This was probably a better example to use to ask the question of which type of greediness seems most appropriate... let me know what you think.

My preference is for "fully greedy except for the very last step" behaviour.
In general, I would expect E <- (E 'x') / 'x' to be equivalent to E <- x* (Pika uses this equivalency in reverse). It seems in your example, we can collapse A to A <- (A 'x' !'y') / 'x', similar to E, and I would expect the behaviour not to change due to collapsing rules.

I do have a draft for the "top-down clauses, bottom-up references" matching setup now. That still has the nice, simple rules of Pika's matches – most notably there aren't any arcane loops as in Medeiros or Warth.
What is giving me a headache is compressing the canMatchZeroChars, Terminal and seedParentClauses properties from clauses to their rules. I'm not 100% sure if that will be possible without losing the simplicity again.

lukehutch commented 3 years ago

Oh, you're right, I forgot about those properties!

canMatchZeroChars doesn't need to be propagated -- it is determined for all clauses using a depth-first search of the graph of clauses. Just assume that the value for the rule is the value for the topmost clause of the rule.

isTerminal should be true for any rule that has no RuleRefs in its descendant clauses, i.e. where all descendants lead to a terminal but not another rule.

seedParentClauses: this is a little more complicated. You nede seedParentRules. For this, produce a reverse mapping from each clause in the grammar to the rule that it is part of. Then use that map to map from seedParentClauses to seedParentRules.

There is a little bit of extra complexity for at least a couple of the above, depending on whether or not you resolve RuleRefs to direct references to rules, like I do in the reference pika parser (since this removes RuleRefs from the grammar) -- if you do this, you will need to do the above steps before RuleRefs are turned into direct references. The alternative is to leave RuleRefs in place as clauses, and turn their match() method into a memo table lookup. (I didn't do that for speed reasons, but it makes some of the setup code a lot more complicated!)

On Wed, Jun 30, 2021 at 7:40 AM Max Fischer @.***> wrote:

This was probably a better example to use to ask the question of which type of greediness seems most appropriate... let me know what you think.

My preference is for "fully greedy except for the very last step" behaviour. In general, I would expect E <- (E 'x') / 'x' to be equivalent to E <- x* (Pika uses this equivalency in reverse). It seems in your example, we can collapse A to A <- (A 'x' !'y') / 'x', similar to E, and I would expect the behaviour not to change due to collapsing rules.

I do have a draft for the "top-down clauses, bottom-up references" matching setup now. That still has the nice, simple rules of Pika's matches – most notably there aren't any arcane loops as in Medeiros or Warth. What is giving me a headache is compressing the canMatchZeroChars, Terminal and seedParentClauses properties from clauses to their rules. I'm not 100% sure if that will be possible without losing the simplicity again.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/lukehutch/pikaparser/issues/32#issuecomment-871414310, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAGGCKMB3VOKBS5HS32QGB3TVMNGFANCNFSM456PX5YQ .

cal101 commented 2 years ago

Dear Luke Hutchinson, dear Max Fischer!

Today I stumbled on the Pika paper and it made an interesting and impressive read.

Because you made the source code available here (many thanks to that!) I found this discussion. I can't say that I was able to completely follow the arguments but it is an interesting journey!

I am ignoring the left recursion part for now because I think the semantics seem not be settled down so far.

But for the PEG part the semantics are well defined and I think I see a possible source for confusion about how to update the memo table correctly.

If I miss the point it would be nice to correct me. My argumentation is on intuition level because I don't feel ready for code and grammar theory level yet.

For me Bottom-Up parsing intuively is like a childrens game where differently shaped pieces fell into matching holes. For grammars multiple holes of the same shape exists and the holes are interconnected. That results in mismatches because we don't know if the 'f' will be part of a keyword 'if' or of the variable named 'flag'.

When adding more and more pieces some former matches turn up to be wrong because they don't fit into the bigger picture of the world we have.

BUT and this is my main point: This does not invalidate the former valid matches of the smaller view of the world.

So conceptually the meta table IMHO should be considered IMMUTABLE. We are just adding facts that improve our view of the world and as far as I understand PEG parsers there will be only one parse left when we enter the top and the grammar matches the input.

Having a mutable meta table is OK if that improves performance cpu or memory wise as long as it preserves the behavior.

Matches are propagated up to parent rules and this sometimes multiple levels wide in case of recursion or of eplison matches. But I think the parent matches can become false matches later while the lower match stays to be valid and can probably later be used successully to complete a longer match.

This kind of nesting and back tracking in the view of "bigger picture" seems to be missing from the discussion. "Longer matching" or "best matching" (I don't understand what this shall mean) and a single (Clause, Position) entry in the memo table seems not be able to handle the global situation locally currently.

I think a stacking/nesting is missing which seems to be provided implicitly by the recursive methods calls in a recursice descendant parser. A rule must only see parsing results that are further down the call stack. Reducing the caching level from "clause" to "rule" will in my opinion reduce the problems but it does no solve the underlying cause of wrongly changing the facts aka the parse results of parsers further down. Those are conceptually immutable and don't change.

Some way to change that that comes to mind is using nested contexts that are checked for the nearest definition of (Clause, Pos). There are of cource more ways to solve this.

Hope that makes sense and helps!

Best regards, Cal

lukehutch commented 2 years ago

Hi @cal101,

Thanks for the thought you put into this.

If I understand your question correctly, you are saying that the table should not be updated.

I think you're missing one point -- the memos in the memo table are connected together in a tree structure, so the current parse tree note stored in any entry of the memo table is actually a pointer to a subtree of (bottom-up) partially-parsed input. When an entry in the memo table is overwritten, it's only ever overwritten because the parser has gone once around a loop in the grammar, and that means that the previous value stored at that position in the memo table is already linked into the new parse tree fragment that is newly stored in the same memo table entry. In other words, the old parse tree fragment that was stored in the table is actually already linked in to the new parse tree fragment that is written into the memo table entry, specifically the old fragment is a subtree of the new fragment. (This is ensured by the condition that memo table entries are only ever overwritten if they match at least one additional character.)

Does that make sense? It's confusing because the whole process of recursive descent is inverted, run backwards, and made iterative...

maxfischer2781 commented 2 years ago

@lukehutch just wanted to say Thanks for helping me (and our new visitor) along. I think you've cleared up my confusions – feel free to close the issue.

lukehutch commented 2 years ago

You're welcome -- closing as requested. Thanks for the question!