"|" and "||" clashes with priorities

alcides commented 4 years ago

Describe the bug

Using "|" and "||" as terminals in Lark rules works well when each is used on its own.

However, when "|" is defined as a TOKEN, a later "||" in the same rule stops working. I do not believe it is a problem with my grammar, since it also occurs in the very simple hello-world grammar below.

To Reproduce

The following code:

from lark import Lark

l_ok1 = Lark('''start: "|" WORD "," WORD "!"

            %import common.WORD   // imports from terminal library
            %ignore " "           // Disregard spaces in text
         ''',
        parser='lalr',
        lexer='standard')

l_ok2 = Lark('''start:  _PIPE WORD "," WORD "!"
            _PIPE.15 : "|" | "where"
            %import common.WORD   // imports from terminal library
            %ignore " "           // Disregard spaces in text
         ''',
        parser='lalr',
        lexer='standard')

l_ok3 = Lark('''start: WORD "||" WORD "!"

            %import common.WORD   // imports from terminal library
            %ignore " "           // Disregard spaces in text
         ''',
        parser='lalr',
        lexer='standard')

l_ok4 = Lark('''start: "|" WORD "||" WORD "!"

            %import common.WORD   // imports from terminal library
            %ignore " "           // Disregard spaces in text
         ''',
        parser='lalr',
        lexer='standard')

l_nok = Lark('''start: _PIPE WORD "||" WORD "!"

            _PIPE.15 : "|" | "where"
            %import common.WORD   // imports from terminal library
            %ignore " "           // Disregard spaces in text
         ''',
        parser='lalr',
        lexer='standard')

print( l_ok1.parse("| Hello, World!") )
print( l_ok2.parse("| Hello, World!") )
print( l_ok3.parse("Hello || World!") )
print( l_ok4.parse("| Hello || World!") )
print( l_nok.parse("| Hello || World!") )

Returns this output:

Tree(start, [Token(WORD, 'Hello'), Token(WORD, 'World')])
Tree(start, [Token(WORD, 'Hello'), Token(WORD, 'World')])
Tree(start, [Token(WORD, 'Hello'), Token(WORD, 'World')])
Tree(start, [Token(WORD, 'Hello'), Token(WORD, 'World')])
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/lark/parsers/lalr_parser.py", line 61, in get_action
    return states[state][token.type]
KeyError: '_PIPE'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "lark_bug.py", line 50, in <module>
    print( l_nok.parse("| Hello || World!") )
  File "/usr/local/lib/python3.7/site-packages/lark/lark.py", line 333, in parse
    return self.parser.parse(text, start=start)
  File "/usr/local/lib/python3.7/site-packages/lark/parser_frontends.py", line 88, in parse
    return self._parse(token_stream, start)
  File "/usr/local/lib/python3.7/site-packages/lark/parser_frontends.py", line 54, in _parse
    return self.parser.parse(input, start, *args)
  File "/usr/local/lib/python3.7/site-packages/lark/parsers/lalr_parser.py", line 35, in parse
    return self.parser.parse(*args)
  File "/usr/local/lib/python3.7/site-packages/lark/parsers/lalr_parser.py", line 85, in parse
    action, arg = get_action(token)
  File "/usr/local/lib/python3.7/site-packages/lark/parsers/lalr_parser.py", line 64, in get_action
    raise UnexpectedToken(token, expected, state=state)
lark.exceptions.UnexpectedToken: Unexpected token Token(_PIPE, '|') at line 1, column 9.
Expected one of: 
    * __ANON_0

MegaIng commented 4 years ago

You are giving _PIPE a high priority, so || is parsed as _PIPE _PIPE. That is intended behavior. Why is this a problem?
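To make the ordering concrete, here is a toy tokenizer written for illustration only. It is not Lark's implementation: the terminal table, the `DOUBLE_PIPE` name (Lark calls it `__ANON_0` in the traceback above), and both ranking functions are assumptions. It compares "priority first, then length" with "length first, then priority" on the failing input.

```python
import re

# Hypothetical terminal table: (name, regex, priority), loosely mirroring l_nok.
TERMINALS = [
    ("_PIPE", re.compile(r"\||where"), 15),
    ("DOUBLE_PIPE", re.compile(r"\|\|"), 0),
    ("WORD", re.compile(r"[A-Za-z]+"), 0),
    ("BANG", re.compile(r"!"), 0),
]


def priority_first(candidate):
    """Rank by priority, and use match length only to break ties."""
    _name, value, priority = candidate
    return (priority, len(value))


def length_first(candidate):
    """Rank by match length, and use priority only to break ties."""
    _name, value, priority = candidate
    return (len(value), priority)


def tokenize(text, rank):
    """Split text into (name, value) pairs, choosing the best-ranked match."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos] == " ":  # stands in for %ignore " "
            pos += 1
            continue
        candidates = []
        for name, regex, priority in TERMINALS:
            match = regex.match(text, pos)
            if match:
                candidates.append((name, match.group(), priority))
        if not candidates:
            raise ValueError(f"no terminal matches at position {pos}")
        name, value, _priority = max(candidates, key=rank)
        tokens.append((name, value))
        pos += len(value)
    return tokens


by_priority = tokenize("| Hello || World!", priority_first)
by_length = tokenize("| Hello || World!", length_first)
print([name for name, _ in by_priority])
print([name for name, _ in by_length])
```

With `priority_first`, the "||" comes out as two `_PIPE` tokens, which is the token the parser rejects in the traceback. With `length_first`, it comes out as a single `DOUBLE_PIPE`.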

Also, not using lexer='standard' fixes this problem.

alcides commented 4 years ago

I believed disambiguation went by the longest match first and only then by priority.

I wanted the _PIPE terminal so that "where" has higher priority than WORD, but I wanted "||" to have higher priority than "|" in that particular context.

I have fixed my problem, but maybe this behaviour should be made clearer in the documentation. Most other parsing tools prioritize length over relative priority.

MegaIng commented 4 years ago

Yes, Lark does that too. You have to take three extra steps to convince Lark to not parse this correctly:

Use a TOKEN instead of a literal
Give that TOKEN a high priority
Use the standard lexer instead of the default contextual one, which is almost never a good idea (slower performance, and these kinds of problems)

Why are you using the standard lexer?

alcides commented 4 years ago

No reason at all. Probably because I thought it was the default (standard has a very defaulty name).

lark-parser / lark

"|" and "||" clashes with priorities #582