lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
MIT License
4.75k stars 401 forks source link

UnexpectedToken when there should be no error #1284

Closed aha79 closed 1 year ago

aha79 commented 1 year ago

Describe the bug

Identical grammar gives "UnexpectedToken" error.

To Reproduce

Consider the following code

    import lark

    grammar = r"""
    main: char*
        | "[" atom* "]"

    atom: SETCHAR
        | null_escape

    char: PATTCHAR
        | null_escape

    null_escape: "\\" "0"
    null_escape2: "\\" "0"

    SETCHAR: /[^\-\]\\]/                         // all except for  "]"   "-"  or "\" 
    PATTCHAR: /[^\^\$\\\.\+\?\*\(\)\[\]\{\}\|]/
    """
    parser = lark.Lark(grammar, start='main', parser="lalr")
    parser.parse("[\\0a]")

Raises error

lark.exceptions.UnexpectedToken: Unexpected token Token('PATTCHAR', 'a') at line 1, column 4.
Expected one of: 
    * SETCHAR
    * BACKSLASH
    * RSQB

However when I change to

    char: PATTCHAR
        | null_escape2   // note the "2" differently named but identical rule otherwise

no exception is generated.

Also when I omit the star in first rule, i.e. main: char no error is generated. However this changes the grammar of course.

I have no idea why this is the case, but I suspect this must be some kind of bug.

I use version 1.1.5.

erezsh commented 1 year ago

This is probably happening just because of the default priority of terminals adjusting slightly due to different hashes in each run. (and maybe with the additional complication of interacting with shift/reduce collisions)

The real problem seems to me that SETCHAR and PATTCHAR can match the same character, and there is no way for LALR to know which one to choose. I recommend that you either set their priorities explicitly (like SETCHAR.10), or use the Earley parser.

aha79 commented 1 year ago

Thank you for your prompt response.

I was (perhaps incorrectly) assuming that inside the "[" atom* "]" only the SETCHAR terminal is expected and hence only this token generated by the lexer.

I solved it this way

    XCHAR: /[^\^\$\\\.\+\?\*\(\)\[\]\{\}\|\-]/

    pattchar: XCHAR
        | "-"
    setchar: XCHAR
        | /[\^\$\.\+\?\*\(\)\[\{\}\|]/

which is perfectly fine for me.

erezsh commented 1 year ago

inside the "[" atom* "]" only the SETCHAR terminal is expected

Lark can do that to some degree with LALR, but not perfectly. I think that they're both included because they are both in the FOLLOW set of null_escape.