lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
MIT License

Provide custom LOOKAHEAD on LALR grammar #1247

Open fabioz opened 1 year ago

fabioz commented 1 year ago

By default, the LALR parser has only a single token of lookahead, but it would be really nice if it could use a custom lookahead in specific cases (I got used to JavaCC, which implements this with something like a LOOKAHEAD(2) in the proper place to avoid the restriction).

The use case I have is below. From what I can see, the rule ?identifier: NAME (WS NAME|WS NAME_CONT)* sees the WS and takes that route, but it can't see that the whole group is optional and that it should stop matching there. In JavaCC I'd put a LOOKAHEAD(2) there: it would try to match the whole group, and if only the first token matched but not the second, that would be OK.

p.s.: although earley works for this particular construct, it doesn't work for the full grammar I'm working on, so using it isn't really a solution...
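
To make the lookahead point concrete, here is a small sketch in plain Python. It is not Lark's or JavaCC's API: `read_identifier`, `TOKENS` and the token names are made up for illustration, and the token list is a hand-written approximation of how `My name    param` would be lexed (one WS per space). With one token of lookahead the loop commits to the optional group as soon as it sees a WS; with two it also checks what follows the WS before committing.

```python
# Hand-written token types for "My name    param" (assumption: one WS per space).
TOKENS = ["NAME", "WS", "NAME", "WS", "WS", "WS", "WS", "NAME"]


def read_identifier(tokens, k):
    """Match NAME (WS NAME)* at the start of tokens using k tokens of lookahead.

    Returns the number of tokens that belong to the identifier.
    """
    if not tokens or tokens[0] != "NAME":
        raise SyntaxError("identifier must start with NAME")
    pos = 1
    while pos < len(tokens) and tokens[pos] == "WS":
        if k >= 2:
            # Peek past the WS: only continue if a NAME really follows.
            if pos + 1 >= len(tokens) or tokens[pos + 1] != "NAME":
                break
        # Commit to the optional group (WS NAME).
        pos += 1
        if pos >= len(tokens) or tokens[pos] != "NAME":
            raise SyntaxError(f"expected NAME at token index {pos}")
        pos += 1
    return pos


if __name__ == "__main__":
    # With k=2 the identifier stops after NAME WS NAME.
    print(read_identifier(TOKENS, k=2))
    try:
        # With k=1 the loop commits on the first separator space and fails.
        read_identifier(TOKENS, k=1)
    except SyntaxError as exc:
        print("k=1 fails:", exc)
```

In this toy version the k=1 failure happens on the second of the four separator spaces, which is the same spot the caret points at in the error below.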

Error

My name    param 1 passed
        ^
Expected one of: 
    * NAME_CONT
    * _NEWLINE
    * NAME

Previous tokens: Token('WS', ' ')

Sample code

from lark.indenter import Indenter
from lark import Lark

class PythonIndenter(Indenter):
    NL_type = "_NEWLINE"
    OPEN_PAREN_types = ["LPAR", "LSQB", "LBRACE"]
    CLOSE_PAREN_types = ["RPAR", "RSQB", "RBRACE"]
    INDENT_type = "_INDENT"
    DEDENT_type = "_DEDENT"
    tab_len = 8

lark_spec = Lark(
    r"""
file_input: (_NEWLINE | root_stmt)*
?root_stmt: func_block

func_block:  BLOCK WS* "Function" WS* BLOCK WS* _NEWLINE (func_stmt)*

// i.e.: at least 2 spaces so that we have "Function name    arguments"
func_stmt: identifier WS WS+ parameters? func_suite

parameters: param ("," WS* param)* ("," WS*)?
param: param_name ["=" WS* param_default]
param_name: identifier
param_default: identifier

func_suite: _NEWLINE (_INDENT stmt+ _DEDENT)?

?identifier: NAME (WS NAME|WS NAME_CONT)*
?stmt: identifier _NEWLINE

NAME: /(?!(OR|AND|IN)\b)\b[^\d\W]\w*/
NAME_CONT: /(?!(OR|AND|IN)\b)\b\w+/
BLOCK: /\*\*\* */
WS: /[ ]/
_NEWLINE: ( /\r?\n[ ]*/ | COMMENT )+
COMMENT: /#[^\n]*/

%declare _INDENT _DEDENT
    """,
    parser="lalr",
    lexer="contextual",
    postlex=PythonIndenter(),
    start="file_input",
    keep_all_tokens=True,
    propagate_positions=True,
    debug=True,
)

if __name__ == "__main__":
    lark_spec.parse(
        """
*** Function ***
My name    param 1 passed
    Pass
""",
    )

erezsh commented 1 year ago

I agree, a custom lookahead, aka LALR(k), would be a really nice feature. And a very difficult one to implement correctly.

MegaIng commented 1 year ago

Why are you explicitly putting down WS? Since that is ignored anyway, it has no purpose here.

fabioz commented 1 year ago

> Why are you explicitly putting down WS? Since that is ignored anyway, it has no purpose here.

Hmm... I probably don't understand it well enough then. Why is it ignored? Is there a way to not ignore it? In this particular grammar I'd like to have 2 spaces as a separator. Is that not possible?

i.e., the line below would be valid (as an identifier can contain spaces):

Function name Function arg 1, Function arg 2
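
One way to get a two-space separator without extra parser lookahead is to make the decision in the lexer: a run of two or more spaces becomes its own token, while a single space stays an ordinary WS. The sketch below shows the idea with the standard `re` module only. `tokenize`, `split_cells`, `SEP` and `WORD` are hypothetical names, not part of Lark or of the grammar above, and whether the same split works as two Lark terminals under the contextual lexer has not been checked here.

```python
import re

# SEP is listed first so that a run of 2+ spaces is never lexed as single WS tokens.
TOKEN_RE = re.compile(r"(?P<SEP> {2,})|(?P<WS> )|(?P<WORD>\w+)|(?P<COMMA>,)")


def tokenize(line):
    """Split a line into (kind, text) pairs, treating 2+ spaces as one SEP token."""
    tokens = []
    pos = 0
    while pos < len(line):
        match = TOKEN_RE.match(line, pos)
        if match is None:
            raise ValueError(f"unexpected character {line[pos]!r} at column {pos}")
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


def split_cells(line):
    """Group the text between SEP tokens, so identifiers keep their single spaces."""
    cells, current = [], []
    for kind, text in tokenize(line):
        if kind == "SEP":
            cells.append("".join(current))
            current = []
        else:
            current.append(text)
    cells.append("".join(current))
    return cells


if __name__ == "__main__":
    print(tokenize("My name    param 1 passed"))
    print(split_cells("My name    param 1 passed"))
```

Because the separator is already a distinct token, a parser consuming this stream never has to guess whether a space continues the identifier or ends it.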

MegaIng commented 1 year ago

Oh lol, I thought you had an %ignore statement in there since you were using the PythonIndenter. That one might break if you aren't ignoring inline WS: