lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
MIT License

Add "_" as a valid name for Python grammar in non-match contexts #1282

Closed dmelcer9 closed 1 year ago

dmelcer9 commented 1 year ago

Previously, the example Python grammar wouldn't parse text like:

for _ in foo:
  pass

erezsh commented 1 year ago

>>> from python_parser import python_parser3
>>> code = r"""for _ in foo:
...   pass
... """
>>> import rich
>>> rich.print(python_parser3.parse(code))
file_input
└── for_stmt
    ├── var
    │   └── name
    │       └── _
    ├── var
    │   └── name
    │       └── foo
    ├── suite
    │   └── pass_stmt
    └── None

dmelcer9 commented 1 year ago

Very weird. I'm going to look at the code when I get in and work out a more complete example of where it goes wrong.

erezsh commented 1 year ago

Okay. I'm closing the issue for now, but feel free to re-open if you think it's worthwhile.

dmelcer9 commented 1 year ago

The issue appears to be in how the Earley parser specifically interacts with the grammar:

from lark import Lark
from lark.indenter import PythonIndenter

l = Lark.open_from_package("lark", "python.lark", ['grammars'], parser="earley", postlex=PythonIndenter(), start="file_input")
print(l.parse("for _ in foo:\n  pass\n"))

Results in:

lark.exceptions.UnexpectedToken: Unexpected token Token('UNDERSCORE', '_') at line 1, column 5.
Expected one of: 
    * TRUE
    * BIN_NUMBER
    * TILDE
    * __ANON_24
    * NONE
    * STRING
    * MATCH
    * NAME
    * FALSE
    * OCT_NUMBER
    * PLUS
    * LPAR
    * FLOAT_NUMBER
    * CASE
    * DEC_NUMBER
    * MINUS
    * LBRACE
    * LSQB
    * STAR
    * IMAG_NUMBER
    * HEX_NUMBER
    * LONG_STRING
    * AWAIT

dmelcer9 commented 1 year ago

@erezsh I am unable to re-open the PR from GitHub's interface for some reason.

MegaIng commented 1 year ago

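The problem is lexer="basic", which gets chosen automatically when you use postlex with Earley. The basic lexer tokenizes without any parser context, so "_" always comes out as an UNDERSCORE token, even where the grammar expects NAME. The Python grammar is designed for parser="lalr", lexer="contextual".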

erezsh commented 1 year ago

@MegaIng I think you're right.

However, @dmelcer9, your solution won't fix it, since there are other uses of NAME throughout the grammar, which means that code with "as _" still won't work, etc.

Possibly the easiest solution would be to remove the line

       | "_" -> any_pattern

and then handle this case after the parse is done.

But I'm not sure it's worth it. Parsing Python with Earley isn't really a common use case. It's only in the examples to show that it's possible.