lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
MIT License

UnexpectedToken exception if terminal newline ignored #376

Closed (reidpr closed this issue 5 years ago)

reidpr commented 5 years ago

I have the following MWE:

$ cat lark_bug.py
import lark

GRAMMAR = r"""
?start: foo+

foo: "FOO"i / [A-Za-z0-9._-]+/ NEWLINE

NEWLINE: "\n"
%ignore NEWLINE  // comment out to avoid crash
"""

TEXT = "FOO bar\n"

parser = lark.Lark(GRAMMAR, parser="lalr", propagate_positions=True)
tree = parser.parse(TEXT)

Actual behavior:

$ pip3 freeze | fgrep lark
lark-parser==0.7.1
$ python3 --version
Python 3.5.3
$ python3 bin/lark_bug.py
Traceback (most recent call last):
  File "/home/reidpr/.local/lib/python3.5/site-packages/lark/parsers/lalr_parser.py", line 59, in get_action
    return states[state][token.type]
KeyError: '$END'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "bin/lark_bug.py", line 15, in <module>
    tree = parser.parse(TEXT)
  File "/home/reidpr/.local/lib/python3.5/site-packages/lark/lark.py", line 292, in parse
    return self.parser.parse(text)
  File "/home/reidpr/.local/lib/python3.5/site-packages/lark/parser_frontends.py", line 79, in parse
    return self.parser.parse(token_stream, *[sps] if sps is not NotImplemented else [])
  File "/home/reidpr/.local/lib/python3.5/site-packages/lark/parsers/lalr_parser.py", line 36, in parse
    return self.parser.parse(*args)
  File "/home/reidpr/.local/lib/python3.5/site-packages/lark/parsers/lalr_parser.py", line 96, in parse
    _action, arg = get_action(token)
  File "/home/reidpr/.local/lib/python3.5/site-packages/lark/parsers/lalr_parser.py", line 62, in get_action
    raise UnexpectedToken(token, expected, state=state)
lark.exceptions.UnexpectedToken: Unexpected token Token($END, '') at line 1, column 4.
Expected one of:
    * NEWLINE

Expected behavior: No exception; parse tree does not have NEWLINE tokens in it.

Am I doing something wrong? Is this a bug in Lark? What other information can I provide?

This does not happen with the Earley parser. However, I am using the LALR(1) parser because Earley is giving me some nondeterministic behavior in my actual application.

Thank you for your hard work on Lark. It is very pleasant to work with.

erezsh commented 5 years ago

In LALR, when you %ignore a terminal, the lexer drops it, so it never reaches the parser.

Earley knows to try it both ways, which is why it works.

I think what you wanted to do is the following:

?start: foo+

foo: "FOO"i / [A-Za-z0-9._-]+/ _NEWLINE

_NEWLINE: "\n"    // underscored terminals are automatically removed from the tree

reidpr commented 5 years ago

OK, thank you. Is that the same reason why the following grammar:

?start: foo+

foo: "FOO"i SPACE /[A-Za-z0-9._-]+/ _NEWLINE

SPACE: " "
%ignore SPACE
_NEWLINE: "\n"

gives:

lark.exceptions.UnexpectedCharacters: No terminal defined for 'b' at line 1 col 5

FOO bar
    ^

Expecting: {'SPACE'}

This also works if I underscore SPACE instead of applying %ignore to it.

erezsh commented 5 years ago

Yes, same reason.

It's probably better to ignore whitespace, and then just not write it in the grammar.

But if you have to control for whitespace, don't ignore it.

reidpr commented 5 years ago

I think maybe I did not understand %ignore correctly. So if I say (as in the JSON tutorial):

%import common.WS
%ignore WS

This means that any WS terminals that appear anywhere are just ignored, and need not be specified in the grammar (and thus WS terminals are accepted anywhere). On the other hand, the underscore prefix says to match the terminal but remove it from the tree after the tree is constructed.

Is that correct?

Does %ignore affect positions recorded by propagate_positions=True?

erezsh commented 5 years ago

Yes, that is correct.

Positions should always be correct, regardless of %ignore or otherwise.

reidpr commented 5 years ago

Thanks so much. That's all extremely helpful and clarifies perfectly.

erezsh commented 5 years ago

Good!