lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
MIT License
4.77k stars 406 forks source link

Exception rules #434

Closed eddieschoute closed 5 years ago

eddieschoute commented 5 years ago

As part of language specifications, keywords are frequently defined as illegal identifiers. For example, Python does not allow certain keywords as identifiers (https://docs.python.org/3/reference/lexical_analysis.html?highlight=keywords#keywords). In the example Python 3 grammar if Lark, these keywords seems to be accepted as names. Should this be rejected at the parser level or after parsing the grammar?

Ambiguity from keywords

My problem is a bit more specific. Say I hardcode a function foo to have a certain meaning in my language. Now I not only wish to forbid any function declarations of foo, but I also wish to parse calls foo() to parse to a foo node (and not a general function call node). However, at the grammar level this currently creates an ambiguity: A line foo() can parse to a (general) function call and a keyword node. The obvious solution to solve this at the parser level, to me, is to disallow NAME or ID terminals from being keywords. Is there a way to do this? Surely it is possible using regular expression but that would result in somewhat hard to read code.

From wikipedia on EBNF (https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form#Table_of_symbols) we see that - rule gives an exception, allowing the grammar to reject specific rules.

For my case I could consider solving this after the parsing stage. For example, by checking if function calls of keywords are in some of the nodes and then checking that their structure is as expected (it is more restrictive than for general functions).

erezsh commented 5 years ago

Lark knows to give priority to keywords, even when they collide with regexps.

So give NAME: /\w+/ and WHILE: "while", the lexer will always match WHILE when possible.

The obvious solution to solve this at the parser level, to me, is to disallow NAME or ID terminals from being keywords.

So, in a way, this happens automatically in parser="lalr". When using parser="earley" (the default), it tries both, but still give priority to the keyword.

Does that answer your question?

eddieschoute commented 5 years ago

How is this priority decided? Say I have

exp: "foo" "(" ")"
    | NAME "(" ")"

NAME: /\w+/

as a grammar. Which branch would be taken for foo()? Do I need to define the first rule as FOO "(" ")" with an additional terminal of FOO: "foo" to nudge the parser in the right direction?

Edit: Or would "foo" be converted to a FOO terminal by the tokenizer before the parser sees it?

eddieschoute commented 5 years ago

I think I see now how this solves that issue: See the priority rules at https://lark-parser.readthedocs.io/en/stable/grammar/#terminals Then if these keywords are tokenized to specialised terminals, they cannot be used for general ID terminals (and general function calls)!

eddieschoute commented 5 years ago

I just tried this with the rules

FOO: "foo"
ID: /[a-z][A-Za-z0-9]*/

and, still, the text foo seems to get tokenized to a Token(ID, "foo"). And from there my other rules go wrong. So perhaps the priority rules don't get applied properly?

erezsh commented 5 years ago

This is very simple to test

g="""
!start: "foo" "(" ")"
    | NAME "(" ")"

NAME: /\w+/
"""

from lark import Lark
g=Lark(g)
print(g.parse("foo()")

Returns Tree(start, [Token(FOO, 'foo'), Token(LPAR, '('), Token(RPAR, ')')])

But do make sure you're using the latest master, just this week we finished a patch to Earley which solves a bug that might affect this result.

eddieschoute commented 5 years ago

Aha, so it was that issue. I was able to reproduce nondeterministic behavior with the latest pypi version. Updating to the master branch fixed it.

erezsh commented 5 years ago

It seems that we fixed most (if not all?) nondeterministic behavior, so that's good news. I'll probably make an official release soon. If you do find any remaining nondeterminism, be kind to let us know!