Closed eddieschoute closed 5 years ago
Lark knows to give priority to keywords, even when they collide with regexps.
So give NAME: /\w+/
and WHILE: "while"
, the lexer will always match WHILE when possible.
The obvious solution to solve this at the parser level, to me, is to disallow NAME or ID terminals from being keywords.
So, in a way, this happens automatically in parser="lalr"
. When using parser="earley"
(the default), it tries both, but still give priority to the keyword.
Does that answer your question?
How is this priority decided? Say I have
exp: "foo" "(" ")"
| NAME "(" ")"
NAME: /\w+/
as a grammar. Which branch would be taken for foo()
? Do I need to define the first rule as FOO "(" ")"
with an additional terminal of FOO: "foo"
to nudge the parser in the right direction?
Edit: Or would "foo" be converted to a FOO terminal by the tokenizer before the parser sees it?
I think I see now how this solves that issue: See the priority rules at https://lark-parser.readthedocs.io/en/stable/grammar/#terminals
Then if these keywords are tokenized to specialised terminals, they cannot be used for general ID
terminals (and general function calls)!
I just tried this with the rules
FOO: "foo"
ID: /[a-z][A-Za-z0-9]*/
and, still, the text foo
seems to get tokenized to a Token(ID, "foo")
. And from there my other rules go wrong. So perhaps the priority rules don't get applied properly?
This is very simple to test
g="""
!start: "foo" "(" ")"
| NAME "(" ")"
NAME: /\w+/
"""
from lark import Lark
g=Lark(g)
print(g.parse("foo()")
Returns Tree(start, [Token(FOO, 'foo'), Token(LPAR, '('), Token(RPAR, ')')])
But do make sure you're using the latest master, just this week we finished a patch to Earley which solves a bug that might affect this result.
Aha, so it was that issue. I was able to reproduce nondeterministic behavior with the latest pypi version. Updating to the master branch fixed it.
It seems that we fixed most (if not all?) nondeterministic behavior, so that's good news. I'll probably make an official release soon. If you do find any remaining nondeterminism, be kind to let us know!
As part of language specifications, keywords are frequently defined as illegal identifiers. For example, Python does not allow certain keywords as identifiers (https://docs.python.org/3/reference/lexical_analysis.html?highlight=keywords#keywords). In the example Python 3 grammar if Lark, these keywords seems to be accepted as names. Should this be rejected at the parser level or after parsing the grammar?
Ambiguity from keywords
My problem is a bit more specific. Say I hardcode a function
foo
to have a certain meaning in my language. Now I not only wish to forbid any function declarations offoo
, but I also wish to parse callsfoo()
to parse to afoo
node (and not a general function call node). However, at the grammar level this currently creates an ambiguity: A linefoo()
can parse to a (general) function call and a keyword node. The obvious solution to solve this at the parser level, to me, is to disallowNAME
orID
terminals from being keywords. Is there a way to do this? Surely it is possible using regular expression but that would result in somewhat hard to read code.From wikipedia on EBNF (https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form#Table_of_symbols) we see that
- rule
gives an exception, allowing the grammar to reject specific rules.For my case I could consider solving this after the parsing stage. For example, by checking if function calls of keywords are in some of the nodes and then checking that their structure is as expected (it is more restrictive than for general functions).