Closed RevanthRameshkumar closed 9 months ago
No, the lexer realizes that that the keyword class
is fully contained within the Terminal IDENTIFIER
, and when it sees an IDENTIFIER
-type token with value class
it instead emits a CLASS
-type Token. So at the start of that line it expects some Token, lexes IDENTIFIER
-type token with value classFooBar
and doesn't emit a CLASS
-type Token since it doesn't exactly match the string class
.
where is IDENTIFIER
defined? I don't see it in python.lark
Aha, it's actually called NAME
in the python grammar.
Ok, so if you have 2 terminals A and B, and A is always a substring of B (starting at index 0), then the lexer will choose A over B for a string that fits A as long as the string is isolated by whitespace of any sort? So if
d: /.*/
c: A /e*/
A: "abc"
B: "abcd"
then "abc eeee" should be parsed as c
with terminals A
and anon
for the e*, and "abcdeeee" should be parsed as d
?
/.*/
will consume everything, meaning it matches "abc eeee"
, which doesn't fit any of the literals (if they were proper literals).Ah gotcha, so the subset terminal must be a literal string...it cannot be a regex subset.
Ok, so this works as expected.
parser = Lark(r"""
%ignore /[\t \f]+/ // WS
start: d|c
d: B
c: A B
A: "abc"
B: /[^\W\d]\w*/
""", parser="lalr")
parser.parse('abc eee')
And this fails as expected due to lexer confusion:
parser = Lark(r"""
%ignore /[\t \f]+/ // WS
start: d|c
d: B
c: A B
A: /abc.*/
B: /[^\W\d]\w*/
""", parser="lalr")
print(parser.parse('abc eee'))
because it is parsing "abc" as B
, and it has no where to go after that.
You almost never want to have .*
in any regex in your grammar, that is almost surely going to cause problems. Not that regex subset checking is something we could add (we already have a library that can do), but it would be a bit computation expensive, and we haven't really found any IRL situations where it would be that beneficial.
isn't /[^\W\d]\w*/
basically as expensive as . due to the ? I realize [^\W\d]\w is a subset of .
, but doesn't the * make it as computationally expensive?
Expensive in terms of computations is basically completely irrelevant inside regex (unless you are using earley with dynamic complete) when you want to match stuff, regex intersection matching is expensive since we can't use the stdlib library for that. The point is that it will cause problems for you, the grammar author, when it matches stuff you didn't want it to match. You almost never want a token that matches "everything to the end of the file", which is what .*
does. \w*
is way more explicit in what it is supposed to match.
Ah I see what you mean. Thanks! The question was thoroughly answered so I am going to close the thread. Hope others reading will also have a better understanding of this behaviour :) My goal is to avoid the "abc(?=\s)" usage when defining keywords and I think I can do that now with this.
the python.lark file contains this line:
%ignore /[\t \f]+/ // WS
but if this is the case, then for the class definition rule which has"class" name
, wouldn'tclass FooBar
andclassFooBar
both be parsed as a class definition? (the second one being incorrect)