How is the python.lark rule "class" name parsed due to whitespace?

lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.

MIT License

4.64k stars 397 forks source link

How is the python.lark rule "class" name parsed due to whitespace? #1349

Closed RevanthRameshkumar closed 9 months ago

RevanthRameshkumar commented 9 months ago

the python.lark file contains this line: %ignore /[\t \f]+/ // WS but if this is the case, then for the class definition rule which has "class" name, wouldn't class FooBar and classFooBar both be parsed as a class definition? (the second one being incorrect)

MegaIng commented 9 months ago

No, the lexer realizes that that the keyword class is fully contained within the Terminal IDENTIFIER, and when it sees an IDENTIFIER-type token with value class it instead emits a CLASS-type Token. So at the start of that line it expects some Token, lexes IDENTIFIER-type token with value classFooBar and doesn't emit a CLASS-type Token since it doesn't exactly match the string class.

RevanthRameshkumar commented 9 months ago

where is IDENTIFIER defined? I don't see it in python.lark

MegaIng commented 9 months ago

Aha, it's actually called NAME in the python grammar.

RevanthRameshkumar commented 9 months ago

Ok, so if you have 2 terminals A and B, and A is always a substring of B (starting at index 0), then the lexer will choose A over B for a string that fits A as long as the string is isolated by whitespace of any sort? So if

d: /.*/
c: A /e*/
A: "abc"
B: "abcd"

then "abc eeee" should be parsed as c with terminals A and anon for the e*, and "abcdeeee" should be parsed as d?

MegaIng commented 9 months ago

This is only reliably implemented for literal strings, using regex syntax, case-insensitive flags or maybe even explicit terminal names might not work correctly.
The /.*/ will consume everything, meaning it matches "abc eeee", which doesn't fit any of the literals (if they were proper literals).

RevanthRameshkumar commented 9 months ago

Ah gotcha, so the subset terminal must be a literal string...it cannot be a regex subset.

Ok, so this works as expected.

parser = Lark(r"""

%ignore /[\t \f]+/  // WS
start: d|c
d: B
c: A B
A: "abc"
B: /[^\W\d]\w*/

""", parser="lalr")
parser.parse('abc eee')

And this fails as expected due to lexer confusion:

parser = Lark(r"""
%ignore /[\t \f]+/  // WS

start: d|c
d: B
c: A B
A: /abc.*/
B: /[^\W\d]\w*/
""", parser="lalr")
print(parser.parse('abc eee'))

because it is parsing "abc" as B, and it has no where to go after that.

MegaIng commented 9 months ago

You almost never want to have .* in any regex in your grammar, that is almost surely going to cause problems. Not that regex subset checking is something we could add (we already have a library that can do), but it would be a bit computation expensive, and we haven't really found any IRL situations where it would be that beneficial.

RevanthRameshkumar commented 9 months ago

isn't /[^\W\d]\w*/ basically as expensive as . due to the ? I realize [^\W\d]\w is a subset of ., but doesn't the * make it as computationally expensive?

MegaIng commented 9 months ago

Expensive in terms of computations is basically completely irrelevant inside regex (unless you are using earley with dynamic complete) when you want to match stuff, regex intersection matching is expensive since we can't use the stdlib library for that. The point is that it will cause problems for you, the grammar author, when it matches stuff you didn't want it to match. You almost never want a token that matches "everything to the end of the file", which is what .* does. \w* is way more explicit in what it is supposed to match.

RevanthRameshkumar commented 9 months ago

Ah I see what you mean. Thanks! The question was thoroughly answered so I am going to close the thread. Hope others reading will also have a better understanding of this behaviour :) My goal is to avoid the "abc(?=\s)" usage when defining keywords and I think I can do that now with this.