lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
MIT License
Enforcing whitespace around keywords only? #1297

Closed by RevanthRameshkumar 1 year ago

RevanthRameshkumar commented 1 year ago

What is your question?

How do I ignore whitespace in general but still require it around keywords? If I use the ignore-whitespace directive (%ignore), both of the examples below will parse:

int x = 1 ;
intx = 1 ;

But I only want the first to parse.

erezsh commented 1 year ago

Probably the easiest solution is to enforce this with a regex lookahead.

RevanthRameshkumar commented 1 year ago

Ok, so something like this?

%ignore WS_INLINE
%import common.WS_INLINE

start: exists_decl
variable: /z\d+/
EXISTS: /exists(?=\s)/
exists_decl: EXISTS variable

That seems to do the job for me since existsz2 fails but exists z2 succeeds. And to clarify, this works because the lexer will handle all regexes before applying the ignores? I didn't know that was the case.

erezsh commented 1 year ago

It's simply that the lexer reads tokens one by one, whether ignored or not. If it can match "exists" and then "z", it doesn't care if there is an ignored token in between or not.

Funnily enough, the basic lexer behaves the way you'd expect, requiring a space between keywords and names. The contextual lexer (the default) is too smart for its own good: it knows that "existsz2" isn't possible, so it parses it as two tokens. (Switching to the basic lexer isn't recommended, though.)

RevanthRameshkumar commented 1 year ago

Gotcha. In that case, is there a reason to use a lookahead rather than something like EXISTS: "exists"i " "+ ? Is it just that the lookahead is more concise stylistically?

erezsh commented 1 year ago

This isn't about style, it's about practice. If you put " "+, the parser's lookahead is going to see whitespace instead of the token that follows (since our LALR implementation only has a lookahead of 1). That would seriously hinder the parser's analysis.
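At the regex level, the difference between the two spellings can be sketched with Python's standard re module, independent of Lark. The names LOOKAHEAD, CONSUMING, and matched_text are hypothetical, and the snippet only illustrates what text each pattern would capture, not how the parser reacts to it.

```python
import re

# Whitespace is required after the keyword but is NOT part of the match.
LOOKAHEAD = re.compile(r"exists(?=\s)")
# Whitespace is required after the keyword and IS part of the match.
CONSUMING = re.compile(r"exists +")


def matched_text(pattern, text):
    """Return the text matched at the start of `text`, or None."""
    m = pattern.match(text)
    return m.group(0) if m else None


if __name__ == "__main__":
    for text in ("exists z2", "existsz2"):
        print(repr(text),
              repr(matched_text(LOOKAHEAD, text)),
              repr(matched_text(CONSUMING, text)))
```

Both patterns reject "existsz2", but only the lookahead version leaves the following whitespace unconsumed, so the keyword token stays exactly "exists".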

RevanthRameshkumar commented 1 year ago

Thanks, that makes sense! I actually just got a v1 of my grammar totally working now. Thanks for your help :)