What is the best way to see if a sub-token character is acceptable in InteractiveParser?

lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.

MIT License


What is the best way to see if a sub-token character is acceptable in InteractiveParser? #1344

Closed RevanthRameshkumar closed 1 year ago

RevanthRameshkumar commented 1 year ago

I want to use the interactive parser to check whether the next character in a stream is acceptable. If you run this code, the only accepted next token is ":". But at the character level, the acceptable next characters are a colon or any character that continues NAME (letters, digits, or underscores):

from lark import Lark

parser = Lark(r"""
%ignore /[\t \f]+/  // WS

start: name|classdef

classdef: "class" name ":"
name: NAME | "match" | "case"
NAME: /[^\W\d]\w*/
""", parser="lalr")

input_str = r"""class fdsf"""
interactive = parser.parse_interactive(input_str)

print(interactive.exhaust_lexer())

print(interactive.accepts())

Is there an efficient way to determine the next possible characters here? The only approach I can think of offhand is to append each candidate character to the string and re-run the parser, which seems horrible.

RevanthRameshkumar commented 1 year ago

Likewise, if I run this code

from lark import Lark

parser = Lark(r"""
%ignore /[\t \f]+/  // WS

start: name|classdef

classdef: "class" name ":"
name: NAME | "match" | "case"
NAME: /[^\W\d]\w*/
""", parser="lalr")

input_str = r"""class"""
interactive = parser.parse_interactive(input_str)

print(interactive.exhaust_lexer())

print(interactive.accepts())

I get

[Token('CLASS', 'class')]
{'NAME', 'CASE', 'MATCH'}

which isn't exactly right, because 'class' can just be the start of a name, which means that any letter or digit is also acceptable as a continuation.

MegaIng commented 1 year ago

The interactive parser works on the level of Tokens, not on the level of Characters. You will have to work a bit harder, for example by remembering the last token and checking if there are other characters you can append to that and still get a working regex.
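A minimal sketch of the "remember the last token and try appending characters" idea, using only the standard library. The `TERMINALS` table, the `CANDIDATES` alphabet, and the `continuation_chars` helper are hypothetical names introduced for illustration; the patterns are copied by hand from the grammar above rather than read from a Lark object. The check relies on `re.fullmatch`, so it only works for terminals where every extension of a valid token is itself a valid token (true for NAME here); a terminal that needs several more characters before it matches again would require partial matching or an FSM.

```python
import re
import string

# Terminal patterns copied by hand from the grammar in this thread
# (an assumption of this sketch, not extracted from Lark itself).
TERMINALS = {
    "NAME": r"[^\W\d]\w*",
    "CLASS": r"class",
    "MATCH": r"match",
    "CASE": r"case",
    "COLON": r":",
}

# Hypothetical candidate alphabet; widen it to suit the real input.
CANDIDATES = string.ascii_letters + string.digits + "_:"


def continuation_chars(last_type, last_text, candidates=CANDIDATES):
    """Return the candidates that can be appended to the last token's
    text so that the result still fully matches that token's regex."""
    pattern = re.compile(TERMINALS[last_type])
    return {ch for ch in candidates if pattern.fullmatch(last_text + ch)}


if __name__ == "__main__":
    # Last token of "class fdsf" is NAME with text "fdsf".
    print(sorted(continuation_chars("NAME", "fdsf")))
    # Last token of "class" is CLASS: no character extends the keyword itself.
    print(sorted(continuation_chars("CLASS", "class")))
```

The full set of acceptable next characters would then be these continuation characters plus whatever can start one of the terminals reported by `interactive.accepts()`. The empty result for `CLASS` is the keyword limitation described in the next paragraph.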

This won't necessarily help with the class example. That isn't really fixable; you will have to special-case identifiers if you want to try to create a general solution.
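One possible way to special-case identifiers, sketched under the same assumptions: ignore the type the lexer assigned to the last token and ask which terminals its text could belong to once a character is appended. The `TERMINALS` table and both helper functions are hypothetical and hand-copied from the grammar in this thread. The sketch only answers the lexical question; whether the grammar allows a NAME at that position still has to be checked against the parser.

```python
import re
import string

# Hand-copied terminal patterns (an assumption of this sketch).
TERMINALS = {
    "NAME": r"[^\W\d]\w*",
    "CLASS": r"class",
    "MATCH": r"match",
    "CASE": r"case",
    "COLON": r":",
}

# Hypothetical candidate alphabet.
CANDIDATES = string.ascii_letters + string.digits + "_:"


def matching_terminals(text):
    """Names of all terminals whose regex fully matches `text`."""
    return {name for name, pat in TERMINALS.items() if re.fullmatch(pat, text)}


def continuation_chars_any(text, candidates=CANDIDATES):
    """Candidates that extend `text` into a full match of any terminal,
    regardless of which terminal the lexer originally chose."""
    return {ch for ch in candidates if matching_terminals(text + ch)}


if __name__ == "__main__":
    # "class" is lexed as CLASS, but its text is also a valid NAME.
    print(sorted(matching_terminals("class")))
    # Treated as a possible NAME, it can be continued by word characters.
    print(sorted(continuation_chars_any("class")))
```

Here the keyword `class` is reported as matching both CLASS and NAME, so word characters show up as valid continuations even though the lexer emitted a CLASS token.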

RevanthRameshkumar commented 1 year ago

That helps. I realized I probably need an FSM-based approach, and then I ran into your interegular module, @MegaIng!