lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.

Partial parsing #1367

Closed Daniel63656 closed 3 months ago

Daniel63656 commented 8 months ago

Is it possible to parse one token at a time, building a parse tree incrementally in the process? One could then also run a custom interpreter at each step. I know there is parser.parse_interactive(), but it does not seem to build the parse tree incrementally, and it also needs the type for each token.

erezsh commented 8 months ago

It does build the parse-tree incrementally. You can find it at ip.parser_state.value_stack, where ip: InteractiveParser. But it might be a bit tricky to use it effectively.
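
For illustration, a minimal sketch of how to reach that stack, using a toy grammar (the terminal names A and B are made up for this example):

from lark import Lark, Token

parser = Lark('''
start: A B
A: "a"
B: "b"
''', parser="lalr")

ip = parser.parse_interactive()
ip.feed_token(Token('A', 'a'))
print(ip.parser_state.value_stack)  # partial results accumulated so far
ip.feed_token(Token('B', 'b'))
tree = ip.feed_eof()                # finish the parse and get the tree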

Daniel63656 commented 8 months ago

I am a bit confused about this incremental parsing. I want to feed it one token at a time, but it requires a Token class that also has a type. Is this type the current production rule? Why is this necessary when the parse tree is constructed incrementally? The current rule should be obvious then, or not? Also, value_stack is a stack and not a tree.

Edit: It seems these types are generated by the parser, but I don't know them. I only have the tokens themselves (which should be enough; why link them all to a different string anyway?).

erezsh commented 8 months ago

Like you said, tokens have a type and a value. The value is the actual bit of text being parsed, and the type is the category of that text. The parser doesn't look at the value at all. It only looks at the type, in order to choose how to update the state-machine.

value_stack is a stack, of course; it's a stack of trees (or tokens).
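
To make the type/value distinction concrete, a tiny sketch (the name DOT is just an example):

from lark import Token

tok = Token('DOT', '.')      # type 'DOT' is the category, '.' is the matched text
print(tok.type, tok.value)   # -> DOT .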

Daniel63656 commented 8 months ago

Yes, but I don't know the type. I just want to add tokens (values) one at a time. At first I thought the type had something to do with the rules they are used in, but it's just another string that the parser uses instead of the actual (value) string. Seems kind of unnecessary. How would I even get these types?

MegaIng commented 8 months ago

You define them. In the grammar. In the above grammar, it's the A, B, C, i.e. the left side of the :. What exactly the token types are depends on your grammar. If you use literals directly in your grammar, lark will automatically choose names. If you print parser.terminals, you can see a list of all defined terminals. This isn't unnecessary: the parser doesn't care about the original text at all, it only cares about the token types. One token type can represent an infinite number of actual text strings. If you want to skip the lexer step, you need to do the lexer's job yourself and provide the token types.
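
A small sketch of that last point, with a toy grammar (the name A is what lark would typically auto-generate for the literal "a"):

from lark import Lark, Token

parser = Lark('start: "a"+', parser="lalr")
print(parser.terminals)         # lark auto-named the literal, e.g. A
ip = parser.parse_interactive()
ip.feed_token(Token('A', 'a'))  # skipping the lexer: we supply the type ourselves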

Daniel63656 commented 8 months ago

I'm sorry, I am new to this. I have the following grammar:

score: "bos" event "eos"
event: E meta ottava? group+ "|"?
group: rest | chord | "grace" chord
rest: R P? "."
chord: "whole" note+ "." | "half" D note+ "." | D note+ "." | D P? "flag"+ note+ "."
note: A? "tie"? N "tie"?
meta: clef | time | key
time: digit+ "over" digit+
key: "#"+ | "b"+ | "nat"+
ottava: O P

E: "treble" | "bass" | "staffchange"
R: /rest[0-7]/
P: "start" | "continue" | "stop"
D: "up" | "dn"
A: "#" | "b" | "nat" | "x" | "bb"
N: "C" | "D" | "E" | "F" | "G" | "A" | "B"
O: "8va" | "8vb"
clef: "gclef" | "fclef"
digit: /[0-9]/

The expressions left of the ':' are my non-terminals. Now if I print the terminals:

parser = Lark(grammar, parser="lalr", start='score')
print(parser.terminals)

I get for example:

TerminalDef('E', '(?:staffchange|treble|bass)')
TerminalDef('R', 'rest[0-7]')
TerminalDef('VBAR', '\|')
TerminalDef('GRACE', 'grace')
TerminalDef('DOT', '\.')
TerminalDef('__ANON_0', '[0-9]')

I discovered that the parser works when I use these categories as types, e.g. Token('DOT', '.'). But I didn't define (and don't know) these types. If a terminal only appears on the right side of productions, like ".", "grace", or "flag", does it then get its own new type, like DOT: "." or GRACE: "grace"? Should I define them myself (does every non-terminal need a production rule like nonterminal:terminal to cover them)?

Some types like 'E', 'R', and so on are defined as sets by me, but still, why is 'digit' turned into __ANON_0?

You said a lexer would typically do this. Can I then also partially lex one token at a time to get the type?

MegaIng commented 8 months ago

If a terminal only appears on the right side of productions, like ".", "grace", or "flag", does it then get its own new type, like DOT: "." or GRACE: "grace"?

Yes, lark automatically gives names, as I mentioned before.

Should I define them myself?

That makes the names more predictable, but it's not necessary.

does every non-terminal need a production rule like nonterminal:terminal to cover them

Not sure what you mean. Are you sure you are using the terms nonterminal and terminal correctly?

In lark, everything in uppercase is a terminal, everything in lowercase is a nonterminal.

Some types like 'E', 'R', and so on are defined as sets by me, but still, why is 'digit' turned into __ANON_0?

Because lark couldn't find a better name. Every terminal needs a name.

You said a lexer would typically do this. Can I then also partially lex one token at a time to get the type?

Sure, you can try using parser.lex for that.
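
For example, something along these lines (a sketch, assuming the parser built from the grammar above):

# Lark.lex runs only the lexer; lexing a string that is exactly one
# terminal should yield a single typed token.
tok = next(parser.lex('treble'))
print(tok.type, tok.value)   # expected: E treble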

Daniel63656 commented 8 months ago

Not sure what you mean. Are you sure you are using the terms nonterminal and terminal correctly?

I am not sure. From the json_tutorial:

rule_name : list of rules and TERMINALS to match
          | another possible list of items
          | etc.

TERMINAL: "some text to match"

Here rule_name is a non-terminal. And an example for a TERMINAL could be:

NUMBER : /-?\d+(\.\d+)?([eE][+-]?\d+)?/
STRING : /".*?(?<!\\)"/

What throws me off is that I read that the left side of production rules must be non-terminals?! I also thought terminals are tokens like "flag" or "." in my case. But you say terminals are these categories?

Because lark couldn't find a better name. Every terminal needs a name.

But I defined a name: digit. That's kind of my point: if lark switches these names around, then I can't reliably provide them myself. I didn't think of the lexer. I thought a lexer only breaks a string of tokens into individual tokens, tbh. Is generating these types also part of lexing, and will they then be the same as the parser's?

MegaIng commented 8 months ago

But I defined a name: digit

You defined a name for the nonterminal, not for the terminal. You need to define a name for the terminal, in uppercase.

I thought a lexer only breaks a string of tokens into individual tokens, tbh. Is generating these types also part of lexing, and will they then be the same as the parser's?

As I mentioned above: a big part of the lexer's job is generating these token types. For the same grammar, it will always produce the same terminal names.
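
A quick sketch of that stability (assuming grammar holds the grammar text from above):

from lark import Lark

# Terminal names are derived deterministically from the grammar text,
# so rebuilding the parser yields the same set of names.
names = sorted(t.name for t in Lark(grammar, parser="lalr", start="score").terminals)
again = sorted(t.name for t in Lark(grammar, parser="lalr", start="score").terminals)
assert names == again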

Daniel63656 commented 8 months ago

You defined a name for the nonterminal, not for the terminal. You need to define a name for the terminal, in uppercase.

Ah, because I wrote it in lowercase? I had this in a university course, and there nonterminals got capital letters; I guess that's where the confusion comes from :confounded: And the type will then be the defined terminal name, got it. I should then probably change my grammar to:

...
E: "treble" | "bass" | "staffchange"
R: /rest[0-7]/
P: "start" | "continue" | "stop"
D: "up" | "dn"
A: "#" | "b" | "nat" | "x" | "bb"
N: "C" | "D" | "E" | "F" | "G" | "A" | "B"
O: "8va" | "8vb"
CLEF: "gclef" | "fclef"
DIGIT: /[0-9]/

But now I still need a way to get the type for a token.

Sure, you can try using parser.lex for that.

next(parser.lex('bos')) returns a token, but the wrong one in this case: Token('KEY', 'b'). Shouldn't the lexer match Token('BOS', 'bos')?

This works fine when using the parser on entire input strings, so I guess the question remains: how do I use a lexer incrementally?

Daniel63656 commented 8 months ago

After searching the source code for a while, I figured it is probably easiest to do this myself:

import re
from lark import Token

class DynamicLexer:
    def __init__(self, terminals):
        # Store a (name, regex) pair for every terminal defined by the grammar
        self.terminals = [(terminal.name, terminal.pattern.to_regexp()) for terminal in terminals]

    def lex_token(self, token_string):
        # Return a typed Token iff the string fully matches a terminal's pattern
        for name, pattern_str in self.terminals:
            if re.fullmatch(pattern_str, token_string):
                return Token(name, token_string)
        raise ValueError(f"Token '{token_string}' could not be matched.")

# example usage
lexer = DynamicLexer(parser.terminals)
print(lexer.lex_token('bos'))
print(lexer.lex_token('bb'))
# expect this to raise an error
print(lexer.lex_token('bbt'))

This very simple class uses the terminals from the parser (so they are always the same as in your defined grammar) and returns a correct token with type if and only if the token can be fully matched. No clue if this is redundant or something, but I don't see how to make it work with lark directly.

One can now wrap a parser to work on token strings only:

class IncrementalParser:
    def __init__(self, parser):
        self.terminals = [(terminal.name, terminal.pattern.to_regexp()) for terminal in parser.terminals]
        self.parser = parser.parse_interactive()

    def lex_token(self, token_string):
        # Same full-match lexing as in DynamicLexer above
        for name, pattern_str in self.terminals:
            if re.fullmatch(pattern_str, token_string):
                return Token(name, token_string)
        raise ValueError(f"Token '{token_string}' could not be matched.")

    def feed_token(self, token_string):
        # Lex a single token string, then advance the interactive parser
        token = self.lex_token(token_string)
        self.parser.feed_token(token)

    def accepts(self):
        # Terminal names the parser would accept in its current state
        return self.parser.accepts()
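
A possible usage sketch (assuming grammar holds the grammar text from earlier; the token types are whatever names lark generated for it):

from lark import Lark

parser = Lark(grammar, parser="lalr", start='score')
ip = IncrementalParser(parser)
ip.feed_token('bos')   # lexed to a typed Token, then fed to the parser
print(ip.accepts())    # set of terminal names acceptable next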