lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
MIT License

Req: option for watching token streams and candidate targets #1320

Closed: davidmcnabnz closed this issue 5 days ago

davidmcnabnz commented 1 year ago

Suggestion: Requesting constructor keyword options to allow logging of the lexer's token stream and, if feasible, the potential target fulfilments in the current context.

Describe alternatives you've considered: The PyCharm debugger has sophisticated breakpoint options, including the ability to set a breakpoint to:

  1. Stay dormant until or unless a specific other breakpoint is reached, then become active
  2. Execute an arbitrary Python statement every time it is reached (in my case, a print statement for the token)

Additional context: Printing the token stream via the above IDE debugger breakpoint technique has been a huge help in my current project.

(FYI, this requires carefully retro-implementing a parser for an archaic, convoluted and very non-standard programming/configuration language from the 1980s, whose parser was originally implemented in hand-crafted C, incrementally coded/patched/extended in a silo over the decades, and with no formal grammar specification, not even YACC. Getting its various cryptic nuances to parse and correctly feed into my transformer is a massively challenging undertaking, but I'm getting there.)

I would really like to be able to watch or log the Lark parser's token stream without relying on the IDE. Even a constructor option that accepted an open writeable file object, a logger object, or the pathname of a file to write to would be very helpful.

In a perfect world, for each token fetched and logged, it would be even better to see the current line/column numbers in the input at which the token was matched.

I acknowledge that logging parser state would be a much harder venture, especially doing so in a readable manner. So even just token-stream logging would be quite a boost.

erezsh commented 1 year ago

For LALR, this is very easy to do using the Lark.parse_interactive() method, and then calling iter_parse().

See this recipe for an example: https://lark-parser.readthedocs.io/en/latest/recipes.html#adding-a-progress-bar-to-parsing-with-tqdm

I'm not sure this is relevant for Earley, since it matches and considers many different tokens that are eventually thrown away; i.e. it's not exactly a stream of tokens.
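For reference, a minimal sketch of that LALR pattern (the toy grammar and input are invented here purely for illustration):

    from lark import Lark

    # toy LALR grammar, made up just to demonstrate the pattern
    parser = Lark(r"""
        start: WORD+
        %import common.WORD
        %import common.WS
        %ignore WS
    """, parser="lalr")

    pi = parser.parse_interactive("watch these tokens go by")

    # iter_parse() yields each token as it is fed to the parser,
    # so the loop body can log or inspect the live token stream
    for token in pi.iter_parse():
        print(token.type, repr(token.value))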

davidmcnabnz commented 1 year ago

@erezsh thanks for that. I have been meaning to check out .parse_interactive(), but I've been a bit too tight-looped in my current project (well that's the official excuse anyway ;) ).

I'll make a point of trying it out today. I'm guessing it will have big payoffs when I hit a lot more of the ancient syntax's nitty-gritties.

davidmcnabnz commented 1 year ago

I just tried it out, but noticed that .result wasn't getting assigned the parse tree. After a bit of stepping through Lark internals, I noticed the parser wasn't seeing EOF. Sorted it by manually calling .feed_eof() after the iterator loop finishes.

Sample code is below, and has got me exactly where I need to be :) :

    def parseInteractive(self, raw, *args, **kw):
        pi = self.parser.parse_interactive(raw, *args, **kw)
        for token in pi.iter_parse():
            if self.debug:
                ctr = pi.lexer_state.state.line_ctr
                line = ctr.line
                column = ctr.column
                tokType = token.type.split('__')[-1]
                tokVal = token.value
                print(f"TOKEN:{line}:{column}:{tokType}={repr(tokVal)}")

        # apparently the interactive parser never sees $END, so we have to
        # feed it in explicitly
        pi.feed_eof()

        # now we can harvest the transformed tree
        result = pi.result
        return result

As a takeaway, there might be merit in adding a couple of properties to the InteractiveParser object to allow cleaner and more future-proof access to line, column and character position values, to cover the case where access paths like .lexer_state.state.line_ctr.line get broken by future updates.

But for now, I'm delighted to have so much transparency in the parser's activity. Thanks again!

erezsh commented 1 year ago

It looks like the recipe isn't entirely correct. To get the result, you should call pi.resume_parse(). See example here: https://github.com/geographika/mappyfile/blob/master/mappyfile/parser.py#L218

As for the line and column numbers, why don't you just take them from the token?
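Putting those two corrections together, the helper from the earlier comment might look roughly like this (an untested sketch; self.parser and self.debug are carried over from the snippet above):

    def parseInteractive(self, raw, *args, **kw):
        pi = self.parser.parse_interactive(raw, *args, **kw)
        for token in pi.iter_parse():
            if self.debug:
                # line/column are available directly on the Token object
                tokType = token.type.split('__')[-1]
                print(f"TOKEN:{token.line}:{token.column}:{tokType}={token.value!r}")

        # resume_parse() finishes the parse (including the $END token)
        # and returns the resulting tree, so no manual feed_eof() is needed
        return pi.resume_parse()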