Closed davidmcnabnz closed 5 days ago
For LALR, this is very easy to do using the Lark.parse_interactive() method, and then calling iter_parse().
See this recipe for an example: https://lark-parser.readthedocs.io/en/latest/recipes.html#adding-a-progress-bar-to-parsing-with-tqdm
I'm not sure this is relevant for Earley, since it matches and considers many different tokens that are eventually thrown away, i.e. it's not exactly a stream of tokens.
@erezsh thanks for that. I have been meaning to check out .parse_interactive(), but I've been a bit too tight-looped in my current project (well that's the official excuse anyway ;) ).
I'll make a point of trying it out today. I'm guessing it will have big payoffs when I hit a lot more of the ancient syntax's nitty-gritties.
I just tried it out, but noticed that the .result wasn't getting assigned with the parse tree. After a bit of stepping through LARK internals, I noticed the parser wasn't seeing EOF. Sorted by manually calling .feed_eof() after the iterator loop quits.
Sample code is below, and has got me exactly where I need to be :) :
    def parseInteractive(self, raw, *args, **kw):
        pi = self.parser.parse_interactive(raw, *args, **kw)
        for token in pi.iter_parse():
            if self.debug:
                ctr = pi.lexer_state.state.line_ctr
                line = ctr.line
                column = ctr.column
                tokType = token.type.split('__')[-1]
                tokVal = token.value
                print(f"TOKEN:{line}:{column}:{tokType}={repr(tokVal)}")
        # apparently the interactive parser never sees $END, so we have to
        # feed it in explicitly
        pi.feed_eof()
        # now we can harvest the transformed tree
        result = pi.result
        return result
As a takeaway, there might be merit in adding a couple of properties to the InteractiveParser object to allow cleaner and more future-proof access to line, column and character position values, to cover cases where access paths such as .lexer_state.state.line_ctr.line get broken by future updates.
But for now, I'm delighted to have so much transparency in the parser's activity. Thanks again!
It looks like the recipe isn't entirely correct. To get the result, you should call pi.resume_parse(). See example here: https://github.com/geographika/mappyfile/blob/master/mappyfile/parser.py#L218
As for the line and column numbers, why don't you just take them from the token?
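Putting the two suggestions above together, a minimal sketch might look like the following. It assumes `parser` is a Lark instance built with parser="lalr"; the function name `parse_with_token_log` and the `out` parameter are made up for illustration and are not part of Lark.

```python
import sys


def parse_with_token_log(parser, text, out=sys.stderr):
    """Parse `text`, printing each token and its position as it is consumed.

    Assumption: `parser` is a lark.Lark instance created with parser="lalr",
    so that parse_interactive() is available.
    """
    pi = parser.parse_interactive(text)
    for token in pi.iter_parse():
        # Each token carries its own position, so there is no need to
        # reach into pi.lexer_state.state.line_ctr.
        print(
            f"TOKEN:{token.line}:{token.column}:{token.type}={token.value!r}",
            file=out,
        )
    # As noted in the thread, the loop alone does not produce the result;
    # resume_parse() finishes the parse and returns it.
    return pi.resume_parse()
```

Since the return value comes straight from resume_parse(), there is no need to read pi.result or call feed_eof() by hand.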
Suggestion
Requesting constructor keyword options to allow logging the lexer token stream. Also, if feasible, the potential target fulfilments in the current context.
Describe alternatives you've considered
The PyCharm debugger has sophisticated breakpoint options, including the ability to set a breakpoint to:
Additional context
Printing the token stream, via the above IDE debugger breakpoint technique, has been a huge support in my current project.
(FYI, this requires carefully retro-implementing a parser for an archaic, convoluted and very non-standard programming/configuration language from the 1980s, whose parser was originally implemented in hand-crafted C, incrementally coded/patched/extended in a silo over the decades, and with no formal grammar specification, not even YACC. Getting its various cryptic nuances to parse and correctly feed into my transformer is a massively challenging undertaking, but I'm getting there.)
I would really like to be able to watch or log the LARK parser's token stream without reliance on the IDE. Even if a constructor option allowed passing an open writeable file object, and/or a logger object, and/or the pathname of a file to write to, this would be very helpful.
In a perfect world, for each token fetched and logged, it would be even better to see the current line/column numbers in the input at which the token was matched.
I acknowledge that logging of parser state would be a much harder venture, especially to do so in a readable manner. So even just token stream logging would be quite a boost.
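Until something like this exists as a constructor option, the same effect can be approximated in user code. The sketch below is only an illustration of the request: `token_sink` and `logged` are hypothetical helper names, not Lark APIs, and the tokens are assumed to expose .type, .value, .line and .column as Lark tokens do.

```python
import logging
import os
from contextlib import contextmanager


@contextmanager
def token_sink(target):
    """Yield an emit(str) callable for a logger, a pathname, or an open file."""
    if isinstance(target, logging.Logger):
        yield target.debug
    elif isinstance(target, (str, os.PathLike)):
        # A pathname: open it here and close it when the block exits.
        with open(target, "a", encoding="utf-8") as fh:
            yield lambda msg: fh.write(msg + "\n")
    else:
        # Anything else is treated as an already-open writeable file object.
        yield lambda msg: target.write(msg + "\n")


def logged(tokens, emit):
    """Pass tokens through unchanged, emitting one line per token."""
    for tok in tokens:
        emit(f"{tok.line}:{tok.column} {tok.type} {tok.value!r}")
        yield tok
```

With an LALR parser this could wrap the interactive token iterator, for example `with token_sink("tokens.log") as emit:` followed by a loop over `logged(pi.iter_parse(), emit)`.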