lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.

question: is incremental parsing possible? #488

Open · wbsoft opened this issue 4 years ago

wbsoft commented 4 years ago

I have a use case where I need to parse a text line by line. At the first line, the parser can start at its start position. At the end of the line, the parser should store its state (e.g. which context it is in) in a hashable object. When parsing any line that is not the first line, the state from the end of the previous line is restored and then the current line is parsed.

(This is how QSyntaxHighlighter (from Qt) is organized. When the user types in a certain line, that line is re-highlighted. And when its state at the end is different from what it was before, the next line is also re-highlighted, and so on.)

Would this be possible using the lark parser? From my own experimentation and reading of the docs, it seems I can only parse a full text (or file) in one go.

wbsoft commented 4 years ago

(I wrote the simple module slexer.py that can do this years ago, but maintaining large grammars with that module is becoming quite tedious, and I would really like to use lark instead of my own module.)

erezsh commented 4 years ago

Lark currently doesn't support such a feature.

But, it shouldn't be too hard to implement, at least for the LALR algorithm.

wbsoft commented 4 years ago

Thanks for your quick reply :-)

I was already trying to use the serialize() method but that complained about a missing '__serialize_fields__' attribute.

erezsh commented 4 years ago

Serialize would work, but it would be very inefficient, because it would reconstruct all the instances.

Still, if you want to use it, see the correct usage here: https://github.com/lark-parser/lark/blob/master/lark/tools/standalone.py#L105
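
(For reference, a minimal sketch of what round-tripping a parser through serialization looks like, using Lark's save()/load() convenience wrappers around serialize/deserialize. "my_grammar.lark" is a placeholder filename, and as noted above this rebuilds all the instances rather than saving mid-parse state.)

from io import BytesIO

from lark import Lark

# Placeholder grammar file; not from this thread.
parser = Lark.open("my_grammar.lark", parser="lalr")

buf = BytesIO()
parser.save(buf)           # serialize the compiled parser to a file object
buf.seek(0)
restored = Lark.load(buf)  # reconstruct it without re-reading the grammar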

MegaIng commented 4 years ago

@erezsh This is an interesting use case for everything we are currently talking about.

erezsh commented 4 years ago

Yes, I agree. It should be possible to let the parser save its state at certain points, using a "puppet" (or a capsule, or however we name it), and then allow the user to resume from any one of the save points.

It will affect performance, but it might be good enough for the purpose of an IDE.

I'm still not sure what's the best interface to specify these save-points.
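
(For illustration only: a rough sketch of what such save-points might look like on top of the interactive LALR parser that Lark exposes via parse_interactive(). This is not an official incremental-parsing API; my_grammar is a placeholder, and a real solution would also need to capture the lexer position, not just the parser state.)

from lark import Lark

parser = Lark(my_grammar, parser="lalr")  # my_grammar is a placeholder

def line_snapshots(text):
    # Snapshot the (hashable) parser state at the end of every line, so a
    # changed line could later be re-parsed from the previous line's snapshot.
    interactive = parser.parse_interactive(text)
    snapshots = {}
    current_line = 1
    for token in interactive.iter_parse():       # feeds tokens one at a time
        if token.line and token.line > current_line:
            snapshots[current_line] = interactive.as_immutable()
            current_line = token.line
    snapshots[current_line] = interactive.as_immutable()
    return snapshots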

rec commented 4 years ago

:+1: for this issue.

I could use a much simpler API even - example:

with open(my_file) as lines:
    parser = Lark(...)
    for data in parser.parse_incrementally(lines):
        my_process(data)

Give it an iterator of lines - result is an iterator of data.

The typical example is reading a long file containing a large number of small JSON or JSON-like records. I want to feed in each line and have the results come out as they are finished, not have to read the whole file into memory before the first one comes out!

Or it might be a long-running socket connection....

MegaIng commented 4 years ago

The problem is: what is data? I.e., which element of the grammar is it?

erezsh commented 4 years ago

@rec I think you misunderstood what incremental parsing means.

This issue is about storing "savepoints" while parsing, and being able to restart the run from these savepoints. For example, if the input has changed. That could be useful for IDEs.

What I think you're asking for, which is to parse a large file without having to load all of it into memory, is something that Lark currently doesn't do.

edit: Removed incorrect claims and bad advice.

rec commented 4 years ago

Gosh, you guys are responsive. :-) Thanks!

MegaIng: The "data" is the top-level terminal in the grammar.

erezsh: Brilliant answer!

I was somewhat aware that my "incremental" parsing wasn't what was described here. The literature doesn't really have any consistent name for what I was describing and some people call it incremental parsing.

The solution you present seems good except that the code has to re-read my_grammar.lark for each transaction.

If I knew that the previous parser had just successfully parsed a top-level terminal, then I could just re-use the previous parser, but it might have failed, or I might be reading two separate streams in parallel.

I need to deliver a proof-of-concept, though, so this is just not a big deal!

Another hit, another home run(*) from the lark team. Thanks again!


(* - or insert favorite sport here!)

MegaIng commented 4 years ago

And we will be even more responsive :-P.

The "data" is the top-level terminal in the grammar.

Although erezsh did correctly spot the misunderstanding, this might not work, since the top-level structure is not completely finished after one line. And if we parse multiple lines, we have no benefit over just parsing everything in one go.

> The solution you present seems good except that the code has to re-read my_grammar.lark for each transaction.

No. The Lark object should be completely safe to reuse, even in a multi-threaded environment. Behavior to the contrary is a bug.
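
(A trivial sketch of that reuse pattern; the grammar file and inputs are placeholders, not from this thread.)

from lark import Lark

# Build the Lark object once; the grammar is read and compiled a single time.
parser = Lark.open("my_grammar.lark", parser="lalr")

# ...then reuse the same instance for every transaction.
with open("transactions.txt") as f:
    trees = [parser.parse(line) for line in f]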

erezsh commented 4 years ago

@rec I don't think I fully understood your use-case. But please, let's move this to a new issue (feel free to open one, and express exactly what you're after), or continue on gitter.

jspaezp commented 2 years ago

I think I understand what he wants (and I feel like I want the same thing).

I think the example would be: if you had a 60 GB JSON file which is just a list of requests (each being very small), is there a way to iterate over the file and have the parser return one request at a time, without parsing/lexing all of them at once?

Another example I can think of is log files, which can be massive but each entry is fairly small.

In other words, if the text being parsed is just repetitions of a single element type, is there a way to have the parser work as a generator for such elements? Do you have any idea on how this could be done?

It would be great to have something like ...

parser = Lark(grammar, transformer=MyTransformer(), iter='dict')  # Specify which rules should be yielded

# Assuming the contents of the json file are a long list of small entries
with open('massive_file.json', 'r') as f:
    for my_entry in parser.iter_parse(f):
        some_random_process(my_entry)

I am so far loving the package and I can see how this could be handy. Thank you so much for the hard work!

rec commented 2 years ago

Sorry for not replying before, I suddenly started a new job.

Yes, you have exactly my use case - a large file with a large number of small items.

By the way, we are now using Lark in that new job, and like always, it worked flawlessly and I didn't even have to get involved to help!

MegaIng commented 2 years ago

The existing interactive parser can be used to create a small wrapper that does this.

from queue import Queue

from lark import Discard, Lark

json_grammar = r"""
    ?start: "[" [command ("," command)*] "]"
    command: value

    ?value: object
          | array
          | string
          | SIGNED_NUMBER      -> number
          | "true"             -> true
          | "false"            -> false
          | "null"             -> null

    array  : "[" [value ("," value)*] "]"
    object : "{" [pair ("," pair)*] "}"
    pair   : string ":" value

    string : ESCAPED_STRING

    %import common.ESCAPED_STRING
    %import common.SIGNED_NUMBER
    %import common.WS

    %ignore WS
"""

class Transformer:
    def __init__(self, callback):
        self.callback = callback

    # Called as soon as a `command` rule is reduced: hand the parsed value to
    # the callback and discard the node, so the full tree never builds up.
    def command(self, children):
        self.callback(children[0])
        return Discard

def iter_parser(*args, transformer, **kwargs):
    queue = Queue()
    # The interactive parser and the inline transformer require the LALR algorithm.
    if not kwargs.setdefault("parser", "lalr") == "lalr":
        raise ValueError("The lalr parser is required")
    kwargs['transformer'] = transformer(queue.put)
    parser = Lark(*args, **kwargs)

    def parse(text, start=None):
        interactive = parser.parse_interactive(text, start)
        token = None
        # iter_parse() feeds tokens to the parser one at a time; after each
        # token, yield whatever the transformer has pushed onto the queue.
        for token in interactive.iter_parse():
            while not queue.empty():
                yield queue.get()
        # Signal end of input, then drain anything produced by the final reductions.
        interactive.feed_eof(token)
        while not queue.empty():
            yield queue.get()
    return parse

p = iter_parser(json_grammar, parser="lalr", transformer=Transformer)

test_text = """
[
 {"command": "print", "args": ["argument", 0, {"some": "object"}]},
 {"command": "input", "args": ["some prompt"]}
]
"""

for c in p(test_text):
    print("got", c)
rec commented 2 years ago

Super, mega-cool!

jspaezp commented 2 years ago

I'm not sure if I am more impressed by the response time or the actual solution ... This is amazing! I definitely feel like this could be part of the tutorials. Right now I don't have time to write it myself and submit a PR, but if there is interest, I can give it a go at a later point.