lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.

A streamable parser #1211

Open swamidass opened 2 years ago

swamidass commented 2 years ago

Suggestion

Sometimes we need to parse and transform a very large stream that can't fit into memory all at once. Lark has a way of applying a transformer during parsing, without constructing the tree: https://lark-parser.readthedocs.io/en/latest/json_tutorial.html#part-5-optimizing , using this API:

json_parser = Lark(json_grammar, start='value', parser='lalr', transformer=TreeToJson())
print(json_parser.parse(f.read()))

The problem here is that f.read() requires reading the whole string into memory. Ideally, the API would allow something like this:

json_parser = Lark(json_grammar, start='value', parser='lalr', transformer=TreeToJson())
for emitted_object in json_parser.stream(readable_stream):
    work(emitted_object)

Here, TreeToJson could be implemented to (for example) emit objects that match a jsonpath pattern. This would require extending the transformer API so that the transformer has a way of notifying the parser when an object is ready to be emitted.

Less ideally, but still usable, would be syntax like the following. The advantage is that the parser-transformer internals might not need to change much at all, though it would require users of the library to write the transformer object carefully enough to make this work (a sketch of such a transformer follows below):

json_parser = Lark(json_grammar, start='value', parser='lalr', transformer=transformer)
json_parser.stream(readable_stream)
for emitted_object in iter(transformer):
    work(emitted_object)

If this could be done, it would make Lark capable of parsing infinite inputs (e.g. an infinite stream of JSON objects) with bounded memory, along with all the performance benefits that come with this pattern.
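
To make the second variant concrete, here is a minimal sketch of such a "clever" transformer. The class name EmittingTreeToJson, the emit callback, and the rule names string/number/pair/object are all illustrative assumptions, loosely following the JSON tutorial grammar:

from lark import Transformer

class EmittingTreeToJson(Transformer):
    """Illustrative only: builds values bottom-up (LALR + embedded transformer)
    and hands every completed object to an emit callback instead of keeping
    the whole result in memory."""

    def __init__(self, emit):
        super().__init__()
        self.emit = emit  # e.g. queue.put, or any callable accepting one object

    def string(self, items):
        return items[0][1:-1]   # drop the surrounding quotes

    def number(self, items):
        return float(items[0])

    def pair(self, items):
        return tuple(items)

    def object(self, items):
        obj = dict(items)
        # A real implementation would emit only objects matching some pattern
        # (e.g. a jsonpath), as suggested above; this sketch emits every one.
        self.emit(obj)
        return obj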

Describe alternatives you've considered

It might be possible to implement this as a function or library that operates on top of lark. However, such a function would require some intimate knowledge of the non-public API of lark to work. The ideal situation would be for Lark itself to implement this function.

MegaIng commented 2 years ago

I previously wrote a small function that does pretty much exactly this: https://github.com/lark-parser/lark/issues/488#issuecomment-1235162911

No real knowledge of the internals is required; it may just be that some of this is missing from the documentation.

To get exactly the behavior you want, it just needs a small custom lexer.

swamidass commented 2 years ago

Thanks for linking to that! Looks close, but not quite there. As written, doesn't it assume you have the full text read into memory? How would you rework it to use a stream?

MegaIng commented 2 years ago

(Note: Having the full text in memory is almost surely not a problem, unless you are literally parsing GBs of data)

A CustomLexer can be created that takes a callable providing the next chunk of text, which is then handed off to a basic lexer. Assumptions such as tokens never spanning a line boundary (explicit newlines excluded) greatly simplify the implementation, but aren't required. I could see if I can create a short example.

MegaIng commented 2 years ago

Ok, I misremembered how best to stream data into the lexer; it's a bit more complicated than I thought. But this should work, although it isn't tested much. At the very least, maybe this should go somewhere in the examples.

from queue import Queue

from lark import Lark
from lark.lexer import LexerState


def parse_from_stream(parser, provider, start=None):
    """
    Parses iteratively from a potentially huge stream, throwing away text that is no longer needed.
    Uses the interactive parser interface, as well as implementation details of the LexerState objects.

    parser is a Lark instance
    provider is a string producer with the signature `() -> str`, for example `partial(file.read, 64)`
    start is passed through to parse_interactive
    """

    def ensure_preview(lexer_state: LexerState):
        """
        Ensure, if possible, that the current text buffer contains at least 2 newlines after the current position.
        As long as no token spans more than 2 newlines, this makes sure that the lexer
        doesn't encounter any unexpected problems.
        TODO: The line counter isn't 100% correctly being adjusted.
        """
        if lexer_state.text[lexer_state.line_ctr.char_pos:].count("\n") > 2:
            # More than 2 newlines are still buffered, so no refill is needed yet.
            return
        # Discard text that has already been consumed, keeping a small margin behind the cursor.
        throw_away = lexer_state.text[:max(lexer_state.line_ctr.char_pos - 5, 0)]
        keep = lexer_state.text[max(lexer_state.line_ctr.char_pos - 5, 0):]
        lexer_state.line_ctr.char_pos -= len(throw_away)
        lexer_state.text = keep + provider()

    interactive = parser.parse_interactive("", start)
    token = None
    while True:
        ensure_preview(interactive.lexer_thread.state)
        try:
            token = next(interactive.lexer_thread.lex(interactive.parser_state))
        except StopIteration:
            break
        else:
            interactive.feed_token(token)
            yield token
    interactive.feed_eof(token)


def iter_parser_stream(*args, transformer, **kwargs):
    queue = Queue()
    if kwargs.setdefault("parser", "lalr") != "lalr":
        raise ValueError("The lalr parser is required")
    # The transformer factory receives queue.put, so it can emit finished objects as parsing proceeds.
    kwargs['transformer'] = transformer(queue.put)
    parser = Lark(*args, **kwargs)

    def parse(stream, start=None):
        for token in parse_from_stream(parser, stream, start):
            # Drain everything the transformer has emitted since the last token.
            while not queue.empty():
                yield queue.get()
        while not queue.empty():
            yield queue.get()

    return parse
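
A usage sketch of the helpers above, assuming the hypothetical EmittingTreeToJson transformer sketched earlier (a factory that takes an emit callback); the file name and chunk size are made up:

from functools import partial

# Build a streaming parse function; the transformer factory is called with queue.put.
parse = iter_parser_stream(json_grammar, start='value', transformer=EmittingTreeToJson)

with open("huge.json") as f:
    # The provider returns the next 4096-character chunk, or "" once the file is exhausted.
    for obj in parse(partial(f.read, 4096)):
        work(obj)
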
erezsh commented 2 years ago

Hello @swamidass and @MegaIng ,

I wonder if maybe the best solution is to use mmap.

I imagine it's the most efficient option, and we already support bytes as input, so it will probably only require minor changes.

What do you think?
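
For what it's worth, a minimal sketch of the mmap idea might look like this, assuming the grammar is compiled with use_bytes=True and that the lexer's regexes accept a bytes-like mmap object (the file name is made up, and I haven't verified this end to end):

import mmap

from lark import Lark

parser = Lark(json_grammar, parser='lalr', start='value', use_bytes=True)

with open("huge.json", "rb") as f:
    # Memory-map the file so the OS pages data in and out as the lexer scans it.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        tree = parser.parse(mm)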

erezsh commented 2 years ago

But if unicode is a must, I don't see a better approach than what MegaIng presented, which is to keep growing your text buffer as the lexer advances. The obvious downside is that each next token must lie completely within that buffer, or the parse will fail.

MegaIng commented 2 years ago

Using the regex library instead of the stdlib re library, it might be possible to make use of partial matches to know whether or not you need more buffer area. But that would certainly require a decently complex custom lexer.

https://stackoverflow.com/questions/4634376/python-regex-parse-stream
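
A minimal illustration of that feature (regex here is the third-party module, not the stdlib re; the pattern is just a toy token):

import regex

INT = regex.compile(r"\d+")  # toy token pattern: an integer literal

chunk = "123"  # current end of the buffered text
m = INT.match(chunk, partial=True)
if m is not None and m.partial:
    # The match ran into the end of the buffer and could still grow, so the
    # lexer should fetch more text before deciding where this token ends.
    print("need more data before emitting a token")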

swamidass commented 2 years ago

In summary, it seems this functionality is technically feasible but not easy to implement. It's also useful, so adding the feature to the main library seems worthwhile.