swamidass opened this issue 2 years ago
I previously wrote a small function that does pretty much exactly this: https://github.com/lark-parser/lark/issues/488#issuecomment-1235162911
No real knowledge of the internals is required; it might just be that this stuff is missing from the documentation.
To get exactly the behavior you want, it just needs a small custom lexer.
Thanks for linking to that! Looks close, but not quite there. As written, doesn't it assume you have the full text read into memory? How would you rework it to use a stream?
(Note: Having the full text in memory is almost surely not a problem, unless you are literally parsing GBs of data)
A CustomLexer can be created that takes in a callable that provides the next chunk of text. This can then be handed off to a base lexer. Assumptions such as tokens never spanning a line boundary (excluding explicit newlines) greatly help with the implementation, but aren't required. I could see if I can create a short example.
Ok, I misremembered how to best stream data into the lexer, it's a bit more complicated than I thought. But this should work, although it's not tested a lot. Maybe this should at the very least be somewhere in the examples.
from queue import Queue

from lark import Lark
from lark.lexer import LexerState


def parse_from_stream(parser, provider, start=None):
    """
    Parses iteratively from a potentially huge stream, throwing away text that is no longer needed.
    Uses the interactive parser interface, as well as implementation details of the LexerState objects.

    parser is a Lark instance
    provider is a string producer with the signature `() -> str`, for example via `partial(file.read, 64)`
    start gets passed to the Lark instance
    """

    def ensure_preview(lexer_state: LexerState):
        """
        Ensure, if possible, that the current text buffer contains at least 2 newlines after the current position.

        As long as there are no tokens that span more than 2 newlines, this will make sure that the lexer
        doesn't encounter any unexpected problems.

        TODO: The line counter isn't 100% correctly being adjusted.
        """
        if lexer_state.text[lexer_state.line_ctr.char_pos:].count("\n") > 2:  # More than 2 newlines remaining
            return False
        # Drop everything more than 5 characters behind the current position and append a fresh chunk.
        throw_away = lexer_state.text[:max(lexer_state.line_ctr.char_pos - 5, 0)]
        keep = lexer_state.text[max(lexer_state.line_ctr.char_pos - 5, 0):]
        lexer_state.line_ctr.char_pos -= len(throw_away)
        new = provider()
        lexer_state.text = keep + new

    interactive = parser.parse_interactive("", start)
    token = None
    while True:
        ensure_preview(interactive.lexer_thread.state)
        try:
            token = next(interactive.lexer_thread.lex(interactive.parser_state))
        except StopIteration:
            break
        else:
            interactive.feed_token(token)
            yield token
    interactive.feed_eof(token)


def iter_parser_stream(*args, transformer, **kwargs):
    queue = Queue()
    if not kwargs.setdefault("parser", "lalr") == "lalr":
        raise ValueError("The lalr parser is required")
    # The transformer argument is a factory: it is called with queue.put, so the
    # transformer can push finished results out while parsing is still running.
    kwargs['transformer'] = transformer(queue.put)
    parser = Lark(*args, **kwargs)

    def parse(stream, start=None):
        for token in parse_from_stream(parser, stream, start):
            while not queue.empty():
                yield queue.get()
        while not queue.empty():
            yield queue.get()

    return parse
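For context, a hypothetical usage sketch of the two helpers above; the grammar, the EmitValues transformer and the file name are made up for illustration:

from functools import partial

from lark import Transformer

my_grammar = r"""
    start: value*
    value: NUMBER
    NUMBER: /[0-9]+/
    %ignore /\s+/
"""

# Hypothetical transformer factory: iter_parser_stream instantiates it with
# queue.put, so every reduced `value` rule is handed to the consumer immediately.
class EmitValues(Transformer):
    def __init__(self, emit):
        super().__init__()
        self.emit = emit

    def value(self, children):
        self.emit(children[0])
        return children[0]

parse = iter_parser_stream(my_grammar, transformer=EmitValues)

with open("numbers.txt") as f:
    # Feed the lexer in 64-character chunks; values are yielded as soon as they
    # are reduced, without the whole file ever being in memory.
    for value in parse(partial(f.read, 64)):
        print(value)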
Hello @swamidass and @MegaIng ,
I wonder if maybe the best solution is to use mmap. I imagine it's the most efficient option, and we already support bytes as input, so it will probably only require minor changes. What do you think?
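For illustration, a minimal sketch of what the mmap route might look like, assuming use_bytes=True and a toy grammar; whether Lark's lexer accepts an mmap object directly (rather than a bytes copy) is exactly the kind of "minor change" that would need checking:

import mmap

from lark import Lark

# Toy byte-oriented grammar, only for illustration.
parser = Lark(r"""
    start: WORD+
    WORD: /[a-z]+/
    %ignore /\s+/
""", parser="lalr", use_bytes=True)

with open("huge_input.txt", "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as buf:
        # The OS pages the file in and out on demand, so the whole file need not
        # be resident in memory at once. Whether parse() accepts an mmap object
        # as-is, or needs small adjustments, is the open question in this thread.
        tree = parser.parse(buf)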
But if unicode is a must, I don't see a better approach than what MegaIng presented, which is to constantly grow your text buffer as the lexer advances. The obvious downside to that is that each next token must be completely within that buffer, or the parse will fail.
Using the regex library instead of the stdlib re library, it might be possible to make use of partial matches to know whether or not you need more buffer area. But that would certainly require a decently complex custom lexer.
https://stackoverflow.com/questions/4634376/python-regex-parse-stream
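To make the partial-match idea concrete, here is a small sketch using the third-party regex module; the buffer/refill logic is invented for illustration and is not part of Lark:

import regex

NUMBER = regex.compile(r"[0-9]+")

def match_with_refill(pattern, buffer, pos, refill):
    """Match `pattern` at `pos`; whenever the match could still be extended past the
    end of the buffer (a partial match), pull in another chunk and retry, so a token
    is never cut off at a chunk boundary."""
    while True:
        m = pattern.match(buffer, pos, partial=True)
        if m is None or not m.partial:
            return m, buffer
        chunk = refill()
        if not chunk:  # stream exhausted: settle for whatever matches now
            return pattern.match(buffer, pos), buffer
        buffer += chunk

# Example: a number split across two chunks is still matched as one token.
chunks = iter(["3,", ""])
m, buffer = match_with_refill(NUMBER, "12", 0, lambda: next(chunks))
print(m.group())  # -> 123

A real custom lexer would have to do this for every candidate terminal and still respect longest-match priorities, which is where the "decently complex" part comes in.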
In summary, it seems this functionality is technically feasible but not easy to implement. It's useful too, so adding the feature to the main library seems worthwhile.
Suggestion
Sometimes we need to parse and transform a very large stream that can't fit into memory all at once. Lark has a way of transforming the parser's output without constructing the tree (see https://lark-parser.readthedocs.io/en/latest/json_tutorial.html#part-5-optimizing), using this API:
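This is roughly the pattern shown in that tutorial (json_grammar and TreeToJson are defined there; the file name is just a placeholder):

from lark import Lark

# json_grammar and TreeToJson come from the linked tutorial. Because the
# transformer is attached to the LALR parser, values are built during parsing
# and no intermediate Tree is ever constructed.
json_parser = Lark(json_grammar, start='value', parser='lalr', transformer=TreeToJson())

with open('big_file.json') as f:
    result = json_parser.parse(f.read())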
The problem here is that f.read() requires reading the whole string into memory. Ideally, the API would allow something like this:
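A hypothetical sketch of what that could look like; parse_stream is not an existing Lark method, and the names here are only illustrative:

from lark import Lark

# Hypothetical API: the parser pulls text from the file object incrementally,
# and the transformer emits finished objects as soon as they are complete.
json_parser = Lark(json_grammar, parser='lalr', transformer=TreeToJson())

with open('big_file.json') as f:
    for obj in json_parser.parse_stream(f):   # parse_stream: proposed, not real
        handle(obj)                           # handle() is a placeholder consumer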
Here, TreeToJson could be implemented to (for example) emit objects that match a jsonpath pattern. This would require extending the transformer API so that the transformer has a way of notifying the parser when an object is ready to be emitted.
Less ideal, but still usable, would be syntax like the following. The advantage would be that the parser-transformer internals might not need to change much at all, though it would require users of the library to be clever about how they write the transformer object to make this possible:
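As a hedged sketch of that second option, a user-written transformer could push completed objects into a queue as a side effect, so no new parser API is needed beyond a streaming driver like parse_from_stream above (json_grammar again refers to the tutorial's grammar; the rule names are assumptions):

from queue import Queue

from lark import Lark, Transformer

results = Queue()

# Hypothetical "clever" transformer: instead of returning one big result, it
# publishes every completed JSON object to a queue during the reduction itself.
class StreamingTreeToJson(Transformer):
    def pair(self, children):
        key, value = children
        return key, value

    def object(self, children):
        obj = dict(children)
        results.put(obj)   # the consumer drains this queue while parsing continues
        return obj

json_parser = Lark(json_grammar, parser='lalr', transformer=StreamingTreeToJson())
# A streaming driver (e.g. parse_from_stream above) would feed json_parser from a
# file or socket, while the caller reads finished objects from `results`.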
If this could be done, it would make Lark capable of parsing infinite inputs (e.g. an endless stream of JSON objects) with limited memory, along with the performance benefits that come from this pattern.
Describe alternatives you've considered
It might be possible to implement this as a function or library that operates on top of lark. However, such a function would require some intimate knowledge of the non-public API of lark to work. The ideal situation would be for Lark itself to implement this function.