DRMacIver opened this issue 1 year ago
Lark isn't made for scanning text. However, that question has been asked many times. To my recollection, there are two ways to go about this:
1) Use Earley with a prefix and suffix of `/.*/`. As you mentioned, this is very inefficient for large inputs, but it's relatively simple to implement.
2) Attempt to parse at every possible starting position until a successful parse is achieved. This can be done with either LALR or Earley. It isn't too difficult to do either, and it can be reasonably efficient if there's a low number of "false positives" (see the sketch just after this list).
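For illustration, here's a rough sketch of approach 2 built on Lark's interactive LALR parser. The `scan` helper and its `(start, end, tree)` return format are my own invention, not an existing Lark API, and it touches the semi-internal `lexer_thread` attribute:

```python
import lark
from lark.exceptions import UnexpectedInput

def scan(parser: lark.Lark, text: str):
    """Yield (start, end, tree) for non-overlapping matches of the grammar.

    Sketch only; requires parser='lalr'. Worst case O(n^2), but fast in
    practice when false starts fail after a token or two.
    """
    pos = 0
    while pos < len(text):
        # Note: the slice copies the tail of the input; see the discussion
        # further down about avoiding copies on large inputs.
        ip = parser.parse_interactive(text[pos:])
        best = None  # (end offset, parser checkpoint, last token fed)
        try:
            # Feed tokens one at a time; whenever the parser could accept
            # end-of-input here, remember a checkpoint (longest match wins).
            for token in ip.lexer_thread.lex(ip.parser_state):
                ip.feed_token(token)
                if "$END" in ip.accepts():  # accepts() is not cheap
                    best = (token.end_pos, ip.copy(), token)
        except UnexpectedInput:
            pass  # dead end; fall back to the best match seen so far
        if best is not None:
            end, checkpoint, last = best
            yield pos, pos + end, checkpoint.feed_eof(last)
            pos += end  # continue scanning after the match
        else:
            pos += 1  # no match starting here; slide one character forward
```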
Similar issues:
Ah, apologies. I should have searched previous issues for duplicates.
I don't think either of those are likely to be viable for me unfortunately, as I need this to work on quite large inputs, so I think I'm back to implementing my own.
Thanks for the quick response!
If all your terminals in the grammar are non-conflicting (i.e. using `lexer='basic'` works), then it should be very doable to implement a quite efficient search method.
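A quick way to check that precondition is to construct the parser with the basic lexer and try it on typical inputs (the grammar here is just a placeholder):

```python
import lark

grammar = r"""
start: "{{" TEXT "}}"
TEXT: /[^={}|]+/
"""

# lexer='basic' tokenizes without any parser context; if your usual inputs
# still parse correctly in this mode, the terminals are non-conflicting in
# the sense described above.
parser = lark.Lark(grammar, parser="lalr", lexer="basic")
print(parser.parse("{{hello}}"))
```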
I created a small branch of Lark that includes a `.scan` method which might be able to handle what you want: https://github.com/MegaIng/lark/tree/add-scanning. If it generally does what you want, we could look into polishing it up a bit. Most importantly, it currently creates copies of the input text.
This small example works:
```python
import lark

grammar = r"""
template: "{{" TEXT ("|" argument)* "}}"

argument: TEXT "=" TEXT -> named_argument
        | TEXT          -> numbered_argument

TEXT: /[^={}|]+/
"""

# .scan comes from the add-scanning branch linked above
parser = lark.Lark(grammar, start="template", parser="lalr")

for result in parser.scan(open("python.wiki", encoding="utf-8").read()):
    print(result)

for result in parser.scan("{{a}}{{b}}"):
    print(result)
```
(it extracts templates from wikitext)
@MegaIng Btw, if you tidy up that implementation a bit and show that it passes a few stress tests, I'm inclined to add it as an official API.
I'm also curious if in the search implementation, maybe doing `text = text[:best]` might improve the performance.
> @MegaIng Btw, if you tidy up that implementation a bit and show that it passes a few stress tests, I'm inclined to add it as an official API.
Will do at some point.
> I'm also curious if in the search implementation, maybe doing `text = text[:best]` might improve the performance.
I think it would be better to avoid creating copies of the input string. A non-absurd use case for this feature is searching log files that might be multiple hundred MB large. Instead, I would like to add `start` and `end` parameters all over our parse APIs, which also fits nicely with regex, since those APIs accept the same arguments. Potentially it might even make sense to expose our `LineCounter` class and have `start` and `end` accept one as a parameter, so that the reported line numbers are correct.
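For comparison, this is the shape of the `re` API alluded to: compiled patterns take `pos`/`endpos`, so a search can be restricted to a window without slicing (and thus copying) a huge input. The proposed Lark `start`/`end` parameters would presumably behave analogously:

```python
import re

pattern = re.compile(r"\{\{[^={}|]+\}\}")
text = "noise {{a}} noise {{b}} noise"

# pos/endpos restrict the search window without building a text[6:23] copy:
m = pattern.search(text, 6, 23)
print(m.span(), m.group())  # -> (6, 11) '{{a}}'
```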
Suggestion
I would like to be able to use Lark to search for all substrings of a string matching some Lark grammar, e.g. something along the lines of the following method on the Lark class:
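A sketch of what such a method could look like (the name, signature, and return format here are only illustrative):

```python
from typing import Iterator, Tuple

from lark import Tree

# Hypothetical method on the Lark class:
def scan(self, text: str) -> Iterator[Tuple[Tuple[int, int], Tree]]:
    """Yield ((start, end), tree) for each substring text[start:end]
    that the grammar's start rule matches."""
    ...
```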
(I'm not wed to this exact interface)
Describe alternatives you've considered
Currently my least bad alternative for something like this is to write my own specialised parser tool, which I'd rather not do.
The more likely solution I'd adopt is to just use regexes and hope I can work around the cases that need more structure than they support.
I've tried doing this with Lark using a heavily ambiguous grammar that simply allows arbitrary amounts of noise at the beginning and end, then picking the match out of the ambiguous parse forest. It doesn't seem to work very reliably (and I expect it's highly inefficient as a solution), but I may investigate this option further.
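Concretely, that attempt looks roughly like the following (the grammar is illustrative; this is approach 1 from the first comment, with all of its inefficiency):

```python
import lark

# Pad the real rule with explicit noise on both sides and let Earley sort
# out the resulting ambiguity. NOISE matches any single character.
grammar = r"""
start: noise template noise
noise: NOISE*
template: "{{" TEXT "}}"
TEXT: /[^={}|]+/
NOISE: /./s
"""
parser = lark.Lark(grammar, parser="earley")
tree = parser.parse("some noise {{match}} more noise")
for t in tree.find_data("template"):
    print(t)  # prints the matched template subtree
```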
Additional context
If you're curious, the specific use case I have for this is implementing test-case reduction (delta debugging, C-Reduce, etc.). The idea is to have lots of little grammars that define common structures that might appear in various file formats (e.g. arithmetic expressions), then parse out the parts of the file that match those grammars and apply transformations to only that region, based on the parse. This means that I need to be able to parse arbitrary unknown inputs, while only caring about the bits of them that match a particular grammar.
I don't really have a good sense of Lark internals (though I've happily used it a few times, and am... certainly not great at parsing theory, but also not completely ignorant), so I don't know where what I'm asking falls on a spectrum from "basically trivial" to "literally impossible with the current implementation". I'm happy to have a go at the implementation work if there's no other interest in it, but I would appreciate some pointers on where to start and how difficult it's likely to be.