Open stuartlangridge opened 4 years ago
In Python regexes, the dot (.) by default doesn't match newlines, so it won't cross line boundaries. To change that you can use the (?s) inline flag (see re.DOTALL in the Python docs). So your grammar will work correctly with this:
Anything: /(?s).*/;
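The difference is easy to check with Python's re module on its own. The sketch below uses a made-up two-line string (sample is an assumption for illustration, not part of the grammar) and compares the default behaviour, the inline (?s) flag, and the re.DOTALL argument:

```python
import re

# Hypothetical sample input with an embedded newline.
sample = "first line\nsecond line"

# By default the dot stops at the newline.
default_match = re.match(r".*", sample).group()

# The inline (?s) flag lets the dot match newlines too.
inline_match = re.match(r"(?s).*", sample).group()

# Passing re.DOTALL as a flag has the same effect as (?s).
flag_match = re.match(r".*", sample, re.DOTALL).group()

print(repr(default_match))
print(repr(inline_match))
print(repr(flag_match))
```

The inline form is the convenient one here because a grammar terminal only gives you the pattern text, not a place to pass flags.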
BTW, here is the right place to ask questions about parglare.
Ah, now, I tried (?s) (this bug report was originally going to mention DOTALL until I actually read the re documentation and discovered the inline (?s) version, which I didn't know existed :-)) but when I tried it I still got errors, presumably because I don't quite understand it. Example:
import parglare
grammar = r"""
Program: al=AuthorLine sentences=Sentences;
AuthorLine: title=Identifier "by" author=Identifier DOT;
Sentences: Sentence*;
Sentence: Anything DOT;
Identifier: IdentifierWord*;
terminals
IdentifierWord: /\w+/;
DOT: ".";
Anything: /(?s).*?/;
"""
text = """
Program by Stuart.
This is sentence one.
This is sentence two
which has newlines in.
"""
g = parglare.Grammar.from_string(grammar)
p = parglare.Parser(g, debug=True)
result = p.parse(text)
This fails with error:
parglare.exceptions.ParseError: Error at 4:0:" Stuart.\n\n **> This is se" => Expected: Anything or STOP but found <IdentifierWord(This)>
I don't know how to tell parglare "just swallow up the rest of the document, I don't care about parsing it", or "please only detect an IdentifierWord in the context of an AuthorLine and once you've got the AuthorLine, stop parsing" -- I can't boost or decrease the relevance of IdentifierWord with {1} or {99} because it's a terminal, and even then I want to boost it while parsing an AuthorLine and decrease it when not, which I don't understand how to do. Maybe I'm attacking this problem completely the wrong way?
The problem is that Anything collects... well, anything, even dots :) so the Sentence rule never matches, as it expects DOT after Anything. You can do this:
Anything: /(?s)[^\.]*/;
which means Anything is anything except a dot.
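To see why the three variants of Anything in this thread behave so differently, it may help to try them as plain Python regexes. This is only a sketch using the re module, not parglare itself, and body is a hypothetical sample string:

```python
import re

# Hypothetical sample: two sentences, the first spanning a newline.
body = "This is sentence two\nwhich has newlines in. Another one."

# Greedy dot with (?s): swallows everything, including the dots.
greedy = re.match(r"(?s).*", body).group()

# Lazy dot with (?s): with nothing after it in the pattern, it matches
# as little as possible, which is the empty string.
lazy = re.match(r"(?s).*?", body).group()

# Negated character class: stops right before the first dot.
no_dot = re.match(r"(?s)[^\.]*", body).group()

print(repr(greedy))
print(repr(lazy))
print(repr(no_dot))
```

Only the last pattern leaves the dot unconsumed, so a following DOT has something to match. As a side note, a negated character class already matches newlines in Python, and a dot inside brackets needs no backslash, so [^.]* on its own should match the same text.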
Another feature you might find useful, depending on what you are trying to achieve, is incomplete parsing.
Incomplete parsing looks like exactly what I want! Thank you!
I have a document which contains a heading, which is a quoted string, and then a series of "sentences" which end with a "." and may have newlines in. I'd like to parse the document into Heading and Sentences. I tried to do it this way:
However, this fails with
parglare.exceptions.ParseError: Error at 6:0:"ence one.\n **> This is se" => Expected: DOT but found <Anything(This is sentence two)>
All I care about is the Heading, and parsing the Body into separate sentences, but I can't work out how to do that. What's the best way to express this in a parglare grammar? The sentences can contain anything at all; I don't need a structure or parsing for them at this stage, just a list with
["This is sentence one.", "This is sentence two which has newlines in."]
as the return value; sentences might contain any characters at all. (Apologies if this isn't actually an issue, but I hope it's the best place to ask questions about parglare. I'm happy to ask it somewhere else if that's better.)
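For reference, the target list can be reproduced outside parglare with a plain regex. This is only a sketch of the desired result, not a parglare grammar. It assumes that every "." ends a sentence and that newlines inside a sentence should be collapsed to single spaces, as in the expected list above; body is the sentence part of the sample document:

```python
import re

# Sentence part of the sample document (heading already removed).
body = """
This is sentence one.
This is sentence two
which has newlines in.
"""

# Each sentence is a run of non-dot characters followed by a dot.
# A negated character class matches newlines, so no (?s) is needed.
raw_sentences = re.findall(r"[^.]+\.", body)

# Collapse internal whitespace (including newlines) to single spaces.
sentences = [" ".join(s.split()) for s in raw_sentences]

print(sentences)
```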