igordejanovic / parglare

A pure Python LR/GLR parser - http://www.igordejanovic.net/parglare/
MIT License

Swallowing up text in the parser #122

Open stuartlangridge opened 4 years ago

stuartlangridge commented 4 years ago

I have a document which contains a heading, which is a quoted string, and then a series of "sentences" which end with a "." and may have newlines in. I'd like to parse the document into Heading and Sentences. I tried to do it this way:

import parglare

grammar = r"""
Document: Heading Body;

Heading: QuotedString;
Body: Anything;

Sentence: Anything DOT;

terminals

QuotedString: /"(?P<qs>.*?)"/;
Anything: /.*/;
DOT: ".";
"""

text = """

"This is the heading"

This is sentence one.
This is sentence two
which has newlines in.
"""

g = parglare.Grammar.from_string(grammar)
p = parglare.Parser(g, debug=True)
result = p.parse(text)

However, this fails with parglare.exceptions.ParseError: Error at 6:0:"ence one.\n **> This is se" => Expected: DOT but found <Anything(This is sentence two)>.

All I care about is the Heading, and parsing the Body into separate sentences, but I can't work out how to do that; what's the best way to express this in a parglare grammar? The sentences can contain anything at all; I don't need a structure or parsing for them at this stage, just a list with ["This is sentence one.", "This is sentence two which has newlines in."] as the return; sentences might contain any characters at all.

(Apologies if this isn't actually an issue, but I hope it's the best place to ask questions about parglare. I'm happy to ask it somewhere else if that's better.)

igordejanovic commented 4 years ago

In Python regexes, . by default doesn't match newline characters, so it won't cross line boundaries. To change that, you can use the (?s) inline flag (see re.DOTALL in the Python docs). So your grammar will work correctly with this:

Anything: /(?s).*/;
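
As a rough illustration of the difference, here is a minimal sketch that uses Python's re module directly rather than parglare. The sample string is taken from the question, and the variable names are only for illustration; how parglare applies the terminal during parsing (whitespace skipping and so on) is not modelled here.

```python
import re

# One "sentence" from the question, spanning two lines.
text = "This is sentence two\nwhich has newlines in."

# Without DOTALL, "." does not match "\n", so the match stops at the line end.
plain = re.match(r".*", text).group()

# With the inline (?s) flag (equivalent to re.DOTALL), "." also matches "\n".
dotall = re.match(r"(?s).*", text).group()

print(repr(plain))
print(repr(dotall))
```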

igordejanovic commented 4 years ago

BTW, here is the right place to ask questions about parglare.

stuartlangridge commented 4 years ago

Ah, now, I tried (?s) (this bug report was originally going to mention DOTALL until I actually read the re documentation and discovered the inline (?s) version, which I didn't know existed :-)) but when I tried it I still got errors, presumably because I don't quite understand it. Example:

import parglare

grammar = r"""
Program: al=AuthorLine sentences=Sentences;
AuthorLine: title=Identifier "by" author=Identifier DOT;

Sentences: Sentence*;
Sentence: Anything DOT;
Identifier: IdentifierWord*;

terminals

IdentifierWord: /\w+/;
DOT: ".";
Anything: /(?s).*?/;
"""

text = """
Program by Stuart.

This is sentence one.
This is sentence two
which has newlines in.
"""

g = parglare.Grammar.from_string(grammar)
p = parglare.Parser(g, debug=True)
result = p.parse(text)

This fails with error: parglare.exceptions.ParseError: Error at 4:0:" Stuart.\n\n **> This is se" => Expected: Anything or STOP but found <IdentifierWord(This)>

I don't know how to tell parglare "just swallow up the rest of the document, I don't care about parsing it", or "please only detect an IdentifierWord in the context of an AuthorLine and once you've got the AuthorLine, stop parsing" -- I can't boost or decrease the relevance of IdentifierWord with {1} or {99} because it's a terminal, and even then I want to boost it while parsing an AuthorLine and decrease it when not, which I don't understand how to do. Maybe I'm attacking this problem completely the wrong way?

igordejanovic commented 4 years ago

The problem is that Anything collects... well, anything, even dots :) so the Sentence rule never matches, as it expects DOT after Anything. You can do this:

Anything: /(?s)[^\.]*/;

which means Anything is anything except dot.
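
To see why the pattern matters, the three variants from this thread can be compared with plain Python re. This is only a sketch outside parglare, assuming the body text from the example above; the final list comprehension is a hypothetical regex-only way to get the sentence list, not something parglare does for you.

```python
import re

body = "This is sentence one.\nThis is sentence two\nwhich has newlines in.\n"

# Greedy (?s).* swallows everything, dots included, so nothing is left for DOT.
greedy = re.match(r"(?s).*", body).group()

# Lazy (?s).*? with nothing after it is satisfied by the empty string.
lazy = re.match(r"(?s).*?", body).group()

# A negated class stops right before the first dot (and matches newlines anyway).
no_dot = re.match(r"(?s)[^\.]*", body).group()

# Regex-only sketch of the desired result: split on dots, normalize whitespace.
sentences = [" ".join(s.split()) for s in re.findall(r"[^.]*\.", body)]

print(repr(greedy))
print(repr(lazy))
print(repr(no_dot))
print(sentences)
```

Because a negated character class matches newlines on its own, the (?s) flag is harmless but not strictly needed in this pattern.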

Another feature you might find useful, depending on what you are trying to achieve, is incomplete parsing.

stuartlangridge commented 4 years ago

Incomplete parsing looks like exactly what I want! Thank you!