igordejanovic / parglare

A pure Python LR/GLR parser - http://www.igordejanovic.net/parglare/
MIT License

A terminal that doesn't match specific strings #123

Open stuartlangridge opened 4 years ago

stuartlangridge commented 4 years ago

I'd like to define a terminal that matches words except specific words.

This is why: trying this code

import parglare

grammar = r"""
Sentence: The? object_name=Identifier "is" A Identifier DOT;
Identifier: IdentifierWord+;

terminals

The: /(?i)The/;
A: /(?i)An?/;
IdentifierWord: /\w+/;
DOT: ".";
"""

text = """The apple is a fruit."""

g = parglare.Grammar.from_string(grammar)
p = parglare.Parser(g, debug=True, consume_input=False)
result = p.parse(text)
print(result)

fails, as expected, with Can't disambiguate between: <IdentifierWord(The)> or <The(The)>, because IdentifierWord matches every word. So what I'd like is for IdentifierWord not to match certain words, such as "the" and "a". However, when I change the definition of the IdentifierWord terminal to IdentifierWord: /(?!The|a)\w+/; so that a negative lookahead excludes those words, the above code fails with

Error at 2:4:"\nThe **> apple is a" => Expected: IdentifierWord but found <A(a)>

I don't understand why this is. It's finding the "a" at the beginning of "apple" and treating it as the A terminal. I don't know if I'm solving this the best way; is there some other way I should be structuring this sort of grammar, or maybe some better way of defining a terminal that matches all words except certain ones?

igordejanovic commented 4 years ago

The word apple is not matched by (?!The|a)\w+. That is because the negative lookahead rejects any position where the text starts with The or a, and apple starts with a. What you need to do is make sure that the negative lookahead takes the word boundary into account. Try this: (?!(The|a)\b)\w+.
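The difference between the two patterns can be checked outside parglare with plain Python re. This is only a minimal sketch: it assumes the regex terminal is matched at the current input position in the same way re's match() works, and the word list and variable names are made up for illustration.

```python
import re

# Original attempt: the lookahead rejects anything that merely
# STARTS with "The" or "a", so "apple" is rejected too.
naive = re.compile(r"(?!The|a)\w+")

# Suggested fix: \b requires a word boundary right after "The" or "a",
# so only the whole words "The" and "a" are rejected.
bounded = re.compile(r"(?!(The|a)\b)\w+")

# Hypothetical sample words, chosen to show both behaviours.
words = ["apple", "a", "The", "Theory", "fruit"]

naive_results = {w: bool(naive.match(w)) for w in words}
bounded_results = {w: bool(bounded.match(w)) for w in words}

for w in words:
    print(f"{w!r}: naive={naive_results[w]} bounded={bounded_results[w]}")
```

Note that both patterns are case-sensitive, while the The and A terminals in the grammar use (?i). A lowercase "the" or an uppercase "A" would therefore still be accepted by IdentifierWord; prefixing the pattern with (?i) would be one way to cover that, if it is needed.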

stuartlangridge commented 4 years ago

aha! Again, much appreciated; I understand now what I was doing wrong. Thank you!