igordejanovic / parglare

A pure Python LR/GLR parser - http://www.igordejanovic.net/parglare/
MIT License
136 stars 32 forks source link

Lexical disambiguation not working as expected #146

Open bramhaag opened 1 year ago

bramhaag commented 1 year ago

Description

I am trying to parse a language with a large amount of ambiguity. To resolve the ambiguity, I have assigned priorities to my terminals. It would appear that not all parses are considered, and valid input fails because of that.

What I Did

I've reduced my grammar to the following:

grammar.pg ``` Program: Stmt+ dot; Stmt: (add | ADD) id+ (to | TO) id; terminals dot: "."; ADD: "ADD" {10}; TO: "TO" {10}; id: /[a-zA-Z]+/ {7}; add: "add" {5}; to: "to" {5}; ```

The input I'm trying to parse is: ADD A to B ADD C TO D ADD E to F.. In this language, there are no reserved keywords and the keywords are not case sensitive, nor is there a required statement terminator character. The disambiguation strategy that I've tried to implement using lexing priorities:

  1. Uppercase words are considered keywords by default (priority 10)
  2. All other words are considered identifiers (priority 7)
  3. If a keyword is required, lowercase words should be considered a keyword (priority 5)

According to my strategy, I expect the following result: [ADD [A to B ADD C] TO D], [ADD [E] to F]., or at least any result.

Instead, I get the following error:

parglare.exceptions.ParseError: Error at 1:32:"ADD E to F **> ." => Expected: TO or id or to but found <dot(.)>

What I expected to happen is that once trying to use the final to as an identifier failed, the parser would retry it as a keyword instead, as this terminal has a lower priority. This never seems to happen however. Is this intentional, and if so, how could I best implement my disambiguation strategy?

If I use the GLRParser class and give the id and to terminals the same priority, I do get a parse forest with two paths, one of which is the expected one.

igordejanovic commented 1 year ago

LR parsing is deterministic, i.e. it never backtracks and the implementation used in parglare is LALR(1) so the parser always decide what to do by looking at just one token of lookahead. GLR is nondeterministic, it will investigate all possibilities. Thus, when you need larger lookahead to disambiguate or if your language is inherently ambiguous you must use GLR.

Lexing is done during parsing and is the same for both parsing strategies.

When the parser is at this position:

ADD E * to F.

It will resolve to as id given that the id has higher priority. The parser doesn't look at the rest of the input behind to. Thus, to terminal will never be recognized within this configuration as lexical disambiguation is done at the time of recognition.

If priority is used for terminals lexical disambiguation will always chose terminal of higher priority regardless whether LR or GLR is used. So the proper strategy is, as you have already done, to remove priorities in id and to and let GLR investigate all solutions. Then, you can use dynamic disambiguation strategies to remove invalid trees from the forest.

There is a plan to implement dynamic disambiguation strategies that will be specified in the grammar and be used to remove less desired trees from the forest when there is more solutions but I don't know when I'll find time to work on it.