lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
MIT License
4.75k stars 401 forks source link

False shift/reduce conflicts? #1314

Closed davidmcnabnz closed 1 year ago

davidmcnabnz commented 1 year ago

Describe the bug

Shift/Reduce Conflict is being incorrectly detected. The grammar is definitely not ambiguous.

To Reproduce

Create an LALR parser on the grammar below, with debug turned on. Feed it with the grammar:

?start: section+
section: section_header section_item+
section_header: NAME ":"
section_item: NAME "=" NUMBER
NAME: /[a-z][a-z_0-9]*/

%import common.WS_INLINE
%import common.NUMBER
%ignore WS_INLINE

The LALR parser analyser is claiming that the terminal NAME presents a shift/reduce conflict for the rule:

 * <section : NAME COLON __section_plus_1>

The grammar is intending to parse text input such as:

section1:
 x = 22
 y = 19

section2:
 a = 44
 b = 6

The grammar provides only two legal places where a NAME terminal may appear:

  1. immediately before a colon, versus
  2. immediately before an equals sign

Is this a bug in Lark? Or is there a subtle ambiguity I'm missing in this grammar?

Either way, can anyone please advise a way to resolve or work around this?

davidmcnabnz commented 1 year ago

Following up, I've found a workaround. Since the files I'm parsing all have the colons immediately after the section titles, no spaces, I am now using a separate terminal which specifically includes the trailing colon.

?start: section+
section: section_header section_item+
section_header: SECTION_NAME -> section_header
section_item: NAME "=" NUMBER

NAME: /[a-z][a-z_0-9]*/
SECTION_NAME: /[a-z][a-z_0-9]*:/

%import common.WS_INLINE
%import common.NUMBER
%ignore WS_INLINE

Then, in the transformer's section_header() handler method, I just strip off the trailing colon.

This feels like a bit of a hack, but at least it's got me around the immediate issue.

erezsh commented 1 year ago

The grammar is not ambiguous as a CFG, but it is "ambiguous" in LALR(1) because there is only 1 token lookahead. Namely, at the end of each section, if it sees NAME it has to choose between starting a new section, or starting a new section_item.

Your solution is adequate, and is probably the easiest way to get around this problem.

davidmcnabnz commented 1 year ago

@erezsh Thanks for your review and reply.

Naively, I was unaware that LALR(1) could only look ahead 1 token. Is the '1 token' lookahead an inherent design limitation of LALR(1)? Or is it possible to get it looking further ahead.

I did try Earley, but this doesn't support the use of a Transformer during the parse.

erezsh commented 1 year ago

Using a transformer during the parse is mostly used for performance. If performance isn't a big issue, you might not need it.

Yes, it's an inherent limitation of LALR(1). Unfortunately Lark doesn't implement LALR(2) or above.

However, you could probably emulate it by using the interactive parser, since you can lookahead "manually" at each step of the parse, and influence the decisions it takes.

But, that's probably a very big overkill, if your problem is as simple as you described it.

davidmcnabnz commented 1 year ago

I'll take a good look at the 'interactive parser', since its power may be needed when I add in the more complex elements of the (inherited) language I'm needing to parse.

Thanks again. I'm grateful that LARK is a live project, with responsive support that keeps it at the top of the Python general parsing ecosystem.

erezsh commented 1 year ago

Thanks for the compliments!

For a real-life example of the interactive parser, have a look at this: https://github.com/geographika/mappyfile/blob/master/mappyfile/parser.py#L205