Problem parsing with LALR

SupraSummus commented 5 years ago

I'm not sure if it's fault of grammar incompatible with LALR or is it a bug. Here it goes:

Python 3.7.0
lark revision 3c1c867b5f59397757dfec65d98d96e15b7de6e2

grammar.lark (extremely boiled down dhall grammar)

whitespace_chunk: " " | "\n"
whitespace_optional: whitespace_chunk*

reserved_func: "f" whitespace_optional
open_parens: "(" whitespace_optional
close_parens: ")" whitespace_optional

expression_application: expression_parens (whitespace_chunk expression_parens)*
expression_parens: expression_reserved | (open_parens expression_application 
close_parens)
expression_reserved: reserved_func

start: whitespace_optional expression_application

input f (f f)
earley can parse (lark.Lark(open('grammar.lark'), parser='earley').parse('f (f f)'))
lalr fails to parse (lark.Lark(open('grammar.lark'), parser='lalr').parse('f (f f)')) - raises lark.exceptions.UnexpectedToken
standalone lalr also fails to parse

If this is because of this grammar being not parsable by LALR, I wonder if it would be possible to add error messages telling this. Currently I get something like "Unexpected token '(', expected one of ...". For me, ideally constructing parser from grammar of unsupported class should raise "Can't do it. Fix your grammar to be lalr.".

SupraSummus commented 5 years ago

I managed to further reduce the grammar:

import lark

grammar = """
a: "a"
as: a*
start: as a "f"
"""
input_string = 'af'

# this is ok
lark.Lark(grammar, parser='earley').parse(input_string)
# this fails to parse
lark.Lark(grammar, parser='lalr').parse(input_string)

bottom line raises

lark.exceptions.UnexpectedToken: Unexpected token Token(F, 'f') at line 1, column 2.
Expected one of: 
    * A

however this grammar behaves as expected:

lark.Lark('start: "a"* "a" "f"', parser='lalr').parse('af')

SupraSummus commented 5 years ago

also different exception is raised in this script (difference is in as rule):

import lark

grammar = """
a: "a"
as: "a"*
start: as a "f"
"""
input_string = 'af'

lark.Lark(grammar, parser='lalr').parse(input_string)

raises

lark.exceptions.UnexpectedCharacters: No terminal defined for 'f' at line 1 col 2

af
 ^

erezsh commented 5 years ago

As you already figured out, it's probably due to Shift/Reduce collisions being automatically resolved as Shift.

I just added a way to see the current collisions (issue #258 )

If you need a way to control the resolution (i.e. to force a reduce) I'll consider adding it as a feature.

I see your point about warning that a grammar isn't "LALR compatible", but at the same time, I don't want to harass the users, who 99% of the time just want to resolve to shift (It's a made up figure, but this is the first time this issue has come up in months, and PLY uses the exact same trick).

SupraSummus commented 5 years ago

Hi, thank you for response. (You are awesome with all of this patience responding to users questions btw ;-) )

Really I've got little idea of how LALR works, so I've checked my grammar against other LALR generator and there wasn't any conflicts.

generator: http://jsmachines.sourceforge.net/machines/lalr1.html input:

S -> AS A f
AS -> ''
AS -> AS A
A -> a

However this grammar has two conflicts (difference in second AS production):

S -> AS A f
AS -> ''
AS -> A AS
A -> a

So, I suspect lark is unwinding a* into right-recursive rule and it results in conflicts.

I'll try to explicitly specify rules for as (without *) and see if that helps.

erezsh commented 5 years ago

Thanks! Helping users is one of the best parts :)

Lark unwinds a* into a left-recursive rule.

I have to say, I know the LALR algorithm in & out by now, and I still find "LR tables" confusing. I think it's a lot easier to understand as a stack of states, where a state is a set of rule-pointers (a rule-pointer is a dot that signifies position in a specific rule (like a counter). It's a little harder to visualize than a table, but I think it's easier to reason about once it clicks.

So, your grammar:

a: "a"
as: a*
start: as a "f"

actually expands into:

a: A
_as: _as a
      | a
as: _as
    |
start: as a "f"

(You could make _as the empty rule, but I found that it's faster to take it out of the recursion)

So when you input A, the grammar automatically shifts _as, just because it can, although the correct action would have been to reduce the empty as rule. And then it finds itself in an "invalid" state, in the sense that it can't parse the rest of your input.

The same idea happens if you input multiple As (but separate, which is why there are two conflicts). Whenever it sees A it stays in the loop, and it needs to see something different to know it has to reduce.

Btw, changing it to "always reduce" is just going to create other inputs that fail to parse.

So without more knowledge about the problem you're trying to solve, my suggestion is to change your grammar, perhaps by abstracting away these collisions and handling them later in the tree (i.e. post-processing), or using Earley.

SupraSummus commented 5 years ago

After long battle against LALR parsing tables I think I've lost. The language I'm trying to encode may not be LALR or maybe I'm just not understanding LALR enough. Currently I'm switching to GLR parser and it seems a lot easier (and slower probably, but still faster than earley).

@erezsh thank you very much for support and explaining things ;)

erezsh commented 5 years ago

You can use Earley. It's as capable as GLR, though perhaps not as fast.

After long battle against LALR parsing tables

As I said, they aren't a good way to understand the algorithm.

lark-parser / lark

Problem parsing with LALR #256