lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
MIT License

weird terminal priority? #1256

Open ornariece opened 1 year ago

ornariece commented 1 year ago

(not 100% sure this is a bug).

import lark

parser = lark.Lark(
    """
%ignore WS_INLINE

%import common (WS_INLINE)

DIGIT: "0" | "1"

start: a a b

a: DIGIT
b: "0"
    """,
    parser='lalr',
)

parser.parse('0 0 0')

fails with

lark.exceptions.UnexpectedToken: Unexpected token Token('__ANON_0', '0') at line 1, column 3.
Expected one of: 
    * DIGIT

which would mean that the first 0 gets properly matched to a DIGIT, but the second one isn't?

this might not seem like a big deal, but it becomes problematic with imports: in my case, i'm using SIGNED_NUMBER in two different imports, which means they are considered different terminals, and because of this weird terminal priority behavior, i get a similar error.

erezsh commented 1 year ago

Yeah, it's a bit of a confusing situation. For example, this works:

%ignore WS_INLINE

%import common (WS_INLINE)

DIGIT: "0" | "1"

start: a b b

a: DIGIT
b: /0/

because now DIGIT doesn't appear in the follow-set of a, whereas in your example both DIGIT and "0" do.

So, I'm not sure if there's a quick solution to this issue. If you can't switch to Earley, try to restructure your grammar so that only one of the terminals is in the follow-set of the problematic rule, and not both.
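
To make the follow-set explanation above concrete, here is a toy model in plain Python. It is not Lark's implementation: it only handles a flat rule like `start: a a b` where every sub-rule is a single terminal, the name ZERO is a stand-in for the anonymous "0" terminal, and the tie-break order (the literal beats DIGIT) is an assumption chosen to be consistent with the error message in the report. The idea it sketches is that the lexer picks the next token from the terminals that may follow the previous rule, so a collision only appears when both terminals are in that set.

```python
import re

# Toy terminals: name -> regex. ZERO stands in for the anonymous "0" terminal.
TERMINALS = {"DIGIT": r"[01]", "ZERO": r"0"}
# Assumed tie-break order when several accepted terminals match the same text.
TIE_BREAK = ["ZERO", "DIGIT"]


def follow_sets(sequence, rule_terminal):
    """Follow sets for a flat rule `start: r1 r2 ...`, one terminal per sub-rule."""
    follow = {rule: set() for rule in sequence}
    for current, nxt in zip(sequence, sequence[1:]):
        follow[current].add(rule_terminal[nxt])
    return follow


def toy_parse(sequence, rule_terminal, words):
    """Lex each word using only the terminals accepted at that point."""
    if len(words) != len(sequence):
        raise ValueError("wrong number of tokens")
    follow = follow_sets(sequence, rule_terminal)
    accepted = {rule_terminal[sequence[0]]}  # only the first rule's terminal
    tokens = []
    for position, (rule, word) in enumerate(zip(sequence, words)):
        matching = [
            name for name in TIE_BREAK
            if name in accepted and re.fullmatch(TERMINALS[name], word)
        ]
        if not matching:
            raise ValueError(f"no terminal matches {word!r} at token {position}")
        chosen, expected = matching[0], rule_terminal[rule]
        if chosen != expected:
            raise ValueError(
                f"unexpected {chosen} at token {position}, expected {expected}"
            )
        tokens.append(chosen)
        accepted = follow[rule]  # the next token is lexed from this rule's follow set
    return tokens


def try_parse(sequence, rule_terminal, words):
    """Return the token names, or the error message if the toy parse fails."""
    try:
        return toy_parse(sequence, rule_terminal, words)
    except ValueError as error:
        return str(error)


words = ["0", "0", "0"]
# start: a a b  with a: DIGIT, b: "0"  -> both terminals follow `a`
reported = try_parse(["a", "a", "b"], {"a": "DIGIT", "b": "ZERO"}, words)
# start: a b b  with a: DIGIT, b: /0/  -> only one terminal follows `a`
variant = try_parse(["a", "b", "b"], {"a": "DIGIT", "b": "ZERO"}, words)
# start: a a b  with both rules sharing one terminal
shared = try_parse(["a", "a", "b"], {"a": "DIGIT", "b": "DIGIT"}, words)
print(reported, variant, shared, sep="\n")
```

In this model the reported grammar fails on the second token, while the `a b b` variant and the shared-terminal variant both go through, because in those two cases only one terminal is ever a candidate after `a`.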

ornariece commented 1 year ago

i cannot switch to earley for performance reasons. and there's no way around having such a rule in my grammar, sadly. so my only workaround is to define both a and b in the same grammar file, so that i avoid a differentiation between the terminal used by a and the terminal used by b - a differentiation caused by the terminal being defined in a grammar that is imported.

as i said, i'm using SIGNED_NUMBER in two different grammar files, which means they are considered different terminals. so i guess another solution to my problem would be to refactor the way terminals are imported? currently, if SIGNED_NUMBER were to be imported in grammar g1 and also in grammar g2, it would become a terminal called g1__SIGNED_NUMBER, and another called g2__SIGNED_NUMBER. now if a uses g1__SIGNED_NUMBER and b uses g2__SIGNED_NUMBER, the error happens... while they are in fact using the same terminal; there should be no conflict.

erezsh commented 1 year ago

Yeah, I agree, I don't see a reason why we shouldn't merge them into the same terminal. It might raise some namespacing issues, for anyone relying on the terminal names. But if that's the case, I think there's a way to solve that too.
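
As a rough illustration of the merging idea discussed above, the sketch below collapses namespaced terminals that share an identical pattern into one canonical terminal and keeps an alias map so the old names can still be resolved. Everything here is hypothetical: `merge_identical_terminals` is not a Lark function, the patterns are simplified stand-ins rather than the real definitions from Lark's common grammar, and picking the first-seen name as the canonical one is just one possible policy.

```python
def merge_identical_terminals(terminals):
    """Collapse terminals with identical patterns (hypothetical helper).

    `terminals` maps a namespaced terminal name to its pattern string.
    Returns (merged, aliases): `merged` keeps one terminal per distinct
    pattern, `aliases` maps every original name to the name that was kept.
    """
    canonical_for_pattern = {}
    aliases = {}
    for name, pattern in terminals.items():
        # The first name seen for a pattern becomes its canonical name.
        canonical = canonical_for_pattern.setdefault(pattern, name)
        aliases[name] = canonical
    merged = {name: pattern for pattern, name in canonical_for_pattern.items()}
    return merged, aliases


# Stand-in patterns, for illustration only.
terminals = {
    "g1__SIGNED_NUMBER": r"[+-]?\d+(\.\d+)?",
    "g2__SIGNED_NUMBER": r"[+-]?\d+(\.\d+)?",
    "g2__NAME": r"[a-z]+",
}
merged, aliases = merge_identical_terminals(terminals)
print(sorted(merged))
print(aliases["g2__SIGNED_NUMBER"])
```

With a single terminal left for both imports, rules a and b would refer to the same terminal, and the alias map is one way the namespacing concern could be handled for code that still looks up the old names.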