lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
MIT License
4.88k stars 414 forks source link

debugging help: lark.exceptions.UnexpectedCharacters: No terminal defined #456

Closed delzac closed 5 years ago

delzac commented 5 years ago

i'm struggling for hours over why an exception is being thrown. If anyone can provide some guidance i'll be really grateful! :)

I'm using the follow grammar in lark-parser to parse alphabets and roman numerals. The grammar is as follows:


DIGIT: "0".."9"
INT: DIGIT+
_L_PAREN: "("
_R_PAREN: ")"
LCASE_LETTER: "a".."z"
ROMAN_NUMERALS: "viii" | "vii" | "iii" | "ii" | "ix" | "vi" | "iv" | "v" | "i" | "x"

?start: qns_num qns_alphabet  qns_part
qns_num: INT?
qns_alphabet: _L_PAREN LCASE_LETTER _R_PAREN | LCASE_LETTER _R_PAREN | LCASE_LETTER?
qns_part: _L_PAREN ROMAN_NUMERALS _R_PAREN | ROMAN_NUMERALS _R_PAREN | ROMAN_NUMERALS?

When i use this rule and parse the follow text, i get an exception:

# lark.exceptions.UnexpectedCharacters: No terminal defined for 'i' at line 1 col 5
# 10i)i)
#     ^
result = Lark(grammar, parser='lalr').parse("10i)i)")

For the life of me, i can't think of why this throws an exception. But this is fine:

result = Lark(grammar, parser='lalr').parse("10(i)(i)") # no error

MegaIng commented 5 years ago

The problem is that the letter i can either be parsed as LCASE_LETTER and ROMAN_NUMERALS. This means that the first i is actual parsed as a ROMAN_NUMBERALS-Token, making it invalid to parse them the other way around. There are two ways to fix this:

delzac commented 5 years ago

Thanks for the reply @MegaIng!

Can you also explain why is it that "10(i)(i)" doesn't throw an exception? Doesn't the same conflict apply in terms of i can either be parsed as LCASE_LETTER and ROMAN_NUMERALS?

erezsh commented 5 years ago

The reason this happens, is because both rules can be empty, which causes the lexer to always jump over one of them in order to match the terminal with the higher priority.

With one rule empty and the second one matched, the parser expects an EOF, not more input. The introduction of ( forces the rule to not be empty.

So, changing the priority on LCASE_LETTER won't help. But not allowing the first rule to be empty will.

The Earley algorithm will know how to resolve this ambiguity automatically.

delzac commented 5 years ago

Thank you for taking the time to share @erezsh! Really appreciate your input :))