lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
MIT License

"Unused terminals" false positive in recursive terminal #1246

Open BenWiederhake opened 1 year ago

BenWiederhake commented 1 year ago

Describe the bug In a grammar where one terminal is built by concatenating several other terminals, the references inside that terminal's definition are not counted as a "use" of the nested terminals. This leads to spurious "Unused terminals:" warnings.

To Reproduce Install Lark and run the following code:

import lark
import logging
my_grammar = r"""
    value: IDENTIFIER
    _IDENT_LETTER: "A".."Z"
    DECIMAL_DIGIT: "0".."9"
    IDENTIFIER: _IDENT_LETTER (_IDENT_LETTER | DECIMAL_DIGIT)+
    """
# Enable Lark's debug logging so that the "Unused terminals" message is shown.
lark.logger.setLevel(logging.DEBUG)
my_parser = lark.Lark(my_grammar, start="value", parser="lalr", debug=True)
tree = my_parser.parse("E2BIG")
print(f"{tree=} -> pretty:\n{tree.pretty()}")
tree = my_parser.parse("ANSWER42")
print(f"{tree=} -> pretty:\n{tree.pretty()}")

Expected behavior It correctly parses both identifiers and prints them to the console:

tree=Tree(Token('RULE', 'value'), [Token('IDENTIFIER', 'E2BIG')]) -> pretty:
value   E2BIG

tree=Tree(Token('RULE', 'value'), [Token('IDENTIFIER', 'ANSWER42')]) -> pretty:
value   ANSWER42

Actual behavior It correctly parses both identifiers and prints them to the console, but it also reports the nested terminals as unused:

Unused terminals: ['_IDENT_LETTER', 'DECIMAL_DIGIT']
tree=Tree(Token('RULE', 'value'), [Token('IDENTIFIER', 'E2BIG')]) -> pretty:
value   E2BIG

tree=Tree(Token('RULE', 'value'), [Token('IDENTIFIER', 'ANSWER42')]) -> pretty:
value   ANSWER42

Additional notes It does not seem to matter whether the nested terminals (_IDENT_LETTER, DECIMAL_DIGIT) are named with a leading underscore or not. This may or may not contradict what https://raw.githubusercontent.com/lark-parser/lark/master/docs/_static/lark_cheatsheet.pdf says about terminals being "filtered out".
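
To make the suspected cause concrete, here is a toy model of the two ways "usage" could be counted. It is only a sketch, assuming the check looks at rule bodies alone; it is not Lark's implementation, and the names RULES, TERMINALS, referenced_terminals, used_directly and used_transitively are hypothetical. It needs only the standard library.

```python
import re

# Toy copy of the grammar from this report, as name -> definition text.
# (Hypothetical representation; Lark's internal structures are different.)
RULES = {"value": "IDENTIFIER"}
TERMINALS = {
    "_IDENT_LETTER": '"A".."Z"',
    "DECIMAL_DIGIT": '"0".."9"',
    "IDENTIFIER": "_IDENT_LETTER (_IDENT_LETTER | DECIMAL_DIGIT)+",
}


def referenced_terminals(definition):
    """Return the declared terminal names mentioned in a definition."""
    # Words such as "A" or "9" from string literals are dropped because
    # they are not keys of TERMINALS.
    return {w for w in re.findall(r"\w+", definition) if w in TERMINALS}


def used_directly():
    """Terminals that appear in at least one rule body."""
    used = set()
    for body in RULES.values():
        used |= referenced_terminals(body)
    return used


def used_transitively():
    """Like used_directly(), but also follows terminal-to-terminal references."""
    used = set()
    pending = list(used_directly())
    while pending:
        name = pending.pop()
        if name not in used:
            used.add(name)
            pending.extend(referenced_terminals(TERMINALS[name]))
    return used


unused_direct = sorted(set(TERMINALS) - used_directly())
unused_transitive = sorted(set(TERMINALS) - used_transitively())

print(unused_direct)      # ['DECIMAL_DIGIT', '_IDENT_LETTER']
print(unused_transitive)  # []
```

Counting only rule bodies reproduces the two names from the warning above, while following references through IDENTIFIER leaves nothing unused. Which of the two the warning is meant to report is the question this issue raises.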