jamesrhester / Lerche.jl

A Julia port of the Lark parser
MIT License

Possible bug defining terminals #26

Closed Amval closed 2 years ago

Amval commented 2 years ago

Hello,

I need to parse a relatively complex grammar and have been exploring this library, but I can't seem to get started. It may well be that I am misunderstanding how to use it. I constructed a simple example:

using Lerche

struct TreeToOWL <: Transformer end

test_data = """
Prefix(12345)
"""

example_grammar = raw"""
    INTEGER : /\d+/
    WORD : /\w+/

    ?start: prefix
    prefix: "Prefix(" INTEGER  ")"

    %import common.WS
    %ignore WS
"""

example_parser = Lark(example_grammar, parser="lalr", lexer="standard", transformer=TreeToOWL())

parsed = Lerche.parse(example_parser, test_data) # => Tree(prefix, Any[Token(INTEGER, 12345)])

If I change the prefix rule from prefix: "Prefix(" INTEGER ")" to prefix: "Prefix(" WORD ")", I get the following:

ERROR: ERROR: KeyError: key "WORD" not found

I cannot make sense of this. I tried two more variants:

WORD: /[a-z]+/   # => works
WORD: /[a-zA-Z]+/   # => KeyError

So I don't seem to be able to write more complex regexps either. I also took a look at your dREL grammar and found this: ID : /[A-Za-z_][A-Za-z0-9_$]*/

And I don't seem to be able to use it either. Am I doing something wrong? Thanks in advance!

I replicated this issue in both Julia 1.6.1 and Julia 1.7.0-rc2.

jamesrhester commented 2 years ago

The issue here is that the character sequence Prefix also matches the WORD terminal. When the WORD terminal is not used in any production, it is dropped from the set of possible matches, which is why the first example works. When it is used in the grammar, the lexer can match both WORD (with value Prefix) and the literal Prefix(. Which one is chosen comes down to which appears first. [a-z]+ works because the capital P prevents WORD from matching Prefix.
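To illustrate the overlap, here is a minimal sketch using only Base Julia regexes. The helper name lexeme_at_start is hypothetical, and this is not how Lerche's lexer is implemented; it only checks which of the WORD patterns from this thread can match at the first character of the input.

```julia
# Hypothetical helper: return the text a pattern would match at the very start
# of the input, or `nothing` if it cannot match there.
function lexeme_at_start(pattern::Regex, text::AbstractString)
    m = match(pattern, text)          # leftmost match anywhere in the text
    (m === nothing || m.offset != 1) && return nothing
    return String(m.match)
end

input = "Prefix(12345)"

# The three WORD definitions tried in this thread.
candidates = [
    "/\\w+/"      => r"\w+",
    "/[a-zA-Z]+/" => r"[a-zA-Z]+",
    "/[a-z]+/"    => r"[a-z]+",
]

for (label, pattern) in candidates
    println(label, " at start of input: ", repr(lexeme_at_start(pattern, input)))
end
```

The first two patterns can consume Prefix at the start of the input, so they compete with the literal Prefix(. The lowercase-only pattern cannot start at the capital P, so no conflict arises.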

You should see the message:

Unexpected input: UnexpectedToken(Token(WORD, Prefix), String[], 1, 1, nothing, 4, 1, nothing, false)
ERROR: Unexpected token Prefix at line 1, column 1.

from which you can deduce that the lexer has returned a WORD token with value Prefix. I see that the message could be clearer.

You can fix this either by using a contextual lexer, lexer="contextual" (which restricts the terminals that can match to those that are possible at that point in the grammar), or by setting priorities.
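A sketch of the contextual-lexer variant, assuming Lerche is installed. It reuses the grammar from the original report with the prefix rule switched to WORD. The input string Prefix(abc) is chosen here for illustration, and the transformer argument is left out to keep the example short.

```julia
using Lerche

example_grammar = raw"""
    INTEGER : /\d+/
    WORD : /\w+/

    ?start: prefix
    prefix: "Prefix(" WORD ")"

    %import common.WS
    %ignore WS
"""

# Same call as in the report, but with lexer="contextual" instead of "standard",
# so WORD is only tried where the grammar can accept it.
example_parser = Lark(example_grammar, parser="lalr", lexer="contextual")

parsed = Lerche.parse(example_parser, "Prefix(abc)")
println(parsed)
```

At the start of the input the grammar can only accept the literal Prefix(, so WORD is not a candidate there and the overlap no longer matters.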

Amval commented 2 years ago

OK, now that makes complete sense. I was a bit perplexed. Changing to lexer="contextual" allows me to do what I intended. Thank you very much for the explanation and the quick answer!