lezer-parser / lezer

Dev utils and issues for the Lezer core packages

bug causing wrong token to be parsed in areas where they shouldn't be valid. #50

Closed rvion closed 6 months ago

rvion commented 6 months ago

Hello @marijnh, thanks for the awesome lib.

I'm having an issue with overlapping tokens, and I think this is a bug. Various @precedence configurations yield invalid parses despite the grammar being unambiguous. 🔴 The wrong token gets parsed in places where it shouldn't be valid.

Full executable repro with test case here: https://github.com/rvion/lezer-repro-2024-02-05 (scripts in package.json to run test file and regen grammar)

But everything needed to see whether it is a bug should be here 👇

/*
input: "123 (foo)1"
what I want:
| File            (123 (foo)1)  (0->10)
|     Word        (123)         (0->3)
|     Group       ((foo)1)      (4->10)
|         Word    (foo)         (5->8)
|         Number  (1)           (9->10)

*/
@top File { expression+ }
expression[@isGroup=Expression] { Group | Word }
@skip { space }
Group { "(" expression+ ")" Number }
@tokens {
    space { @whitespace+ }
    Word { ($[A-Za-z0-9._\\\/\-] )+ }
    Number { "-"? $[0-9.]+ }

    // 🟢 WORKS when there is no precedence, or when just Word is in the precedence
    @precedence { Word }

    // 🔴 the group weight is parsed as an identifier despite the grammar being
    // explicit / unambiguous. Word shouldn't have been considered there.
    // |File            (123 (foo)1)  (0->10)
    // |    Word        (123)         (0->3)
    // |    Group       ((foo))       (4->9)
    // |        Word    (foo)         (5->8)
    // |        ⚠       ()            (9->9)
    // |    Word        (1)           (9->10)
    @precedence { Word, Number }

    // 🔴 the top-level 123 is parsed as a number,
    // despite the grammar not allowing a number at the top level,
    // and despite the whole thing being unambiguous
    // |File            (123 (foo)1)  (0->10)
    // |    ⚠           (123)         (0->3)
    // |        Number  (123)         (0->3)
    // |    Group       ((foo)1)      (4->10)
    // |        Word    (foo)         (5->8)
    // |        Number  (1)           (9->10)
    @precedence { Number, Word }
}
marijnh commented 6 months ago

This is how token precedence works. @precedence { Word, Number } means that any token that can be tokenized as both a word and a number should be a word, regardless of what the parser accepts at that point.

You might be able to get what you want by defining two different 'word' tokens: one that is only used in situations where numbers aren't valid, and one, with lower precedence than number tokens, that is used in places where numbers may occur.
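
To make the behavior discussed in this thread concrete, here is a toy Python model of context-independent token precedence. It is not Lezer's tokenizer, which compiles the token rules into a state machine. It assumes only that the tokenizer picks the first entry in the precedence list whose pattern matches the text, without asking the parser which tokens are valid at that position. That assumption is consistent with the two failing parse trees in the report. The names `PATTERNS` and `classify` are invented for this sketch.

```python
import re

# Token patterns transcribed from the grammar in the report.
PATTERNS = {
    "Word": re.compile(r"[A-Za-z0-9._\\/\-]+"),
    "Number": re.compile(r"-?[0-9.]+"),
}


def classify(text, precedence):
    """Return the first token name in `precedence` whose pattern matches
    all of `text`. Parser context is deliberately ignored (assumption)."""
    for name in precedence:
        if PATTERNS[name].fullmatch(text):
            return name
    return None


# @precedence { Number, Word }: the top-level "123" becomes a Number,
# even though only Word is allowed there, so the parser reports an error.
print(classify("123", ["Number", "Word"]))  # Number

# @precedence { Word, Number }: the "1" after ")" becomes a Word,
# even though only Number is allowed there.
print(classify("1", ["Word", "Number"]))  # Word

# Text matched by only one pattern is unaffected by the ordering.
print(classify("foo", ["Number", "Word"]))  # Word
```

In this model no single ordering can satisfy both positions in "123 (foo)1", which is why the suggestion is to use separate word tokens for the contexts where numbers are and are not valid.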