lark-parser / Lark.js

Live port of Lark's standalone parser to Javascript
MIT License

Various errors when using `|` inside of terminals #31

Open swwu opened 2 years ago

I've noticed some errors when using a terminal "production" rule of the form

T0: T1 | T2 | T3

where all of the given expressions are terminals. These errors only occur in the standalone parser generated by Lark.js; the same grammar correctly parses an identical string in the Python version of Lark. I've isolated two hopefully-minimal-enough example cases below.

This seems similar to #21, in that it looks related to some JavaScript-specific regex foible that shows up when terminals are combined via `|`, but as I'm not super-familiar with the internals of the library I can't be sure. As in #21, replacing VALUE with value everywhere (i.e. replacing the terminal rule with a non-terminal one) causes both of the following examples to parse correctly.

Example 1

This grammar:

?start: thing
thing: thing W thing
    | expr
expr: label W? VALUE
    | VALUE
label: BARE_WORD W? ":"
W: /[ \t\n\v\f]/+
VALUE: NUMBER | BARE_WORD | STRING
BARE_WORD: /[^\s:\(\)]/+
STRING: "\"" /((?:\\"|[^\r\n"]))/* "\""
NUMBER: /[0-9]+/

fails with `UnexpectedToken` when attempting to parse the string "a:b", although running it in the Python version of Lark results in a correct parse.

Example 2

This grammar:

?start: thing
thing: label VALUE | VALUE
label: BARE_WORD W? ":"
W: /[ \t\n\v\f]/+
VALUE: NUMBER | BARE_WORD | STRING
BARE_WORD: /[^\s:\(\)]/+
STRING: "\"" /((?:\\"|[^\r\n"]))/* "\""
NUMBER: /[0-9]+/

fails with `SyntaxError: Invalid flags supplied to RegExp constructor 'nully'` during lexing of the same string "a:b"; the Python version parses it correctly here as well.
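
One possible reading of the 'nully' in that message: in JavaScript, concatenating a null flags value with the sticky flag "y" produces the string "nully", which the RegExp constructor rejects. The sketch below only shows that language behaviour. It is an assumption about the cause, not something confirmed against the Lark.js source, and `buildSticky` and `tryBuild` are hypothetical helpers, not Lark.js functions.

```javascript
// Hypothetical helper (NOT Lark.js code): builds a sticky regex by
// appending "y" to whatever flags value it is given.
function buildSticky(source, flags) {
  // If flags is null, string concatenation turns it into "nully".
  return new RegExp(source, flags + "y");
}

// Hypothetical helper: reports whether construction succeeded.
function tryBuild(flags) {
  try {
    buildSticky("[0-9]+", flags);
    return "ok";
  } catch (e) {
    return e.name;
  }
}

console.log(null + "y");     // "nully"
console.log(tryBuild(null)); // "SyntaxError"
console.log(tryBuild(""));   // "ok"
```

If something like this is what happens, it would suggest that the combined VALUE terminal ends up with no flags set in the generated JavaScript, whereas the plain terminals carry an empty flags value.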