rrevenantt / antlr4rust

ANTLR4 parser generator runtime for Rust programming language

Issue when parsing grammar with case-insensitive keywords (maybe fragment or precedence related) #2

Closed jhorstmann closed 4 years ago

jhorstmann commented 4 years ago

I'm using the following pattern to parse keywords in a case-insensitive way:

SELECT  : S E L E C T;
FROM    : F R O M;
WHERE   : W H E R E;
IDENTIFIER
        : [a-zA-Z_] [a-zA-Z_0-9]*
        ;

fragment A  : ('a'|'A');
fragment B  : ('b'|'B');
fragment C  : ('c'|'C');
fragment D  : ('d'|'D');
fragment E  : ('e'|'E');
fragment F  : ('f'|'F');
...

The full grammar and test case can be found at https://github.com/jhorstmann/rust-antlr-case-insensitive-keywords. The lexer and parser are generated by build.rs, and the project can be run with cargo run to output the parsed AST of a sample query.

select foo as f, bar as b from table where baz limit 10

I see the following output which shows an error message from the parser:

line 1:37 mismatched input 'where' expecting {<EOF>, WHERE, LIMIT}
[src/main.rs:55] query = Query {
    columns: [
        Column {
            name: "foo",
            alias: Some(
                "f",
            ),
        },
        Column {
            name: "bar",
            alias: Some(
                "b",
            ),
        },
    ],
    from: "table",
    filter: None,
    limit: None,
}

It's interesting that other keywords like AS or FROM seem to be parsed fine. A similar grammar in a Java project also handles all keywords like this in a case-insensitive way.

My guess would be that it's somehow related to the precedence between the lexer rules, where the keywords should take precedence because they are listed before the IDENTIFIER rule in the grammar.
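
For context, ANTLR lexers resolve this kind of ambiguity by taking the longest match and, on a tie, the rule that is listed first in the grammar. A minimal Rust sketch of that rule follows. It is not code from antlr-rust or from the generated lexer, and the names TokenKind, match_keyword, match_identifier and next_token are hypothetical, chosen only for illustration.

```rust
/// Token kinds for a tiny SQL-like lexer (hypothetical, for illustration only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TokenKind {
    Select,
    From,
    Where,
    Identifier,
}

/// Length of a case-insensitive keyword match at the start of `input`, or 0.
fn match_keyword(input: &str, keyword: &str) -> usize {
    let n = keyword.len();
    match input.get(..n) {
        Some(prefix) if prefix.eq_ignore_ascii_case(keyword) => n,
        _ => 0,
    }
}

/// Length of an `[a-zA-Z_] [a-zA-Z_0-9]*` match at the start of `input`, or 0.
fn match_identifier(input: &str) -> usize {
    let mut len = 0;
    for (i, c) in input.char_indices() {
        let ok = if i == 0 {
            c.is_ascii_alphabetic() || c == '_'
        } else {
            c.is_ascii_alphanumeric() || c == '_'
        };
        if !ok {
            break;
        }
        len = i + c.len_utf8();
    }
    len
}

/// Picks the next token: longest match wins, ties go to the earlier rule.
fn next_token(input: &str) -> Option<(TokenKind, usize)> {
    // Candidates are listed in grammar order: keywords first, IDENTIFIER last.
    let candidates = [
        (TokenKind::Select, match_keyword(input, "select")),
        (TokenKind::From, match_keyword(input, "from")),
        (TokenKind::Where, match_keyword(input, "where")),
        (TokenKind::Identifier, match_identifier(input)),
    ];
    let mut best: Option<(TokenKind, usize)> = None;
    for &(kind, len) in candidates.iter() {
        // Strict comparison: a later rule only replaces an earlier one
        // when its match is longer, so equal lengths keep the earlier rule.
        if len > 0 && best.map_or(true, |(_, best_len)| len > best_len) {
            best = Some((kind, len));
        }
    }
    best
}
```

Under this rule the input where lexes as WHERE in any letter case, while a longer word such as wherever lexes as IDENTIFIER because the identifier match is longer.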

rrevenantt commented 4 years ago

Hmm, without EOF in the grammar it works. So the problem is in the EOF handling, though I'm not sure yet where exactly.

jhorstmann commented 4 years ago

I think omitting the EOF just makes the parser stop at the first thing it does not recognize without reporting an error and with None for the where clause.

But I just found another strange thing: if I change the grammar and query to use shorter keywords, everything gets parsed correctly:

SELECT  : S E L;
WHERE   : W H;
AS      : A S;
FROM    : F R O M;
LIMIT   : L I M;

jhorstmann commented 4 years ago

In fact, I can simplify the example to just:

query   : SELECT EOF;

SELECT  : S E L E C T;

SPACES  : [ \t\r\n] -> skip ;

and when parsing "select" it prints:

line 1:0 token recognition error at: 'selec'
line 1:5 token recognition error at: 't'
line 1:6 missing SELECT at '<EOF>'
thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', src/main.rs:20:25

I pushed the simplified example to my repo.

rrevenantt commented 4 years ago

Thanks, that indeed helped. I forgot to calculate the initial hash in one place. I will fix it in 0.1.1 soon. I have published antlr-rust on cargo recently, so you can start using it as a normal dependency if you want.
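
The kind of bug described here can be sketched in Rust as follows. This is an illustration only, assuming a state object that caches its hash at construction time. It is not the actual antlr-rust code, and CachedState, compute_hash and new_without_hash are hypothetical names. If one construction path skips the hash calculation, two states that compare equal carry different hashes, so hash-based lookups of previously seen states can miss.

```rust
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for a state object that caches its own hash.
#[derive(Debug, Clone)]
struct CachedState {
    rule_path: Vec<u32>,
    cached_hash: u64,
}

/// Simple polynomial hash with a non-zero seed (an assumption for this sketch,
/// not the hash function the real runtime uses).
fn compute_hash(rule_path: &[u32]) -> u64 {
    let mut h: u64 = 7;
    for &v in rule_path {
        h = h.wrapping_mul(31).wrapping_add(u64::from(v));
    }
    h
}

impl CachedState {
    /// Correct constructor: the cached hash is calculated up front.
    fn new(rule_path: Vec<u32>) -> Self {
        let cached_hash = compute_hash(&rule_path);
        CachedState { rule_path, cached_hash }
    }

    /// Buggy constructor: the initial hash calculation is forgotten,
    /// so the cached value stays at its default of 0.
    fn new_without_hash(rule_path: Vec<u32>) -> Self {
        CachedState { rule_path, cached_hash: 0 }
    }
}

/// Equality looks only at the contents, not at the cached hash.
impl PartialEq for CachedState {
    fn eq(&self, other: &Self) -> bool {
        self.rule_path == other.rule_path
    }
}

impl Eq for CachedState {}

/// Hashing trusts the cached value, so a stale cache breaks the
/// "equal values have equal hashes" contract that hash maps rely on.
impl Hash for CachedState {
    fn hash<H: Hasher>(&self, state: &mut H) {
        state.write_u64(self.cached_hash);
    }
}
```

Here CachedState::new(vec![1, 2, 3]) and CachedState::new_without_hash(vec![1, 2, 3]) compare equal but feed different values into the hasher, which violates the contract that hash-based collections depend on.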

jhorstmann commented 4 years ago

Sorry for the late reply. Version 0.1.1 fixed the issue and works very nicely. Thanks a lot!