ANTLR Generates invalid Rust for Java 9 grammar

ColinTimBarndt commented 3 years ago

As stated in the title, I tried to generate a parser for the official Java9 grammar in the antlr/grammars-v4 repository, and it generated the following code, which causes syntax and import errors:

    impl<'input, Input:CharStream<From<'input> >> Java9Lexer<'input,Input>{
        fn JavaLetter_sempred(_localctx: Option<&LexerContext<'input>>, pred_index:isize,
                            recog:&mut <Self as Deref>::Target
            ) -> bool {
            match pred_index {
                    0=>{
                        Character.isJavaIdentifierStart(_input.LA(-1))
                    }
                    1=>{
                        Character.isJavaIdentifierStart(Character.toCodePoint((char)_input.LA(-2), (char)_input.LA(-1)))
                    }
                _ => true
            }
        }
        fn JavaLetterOrDigit_sempred(_localctx: Option<&LexerContext<'input>>, pred_index:isize,
                            recog:&mut <Self as Deref>::Target
            ) -> bool {
            match pred_index {
                    2=>{
                        Character.isJavaIdentifierPart(_input.LA(-1))
                    }
                    3=>{
                        Character.isJavaIdentifierPart(Character.toCodePoint((char)_input.LA(-2), (char)_input.LA(-1)))
                    }
                _ => true
            }
        }

}

I did not change any indentation. Apart from that, the code requires a Character structure that is not imported, and I do not know where it is defined; I looked through the generated files and it is not defined there. (char)_input.LA(-2) seems to be a Java leftover. I think this code originally cast a Java int to a char using Java's cast syntax, which does not exist in Rust.

The raw syntax errors:

cannot find value `Character` in this scope
not found in this scope rustc(E0425)

cannot find value `_input` in this scope
not found in this scope rustc(E0425)

missing `,` rustc
java9lexer.rs (340, 67): original diagnostic
Syntax Error: expected COMMA rust-analyzer
Syntax Error: expected COMMA rust-analyzer
Syntax Error: expected SEMICOLON rust-analyzer
expected one of `)`, `,`, `.`, `?`, or an operator, found `_input`
expected one of `)`, `,`, `.`, `?`, or an operator rustc
java9lexer.rs (340, 67): missing `,`

rrevenantt commented 3 years ago

Embedded parser/lexer actions are written in the target language, so you need to translate them to Rust manually before generating the parser.

ColinTimBarndt commented 3 years ago

Okay, I was not aware of this ANTLR feature. What is the equivalent of the _input variable in the Rust version?

rrevenantt commented 3 years ago

recog.input.la(-1)

ColinTimBarndt commented 3 years ago

recog.input.la(-1) does not work because recog.input is an Option<Input>. Can I expect that the option is Some?

Input::la returns an isize, but the documentation states that it returns the value of the current symbol in the stream. The size of an isize depends on the target system, so on some targets it might not be able to fit a whole character. Because of this, I can't cast the returned value to a char in Rust.

rrevenantt commented 3 years ago

recog.input.la(-1) does not work because recog.input is an Option<Input>. Can I expect that the option is Some?

Yes, but my initial advice was not perfect, better use recog.input().la(-1) which handles it.

might not be able to fit a whole character depending on the target

True, but do you really want to run it on 16-bit targets and still support the full Unicode code space?

Because of this, I can't cast the returned value to a char in Rust.

If your only problem is casting back to char, you will have to do the conversion with char::try_from().unwrap() regardless of the type I choose to hold the current symbol, because there is a dedicated EOF symbol I have to support.
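A minimal sketch of that conversion, kept independent of the runtime so it can be tried in isolation. It assumes that EOF is reported as -1 (the usual ANTLR convention); the `EOF` constant and the `symbol_to_char` helper are hypothetical names for illustration, not part of antlr4rust. Unlike a plain `unwrap()`, it returns `None` for EOF and for values that are not valid Unicode scalar values:

```rust
use std::convert::TryFrom;

/// Assumed end-of-input marker (ANTLR runtimes conventionally use -1).
/// This constant is defined here for illustration only.
const EOF: isize = -1;

/// Hypothetical helper: converts a lookahead symbol into a `char`.
/// Returns `None` for EOF and for any value that is not a Unicode
/// scalar value (negative numbers, surrogates, values above U+10FFFF).
fn symbol_to_char(symbol: isize) -> Option<char> {
    if symbol == EOF {
        return None;
    }
    // isize -> u32 fails for negative or oversized values;
    // u32 -> char fails for surrogates and out-of-range code points.
    u32::try_from(symbol)
        .ok()
        .and_then(|code_point| char::try_from(code_point).ok())
}
```

Inside a predicate, the value passed in would be whatever recog.input().la(-1) returns; a `None` result can simply make the predicate evaluate to false.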

Also, the Java parser code that you linked assumes parsing over UTF-16 code units (not really sure why). Technically, you can port it to Rust exactly like this with some manual UTF-16 transformations. But I would really recommend parsing over full Unicode code points. That will let you parse over a Unicode str directly, and the lexer will have only two | ~[\u0000-\u007F] { <check for java unicode identifier> }? parts.
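For illustration, the identifier checks could be sketched in plain Rust as below. This is an approximation under stated assumptions: the Rust standard library has no direct equivalent of Java's Character.isJavaIdentifierStart and Character.isJavaIdentifierPart, so these hypothetical helpers only approximate them with `char::is_alphabetic` and `char::is_numeric` plus `_` and `$`, and will disagree with Java on some characters (for example other currency symbols and combining marks). The last helper is needed only if you keep the UTF-16 approach and have to combine a surrogate pair the way Character.toCodePoint does:

```rust
/// Approximation of Java's `Character.isJavaIdentifierStart`.
/// Assumption: alphabetic characters, `_` and `$` are close enough.
fn is_java_identifier_start(c: char) -> bool {
    c.is_alphabetic() || c == '_' || c == '$'
}

/// Approximation of Java's `Character.isJavaIdentifierPart`.
/// Assumption: identifier-start characters plus numeric characters.
fn is_java_identifier_part(c: char) -> bool {
    is_java_identifier_start(c) || c.is_numeric()
}

/// Combines a UTF-16 surrogate pair into one `char`.
/// Returns `None` unless the two units form exactly one valid pair.
fn surrogate_pair_to_char(high: u16, low: u16) -> Option<char> {
    let units = [high, low];
    let mut decoded = char::decode_utf16(units.iter().copied());
    match (decoded.next(), decoded.next()) {
        // Exactly one successfully decoded character and nothing after it.
        (Some(Ok(c)), None) => Some(c),
        _ => None,
    }
}
```

When parsing over full code points, only the first two helpers are relevant, since each lookahead symbol is already a complete character.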

ColinTimBarndt commented 3 years ago

Thank you very much for your help, I finally got ANTLR working with your repository. I ran a first test with the following Java code, and parsing completed successfully and produced a parse tree. The modified grammar file might be useful for some people, but I am unsure where to make it available.

abstract class TestClass extends Other {
    protected final int alpha;
    public TestClass(int a) {
        super();
        this.alpha = a;
    }
}

rrevenantt / antlr4rust

ANTLR Generates invalid Rust for Java 9 grammar #17