Candidate sets for Antlr grammars

kaby76 commented 4 years ago

Mike,

Thanks for this library.

I am interested in computing the set of possible lookaheads in an Antlr grammar itself, specifically Expr.g4, defined at https://github.com/mike-lischke/antlr4-c3/blob/master/ports/c%23/test/Antlr4CodeCompletion.CoreUnitTest/Grammar/Expr.g4.

I've ported your library to use Antlr4.Runtime.Standard (v4.7.2), and updated the code to use the Java Antlr tool generator (https://github.com/kaby76/antlr4-c3). (Note, unless there is a compelling reason why people keep using Harwell's Antlr4cs generator, https://github.com/tunnelvisionlabs/antlr4cs, I prefer to use the standard Antlr Java tool and Antlr4.Runtime.Standard. As far as I know, no other target has a separate Antlr tool generator, and Harwell's tool is several version behind the current tool version 4.7.2.) I then added a test (https://github.com/kaby76/antlr4-c3/blob/master/ports/c%23/XUnitTestProject1/UnitTest1.cs#L121) that has the input of Expr.g4 erased from line 9 to the end of the file:

grammar Expr;

expression: assignment | simpleExpression;

assignment
    : (VAR | LET) ID EQUAL simpleExpression
;

The input has no syntax errors, but it is, of course, incomplete. I want to simulate typing in a new rule after the last semi-colon, e.g., code completion. I'm expecting at least RULE_REF to appear in the candidate.tokens list because after ";", I can start a new rule. When I call the library to compute the candidate set after the last semi-colon "core.CollectCandidates(index, null)", I get instead a token list of only CATCH, FINALLY, and -2. CATCH and FINALLY make sense because the rule for parserRuleSpec is "parserRuleSpec : DOC_COMMENT? ruleModifiers? RULE_REF argActionBlock? ruleReturns? throwsSpec? localsSpec? rulePrequel* COLON ruleBlock SEMI exceptionGroup ;". FOLLOW(SEMI) should contain FIRST(exceptionGroup), which in this case, contains CATCH, FINALLY. However, exceptionGroup can derive the empty string, so it should also contain FIRST(parserRuleSpec). If I add to the input "simpleExpression" after the last semi-colon, again I would expect RULE_REF at the caret positioned at "simpleExpression", but I only get CATCH, FINALLY, -2.

My question is why isn't RULE_REF in the lookahead? I've only started to debug your code, and it'll take a while for me to fully comprehend.

--Ken

mike-lischke commented 4 years ago

Ken, can you please clarify: are you looking for completion candidates for the Expr grammar or the ANTLR4 grammar? You mention elements from the ANTRL4 grammar, but also mention Expr (which wouldn't play a role at all when you are interested in the ANTLR4 grammar instead).

kaby76 commented 4 years ago

I'm looking for candidates for Antlr grammars. If the caret is placed after a semicolon of a grammar parser rule, I would expect to see RULE_REF in the list because I can certainly type in a new rule after a semicolon which ended the previous rule. Expr is just sample input, but it could be any grammar.

mike-lischke commented 4 years ago

Now I understand, thanks. The RULE_REF candidate is returned for me at this position. Have you perhaps included that in your ignored token list?

Note: including RULE_REF at this point is probably not helpful, because you would start a new rule when typing a name, for which you cannot provide meaningful completion items (which only can show existing elements).

kaby76 commented 4 years ago

I'm not filtering; the test is all in my forked repo at the link I gave. I'll check it against Harwell's version now.

kaby76 commented 4 years ago

Harwell's tool and runtime (all C#) produces the same result. It looks like the port of C3 to C# may be wrong. Checking this next.

kaby76 commented 4 years ago

Typescript version (generated 4.7.3 version from non-standard build), Harwell's tool 4.6.6 , and Antlr4 4.7.2 standard C# all produce Tokens={26,27,-2} = {CATCH, FINALLY, -2} at token index 34, which is the space just after the very last ";" of input "grammar Expr; expression: assignment | simpleExpression; assignment : (VAR | LET) ID EQUAL simpleExpression ; ". I don't think C3 is computing the token sets correctly. TOKEN_REF should be there. This seems to be similar to issue #23 -- FOLLOW(GET) should contain FIRST(fooBarBaz withQux? EOF). But, since fooBarBase derives epsilon, you need to add to the set FIRST(withQux? EOF). I'll now delve into the code.

kaby76 commented 4 years ago

A short note: the Seek() call at https://github.com/mike-lischke/antlr4-c3/blob/master/ports/c%23/src/Antlr4CodeCompletion.Core/CodeCompletion/CodeCompletionCore.cs#L100 should not be there! It is misleading; the stream is never used in the ProcessRule() ATN walker. And, it does not exist in the corresponding Typescript code https://github.com/mike-lischke/antlr4-c3/blob/master/src/CodeCompletionCore.ts. Continuing my code reading.

mike-lischke commented 4 years ago

If that call is not in the typescript code then it shouldn't be in any of the ports, you are right.

kaby76 commented 4 years ago

After a bit of reading and debugging of your code, the Antlr ATN parser, and one of Parr's paper on LL(*) over the last week, I believe the sets are not computed correctly in the case where the ATN recognizing the last token of the input contains a stop state of an ATN. ProcessRule() correctly parses the input to the caret, for a particular ATN corresponding to a rule in the grammar. The main loop for the ATN automaton interpreter is at https://github.com/mike-lischke/antlr4-c3/blob/master/ports/c%23/src/Antlr4CodeCompletion.Core/CodeCompletion/CodeCompletionCore.cs#L468. It performs the parse from a state within that ATN for the input from the pipeline queue, up until it gets to the stop state of the ATN or when a RULE transition occurs and ProcessRule() is then called recursively. The parser "accepts" the input when we arrive at a state via a non-epsilon transition on the last token in the input. You now place that state on the pipeline queue https://github.com/mike-lischke/antlr4-c3/blob/master/ports/c%23/src/Antlr4CodeCompletion.Core/CodeCompletion/CodeCompletionCore.cs#L584 and continue to compute the lookahead sets after the caret. At this point, the lookahead set should contain the non-epsilon symbols from the e-closure from the "accept" state, which you obtain via DetermineFollowSets(). But, more importantly, if we are in the stop state for the rule's ATN, the lookahead must also be obtained from the next enclosing state from the call stack of ATNs. It looks like you accumulate the sets in "this.Tokens[]", but ProcessRule() could contain multiple uses of a particular rule in the path of the parse tree down to the symbol at the caret. The final set of lookaheads reported back to the user should not include epsilon (= -2), but your code does anyways. What is interesting is that Parr's Antlr ATN parser runtime also does not correctly predict the lookahead sets either for reporting (https://github.com/antlr/antlr4/blob/master/runtime/CSharp/runtime/CSharp/Antlr4.Runtime/DefaultErrorStrategy.cs#L410). (Take an Antlr grammar like Expr.g4, and place a "." in the input following a ";" at the end of a rule, create a parser for Antlr grammars itself and call this parser on this input. Antlr's default reporter says it expects EOF/mode because these can occur after the semi-colon. But, it does not say that TOKEN_REF or RULE_REF is expected, which clearly it should, since a new rule--whether parser or lexer grammar--can occur at the point of the ".".) What I plan to do is to write my own implementation. A parse is already done via Antlr, but unfortunately, the Antlr runtime only records the start state of a rule (Antlr4.Runtime.RuleContext.invokingState), and not for each state transition from the start state to accept state in the parse tree. But, I'm not sure if modifying the parser would help because I think it just stops the search after obtaining the first valid parse; your code continues, which is correct. --Ken

mike-lischke commented 4 years ago

In addition to your observations there's the fact that the C3 engine has to track back and follow other possible paths through the ATN, which the normal ANTLR4 parser does not do (and which is one of the main reasons why I don't use it for code completion).

kaby76 commented 4 years ago

Yes, I think I read that somewhere in your code comments. I agree, it's important to do a full search of the parse space. -K

kaby76 commented 3 years ago

I decided to revisit this problem after noticing a blog article referencing this library. To put it more clearly, Antlr4-c3 has a bug--it does not compute the correct completion set in some cases where some rules derive the empty string.

Here are two examples. The first one is for the ANTLRv4 grammar which I mentioned above. I've updated a forked copy of the C# port of the code here to contain a unit test for the ANTLRv4 grammar example I gave above. I also observed the same problem with the vscode-antlr4 extension--see the screenshot of VSCode where I positioned the cursor after the semicolon for "expression" in Expr.g4 and ask for the completion set. It shows only "catch" and "finally", not a RULE_REF or TOKEN_REF. ● Expr g4 - Visual Studio Code 9_19_2020 6_41_23 PM

A second example is with the input "A B D" with this split grammar:

parser grammar XParser;
options { tokenVocab = XLexer; }
a : A b c d ;
b : B? X? ;
c : C? Y? ;
d : D ;

lexer grammar XLexer;
channels { HIDE_CHANNEL }
A : 'A';
B : 'B';
C : 'C';
D : 'D';
X : 'X';
Y : 'Y';
WS : [ ]+ -> channel(HIDE_CHANNEL);

The completion set computed for input A B D with token index 3 (which corresponds to the space between 'B' and 'D') is the set 'X', 'Y', 'C'. It should contain 'D' but it doesn't. However, A B D parses without error, so clearly 'D' can follow A B. Note, the Antlr runtime error reporting doesn't do any better, saying that only 'D' is expected for the partial input A B. But, clearly, A B C D also parses, so 'C' should be mentioned (as well as 'X' and 'Y').

The author of the blog article notices that Antlr4-c3 returns an unexpected result for positioning the cursor before the closing parentheses of a print() statement. I haven't looked at his grammar, but it's possible that he's observing the same problem.

alessiostalla commented 3 years ago

I'm the author of the article. The grammar is a slightly modified version of the Kotlin grammar by A. Shadrina.

I'm far from being as knowledgeable as you on antlr4-c3's inner workings. I've just observed that it sometimes behaves in ways that to me are quirky and I've developed a few tricks/heuristics to mitigate that.

kaby76 commented 3 years ago

I've developed a few tricks/heuristics to mitigate that.

The above "X" grammar doesn't have "skip" tokens, so that trick wouldn't apply. But, perhaps the other trick you mention--inserting symbols that derive the empty string--might apply to X. I'll look at this suggestion when I have time.

Last year I wrote a straightforward nondeterministic recognizer as an alternative for C3, but the performance is atrocious, especially when the input contains multiple errors and one doesn't use bail-mode. One problem is that it starts from the beginning of the entire input and re-does the parse, like C3. An alternative that I haven't explored is to compute the start state of the NFA from the state of the parse in ANTLR when SyntaxError() is called, and so avoiding a re-parse of the entire input. Also, I read on another issue thread of a modification to C3 with a large performance improvement, but I haven't looked at it. All interesting things to explore. Unfortunately, I haven't time--I'm writing a Bash-like shell and toolset for ANTLR parsers and trees.

alessiostalla commented 3 years ago

I too have the sensation that reusing the state of the parser makes a lot of sense as most often you'll have to parse anyway (to build a symbol table to suggest identifiers, at least). Since we cannot count on a syntax error to be present when asking for completion, I'm wondering if a sensible strategy could be to decorate the input stream and lexer so as to emit a syntetic "caret" token when encountering the caret. That would reliably cause a syntax error that we could interercept. However, I don't know how to handle when the caret is inside a token (e.g. pr|int, I want printAndMakeMeCoffee instead and decided to put the caret there because I'm a bad person and hey I don't pay all that CPU to be lazy).

kaby76 commented 2 years ago

Has the code been fixed? I just tested the code with the tip of master on this grammar, input, and start index for C3. Am I reading this right? I get as possible valid tokens "B", "X", and -2. Is -2 a valid token type? (See attached .zip file. c3.zip To build, just type "dotnet build". To run, execute "Test.exe", input is a hard-wired string.)

I am expecting not only "B", "X", but also "C", "Y", "D", because non-terminals "b" and "c" are nullable.

Why is "-2" a valid token from C3?

grammar

grammar XXX;

a : A b c d EOF ;
b : B? X? ;
c : C? Y? ;
d : D ;

A : 'A';
B : 'B';
C : 'C';
D : 'D';
X : 'X';
Y : 'Y';
WS : [ ]+ -> channel(HIDDEN);

input

output

Time: 00:00:00.1155655
Parse failed.
Tokens ====
2
 Key = ''B''
5
 Key = ''X''
-2
Rules ====

mike-lischke commented 2 years ago

Oh, sorry, this was closed accidentally. No progress has been made with this issue.

mike-lischke / antlr4-c3