Antlr4ng fails to parse codeql/examples/alias.qll

kaby76 commented 8 months ago

Testing Antlr4ng with all of grammars-v4 (I am adding the target to testing, started with https://github.com/antlr/grammars-v4/pull/4010).

Antlr4ng fails on grammar https://github.com/antlr/grammars-v4/tree/master/codeql with alias.qll. Output:

$ bash run.sh ../examples/alias.qll -trace > trace.out
line 1:0 mismatched input 'module' expecting {'predicate', 'abstract', 'cached', 'external', 'extensible', 'final', 'transient', 'library', 'private', 'deprecated', 'override', 'additional', 'query', 'pragma', 'language', 'bindingset', QL_DOC}
TypeScript 0 ../examples/alias.qll fail 0.131
Total Time: 0.134
03/15-06:55:25 ~/temp/g4-antlr4ng/codeql/Generated-Antlr4ng

All other targets parse this input file without error.
Parser.setTrace(true) works, but I really need the trace_atn_sim?: boolean; flag and associated output statements in order to compare ATN sets. This output has been extremely valuable in fixing bugs in several targets.

mike-lischke commented 8 months ago

OK, I'll add it back. Somehow I assumed that would be something internal for the original Java runtime development.

While you are at that, it would be great if you could also look after the 4 tests that are disabled for the TS (and other) runtimes, which are:

LexerExec/AtnStatesSizeMoreThan65535
ParserErrors/ExtraneousInput
Performance/DropLoopEntryBranchInLRRule_4
SemPredEvalParser/PredFromAltTestedInLoopBack_1

All four are generated along with all the other runtime tests when you execute npm run generate-runtime-tests in this repo, but are disabled in the Jest tests by using the xit() function (instead of it() like the others).

mike-lischke commented 8 months ago

Check commit 8d0acef.

kaby76 commented 8 months ago

Making slow progress, but it looks like it's failing in DefaultErrorStrategy.Sync(), specifically with the call to "nextTokens()", where it's not returning the correct set of tokens to expect ({35, 53..65, 72, 74, 107}). (Should be {{8, 29, 35, 53..65, 72, 74, 107}}, which has 'class' and 'module', but not in the Antlr4ng set.) So, that's why the parse error is a mismatch. Unfortunately, this part of the code gets difficult to debug because the LL1Analyzer.Look() method is recursive, and we're in a ATN state (786) that has lots of eps-transitions that go to more eps-transitions. Will pick it up in the morning.

graphviz (13)

mike-lischke commented 8 months ago

I found the error. It's an invalid simplification in LL1Analyzer. There's even a comment from me pointing out a risk by using this approach (using a hash code for lookup instead of the entire ATNConfig). By changing this back to what the antlr4 runtime does, things work fine now.

Unfortunately, this makes the runtime a bit slower 😒. I need to find a better way to do these lookups without creating intermediate class instances.

kaby76 commented 8 months ago

Right. It's this code here with "lookBusy". Starts out at state 786 in the "alias" ATN, and goes through the three eps transitions. It works fine for the first, but when it gets to the second or third, it wants to process a rule-transition for "annotations" in a single-state context, Look() enters state 72, the start of the ATN for "annotations". As it already processed a rule call from "alias" to "annotations", it was recorded in "lookbusy", and Look() does a return rather than modify "look" in this context.

mike-lischke commented 8 months ago

Seems this case is not covered in the unit tests. Having a rule with alts that use the same left hand side obviously leads to this situation (and that's a pretty common one, given how often people repeat parts in each alt for better readability, instead of factoring the common part out).

Somehow I think parsing all grammars in the grammar repository as part of the unit tests is a very helpful measure. Unfortunately it involves target specific code which the runtime author would have to provide...

kaby76 commented 8 months ago

Well, yes, for grammars in grammars-v4 that need target-specific code. But, there's still several hundred grammars that don't require target-specific code, so grammars-v4 is pretty useful, even though some say "it's just chock-full of crappy grammars." Yes, that's true, but that's also why it's kind of useful. There were only a couple of failed grammars for Antlr4ng.

If you can make a new release, I can try the grammars in grammars-v4 again. Ty.

Also, I'm kind of perplexed why this ATN.nextToken(ATNState) computation isn't memoized because it doesn't change given a state, and interpreting the state transitions through a graph is expensive.

mike-lischke commented 8 months ago

It's more of an optional feature. Jim Idle implemented it for better error recovery, but it's not really necessary. A custom error strategy could just jump over it, although the speed gain is rather negligible.

If you can make a new release, I can try the grammars in grammars-v4 again. Ty.

Done.

mike-lischke / antlr4ng

Antlr4ng fails to parse codeql/examples/alias.qll #47