Fix C++ port behavior divergence with reference implementation

vityaman commented 1 month ago

Description

During rewriting test scenarios from TypeScript to C++ there was explored a behavior divergence in C++ port.

Also after adding GitHub Workflow for Java also failing test was detected.

Examples of that

[Java, C++], candidates.tokens[ExprLexer::VAR] returns [] instead of [ExprLexer::ID, ExprLexer::EQUAL] : https://github.com/mike-lischke/antlr4-c3/commit/83d73ad8f78ddf7a6526ef2ad06b09a0c840b16f
[C++], underfilled idexpressionStack on TEST(CPP14Parser, RealCppFile)

Related issues and MRs

vityaman commented 1 month ago

I found a bug.

When running tests using ctest (cd test && ctest) they passed, but running tests from a binary file ./expr/antlr4-c3-test-expr they fail. Then I explored that CTest runs all tests by sequentially running a binary file with a filter for an only test case, while running tests using a binary file leads to all test scenarios execution.

So, I decided that the problem was in some implicit dependency of test scenarios. I switched Expr MostSimpleSetup and TypicalSetup test cases and got test failure.

This is because of caching setsPerState[startState->stateNumber]. Ignoring the cache helped and made tests green.

mike-lischke commented 1 month ago

But disabling the cache will certainly make everything slower. The must be a difference in behavior of the cache in both TS and C++.

vityaman commented 1 month ago

I did some more research. What I found.

The problem is that cache setsPerState is not idempotent. FollowSetsHolder contains the same FollowSetWithPath::path, FollowSetWithPath::intervals and combined, but different FollowSetWithPath::following.

How is that possible while procedures determineFollowSets -> collectFollowSets -> getFollowingTokens look quite isolated? Quite, but not entirely, because getFollowingTokens uses ignoredTokens in a such way that it will exclude an ignored token from following sequence.

In ExprTest::MostSimpleSetup and ExprTest::TypicalSetup test cases, we have different sets of CodeCompletionCore::ignoredTokens. As {ExprLexer::ID, ExprLexer::EQUAL} are not ignored at the first scenario, they are inside the following sequence. But in the second scenario we ignore them, and therefore we exclude them.

Based on this fact, I can say that current C++ port behavior is correct relatively to getFollowingTokens correctness, as it returns an empty following set, while TypeScript version returns non-empty only because of that setsPerState cache.

@mike-lischke, I propose to drop cache at CodeCompletionCore configuration update. It can be done by adding code to property setter or checking that configuration was not changed from the last collectCandidates (check hashCode and then equals).

If I could design a library differently, I would make everything as immutable as possible to avoid such bugs with caches and also improve performance by building data structures with higher level of specialization for a task. I don't see the advantage of having mutable ignoredTokens for example, as in the typical usage this field is not changed. For example, I prefer to instantiate this class as well as lexer, parser and other things once, and on a new input do reset on everything and rerun collectCandidates with different caretTokenIndex and context. But maybe I should grep over antlr-c3 usages over the GitHub and check how it is used by other people.

Also, I would like to say that this behavior of getFollowingTokensis strange. Because of ignoredTokens we can produce invalid following sequence, for example, {SET: [ID, NUMBER] } instead of {SET: [ID, ASSIGN, NUMBER]}. In suggestions system it is quite unexpected output. I think, it is more logical to return tokens until the first ignored one, for example {SET: [ID]}, or don't check for token ignorance at getFollowingTokens at all. Because I thought that ignoredTokens exists to exclude a token from that tokens: Map<Token, Following> keys.

mike-lischke commented 1 month ago

That will take some time for me to analyze and understand. Currently I have no bandwidth to dive deep into the c3 algorithm again.

mike-lischke / antlr4-c3

Fix C++ port behavior divergence with reference implementation #136

Description

Related issues and MRs