jeff-hykin / better-cpp-syntax

💾 The source of VS Code's C++ syntax highlighting
GNU General Public License v3.0
154 stars 29 forks source link

#endif incorrectly closes comments. #359

Open matter123 opened 5 years ago

matter123 commented 5 years ago

Comments are replaced with whitespace in phase 3 of translation. The preprocessor is executed in phase 4 of translation. Therefore with regards to preprocessor conditionals, comments that contain preprocessor instructions should be ignored.

This likely affects raw-string literals as well, but I am going to file a separate issue with that after investigating the phase 1 & 2 rewindings in raw string literals.

Example

#if 1
// the below comment completely consumes the #if 0/#endif
/* a comment
#if 0
this is still a comment
#endif
still a comment
*/ // should not be marked as invalid
#endif

Example Image:

Screen Shot 2019-08-19 at 09 10 23

From: https://github.com/microsoft/vscode-cpptools/issues/4104#issuecomment-522416336

matter123 commented 5 years ago

Only #endif causes an error with raw string literals, So I'll just piggyback on this issue.

#if 1
R"(
/* not a comment
#endif
)"
*/ // this is an illegal terminator
Screen Shot 2019-08-19 at 09 26 34
matter123 commented 5 years ago

This will be hard to fix without code duplication due to no way of allowing comments and raw string literals to overshoot the while pattern.

One potential solution is that the while pattern ends when a comment/raw string starts, and when that ends it picks back up, that requires some chaining of begin/end patterns, however.

Something like (hope this is clear)

conditional_context:
 start_pattern: # the ifdef pattern
 end_pattern: lookAheadFor(/./) # end this range the moment sub patterns do not consume a character
 includes:
  initial:
   start_pattern /\G/,
   while_pattern: # endif or lookAhead For (raw string or block comment)
  comment: #normal comment
  raw_string: # normal raw string
  continuation:
   start_pattern: LookBehindFor(comment end or raw string end)
   while_pattern: # endif or lookAhead For (raw string or block comment)
  white_space: # consumes whitespace to prevent premature closure of outer pattern
   match: /\s++/

Basically, initial, continuation, comment and raw_string, trade-off on marking up portions of the document covered by a conditional context. conditional_context's outer pattern keeps the entire thing bundled up.

Actually, this won't work if the comment or raw string is not the first non-whitespace character on the line. Another pattern could be added to consume the non-whitespace and just deal with the fact that multi-line patterns break with comments.

Edit: this has a much more serious issue, scopes are completely lost whenever a block comment appears. (the last character of initial, has a different rule stack than the first character to continuation)

matter123 commented 5 years ago

I'm fairly sure that the only true solution to this and #350 is replacing the begin/while rules with begin/end rules. I'm going to do a check to see if Textmate behaves the same with nested whiles with regard to #350.

But I don't see a fix without adding another ~320k of grammar to end patterns when #endif is found, as unless some other pattern can consume the #endif between the while checks its a moot point of which order while checks are performed.

Edit: Textmate closes while patterns in a top-down fashion, see https://github.com/microsoft/vscode-textmate/issues/110 for more information and an image of nested if/endif working correctly in Textmate

msftrncs commented 5 years ago

@matter123, I am just thinking out loud here, so to speak.

I haven't had much experience in C/C++ recently, from what you said above, are preprocessor directives allowed at all in block comments? If not, then WHILE rules will always be susceptible catching a commented out #endif.

From a concept point of view, is the TextMate CPP grammar file scoping the C++ code or the preprocessor directives? They are distinctly different concepts, and the TextMate grammar file can only scope one concept at a time.

Thinking about what is really happening.

  1. When a conditional preprocessor directive such as #ifdef is found, the grammar scoping engine needs to make a clone of the current scope stack. (not currently possibly in TextMate) Its completely possible that the scope at the end of each alternate could be different than at the start.
  2. As each alternative code path is found, the grammar scoping engine technically needs to resume from the clone of the scope stack and complete the scoping for the alternate path.
  3. Preferably when the final alternate is done, all alternates ended with the same scope stack, so the last clone could just be continued. Preferably, it would be possibly to detect the failure of each alternate to end with the same scope stack.

The current TextMate grammar engine has no way it can ever correctly handle 100% of the possible scenarios that conditional preprocessor directives could create. These are things that need to be considered in any future grammar engine.

matter123 commented 5 years ago

@msftrncs Yeah, the standard specifies that, preprocessor directives are handled after comments are removed, so anything resembling a preprocessor directives inside the block comment is not actually.

I think there is a solution to that, that involves splitting the preprocessor conditional block into code sections and block comment sections. https://github.com/jeff-hykin/cpp-textmate-grammar/issues/359#issuecomment-523076458 contains a simplified grammar example of that.

As far as dealing with alternate code paths, you would have to ask @jeff-hykin, I think it involves not closing scopes on #endif (the opposite of the current behavior), but I am not sure how it would work.

matter123 commented 5 years ago

I think if the begin/end for #if only closed with #else and #elseif, and this pattern did not add any scopes, then it should work as expected. The while stack would end up with some rules that will never close, but I am not sure that is an issue.

The biggest downside of this is that meta.preprocessor.conditional.* would have to go to fix/prevent a regression of #350. Investigation would be needed to see if those scopes are used.

msftrncs commented 5 years ago

Actually, this won't work if the comment or raw string is not the first non-whitespace character on the line. Another pattern could be added to consume the non-whitespace and just deal with the fact that multi-line patterns break with comments.

It could work, just a pattern would be needed that would consume all patterns that prevent multiline comments and raw-strings: single line comments, single line strings, single line block comments, etc, including to be sure to catch escape sequences. I have used a similar approach in my fork of PowerShell/EditorSyntax, to scan a line for possible arguments to a command, when the command could be a label, but needing to ignore comments. When no arguments are found, its a label. This has to happen in a single match rule since there is no was to backtrack rules.

matter123 commented 5 years ago

Yes, but the more serious issue with that approach is that comments/raw string would cause the rule stack to lose entries.

Imagine the following c++ snippet

#if foo
enum bar {
ENUM_MEMBER_1,
ENUM_MEMBER_2,
/*
multiline comment
with a preprocessor directive
#endif
*/
ENUM_MEMBER_3,
}
#endif

Currently, the grammar keeps track that it is in an enum to tag identifiers as variable.other.enummember.cpp, If the conditional section is split up into code sections and comment sections, the rule stack at ENUM_MEMBER_3 is no longer in an enum context.

I am not sure that there is a solution that doesn't involve duplicating all the begin/* rules to end when there is a #endif.

matter123 commented 5 years ago

It could work, just a pattern would be needed that would consume all patterns that prevent multiline comments and raw-strings: single line comments, single line strings, single line block comments, etc, including to be sure to catch escape sequences.

What follows is mostly a summary of why various things won't work.

The issue is, to correctly deal with alternate code paths, the conditional rule may need to close all sub scopes, so a while pattern must be used. The only thing that a while rule can be predicated is the lack of a preprocessor directive on this line. This check for no directive ignores any other rules on the stack. Therefore, comments normally do not correctly prevent the discovery of preprocessor directives.

This implies that the conditional rule is also predicated on no comments, that start on this line and continue past this line. This poses another problem. The conditional context and the rules above it are now closed, even if those rules change the context of the grammar. Additionally due to the way while works this affects any character that is on the line that terminates the while rule.

Ignoring preprocessor directives does not work, as conditional preprocessor directives can provide alternate code paths, that if, parsed together would be a syntax error.

Imagine:

#if __has__explicit_bool()
explicit(foo) Bar(Foo) {
#else
explicit Bar(Foo) {
#endif

The only disadvantage to duplicating the begin/* rules is that, the addition is somewhere around 300kb of new patterns, minified.

jeff-hykin commented 5 years ago

I have not caught up on this thread yet, I just wanted to let you know @matter123, I removed the while from the preprocessor conditional and it fixed this issue. This issue might still be relevant though, and we might go back to the while pattern