github-linguist / linguist

Language Savant. If your repository's language is being reported incorrectly, send us a pull request!
MIT License
12.36k stars 4.28k forks source link

Unmatched "end" in grammar spills out of "while" #7015

Closed DanTup closed 2 months ago

DanTup commented 3 months ago

I found a difference in behaviour between VS Code / GitHub syntax highlighting that I originally thought was VS Code's issue, but after some comments on https://github.com/microsoft/vscode/issues/189940 I'm now not so sure.

The issue is when a rule uses begin/end (for example triple-backtick code blocks) that is nested inside a rule using while1 (for example comments). It appears that in VS Code, ifendnever matches, it still never expands outside of thewhile` rule, however on GitHub it does.

Example rules:

        "dartdoc": {
            "patterns": [
                {
                    "contentName": "variable.other.source.dart",
                    "begin": "```",
                    "end": "```"
                }
            ]
        },
        "comments": {
            "patterns": [
                {
                    "name": "comment.block.documentation.dart",
                    "begin": "///",
                    "while": "^\\s*///",
                    "patterns": [
                        {
                            "include": "#dartdoc"
                        }
                    ]
                }
            ]
        },

This colours comments green, and code blocks light blue. When the triple-backticks are closed, it looks like this in VS Code:

image

And if they are unclosed, they close at the end of the comment (using while):

image

However on GitHub, the unclosed triple backticks run to the end of the document (starting on line 10):

image

I previously thought the issue was VS Code's, but the comment at https://github.com/microsoft/vscode/issues/189940#issuecomment-2312138491 suggests that while doesn't allow a begin/end to escape outside of it. On the surface this sounds sensible, so I thought it was worth filing here.

(If this isn't the right repo for issues like this affecting the GitHub UI, I apologise, please point me in the right direction! Thanks!)

lildude commented 3 months ago

I'm pretty sure this is an issue with the grammar attempting to interpret these blocks and not ignoring them as comments. Other grammars don't attempt to intepret content in comments.

Here's Ruby which is using the treesitter grammar directly in the syntax highlighting engine:

# this is comment
# ```
# this is code
# ```
# this is comment
def foo
# this is comment
# ```
# this is code
# 
# this is comment
def foo

And Shell which uses an old Textmate grammar shipped with Linguist:

# this is comment
# ```
# this is code
# ```
# this is comment
def foo
# this is comment
# ```
# this is code
# 
# this is comment
def foo

I believe the Dart grammar should be doing the same but @Alhadis is more knowledgable with dealing with the intricacies of GitHub's old syntax highlighter than me.

DanTup commented 3 months ago

I'm pretty sure this is an issue with the grammar attempting to interpret these blocks and not ignoring them as comments. Other grammars don't attempt to intepret content in comments.

I'm not sure I understand - although I used comments as an example, I don't think that semantics of what a comment means is relevant here. I could just as easily have used some other marker that could be some concept made up by my language, and not a "comment".

I think the question here is just whether begin/end captures should be allowed to break out of while captures. If so, then I think there's a bug in VS Code's implementation, and if they are not, I think there's a bug here. I can't find anything concrete to satisfy me one way or the other, but I'd really like GitHub and VS Code to agree before I try to change the Dart grammar because I might end up going in the wrong direction.

lildude commented 3 months ago

Ah. Sorry, I have no idea. Maybe @Alhadis knows.

For what it's worth, the old syntax highlighter that GitHub uses for Textmate-based grammars (called prettylights - it's not public) is no longer being maintained so if GitHub is wrong, it's been this way for a very very very long time and won't be changing.

DanTup commented 3 months ago

For what it's worth, the old syntax highlighter that GitHub uses for Textmate-based grammars (called prettylights - it's not public) is no longer being maintained so if GitHub is wrong, it's been this way for a very very very long time and won't be changing.

Ah, good to know - I wasn't familiar with exactly what's used where. If it has to remain broken on GH because of a bug in prettylights, so be it - but I don't want to try to fix it here and break VS Code in the process without being sure it's VS Code at fault (and therefore it might be fixed there). Thanks!

RedCMD commented 3 months ago

from what I last heard The TextMate engine has not been archived and is still taking bug reports https://github.com/github-linguist/linguist/issues/6668#issuecomment-1967530241

I'm not sure if this is what GitHub uses https://www.npmjs.com/package/pretty-lights but it seems pretty-lights is not the TextMate engine, merely the CSS styler which would make it unrelated to this bug report

lildude commented 3 months ago

from what I last heard The TextMate engine has not been archived and is still taking bug reports #6668 (comment)

Correct, it hasn't been archived but AFAIK it is not actively being worked on in favour of keeping the lights on whilst switching to tree-sitter grammars.

That said, @patrickt might also be a good one to answer the original question about the correct behaviour.

I'm not sure if this is what GitHub uses https://www.npmjs.com/package/pretty-lights but it seems pretty-lights is not the TextMate engine, merely the CSS styler which would make it unrelated to this bug report

That's not it; the repo is private. We don't use that npm pkg either but if you're curious about where the stylesheets come from, they come from https://github.com/primer/github-syntax-light

patrickt commented 3 months ago

👋 We’re not currently using tree-sitter-dart in our syntax highlighting pipeline—it looks fairly unmaintained, so I’d be against relying on it. However, fixes on the TextMate grammar side should percolate into our highlighting pipeline as a whole. I forget exactly how that works; let me dig in.

Alhadis commented 2 months ago

Ah. Sorry, I have no idea. Maybe @Alhadis knows.

@lildude This matter concerns a rather oblique feature of TextMate grammars (begin/while pairs) that I admit I have little familiarity with. However, PrettyLights's behaviour appears to be correct here: the comment by @RedCMD seems to indicate an implementation difference in VSCode's handling of begin/while pairs, a discrepancy that—given the sketchy and under-documented nature of this particular feature—comes as no surprise.

For this reason, I strongly advise against using while when possible. Instead, I'd use an end pattern with a negative lookahead (end: "^(?!\\s*///)"), and I'd rewrite the embedded rules to short-circuit on the same pattern:

Click to toggle ~~~json "dartdoc": { "patterns": [{ "name": "comment.block.documentation.dart", "begin": "^\\s*(///)\\s*(```)", "end": "^\\s*(///)\\s*(```)|^(?!\\s*///)", "contentName": "variable.other.source.dart", "beginCaptures": { "1": {"name": "punctuation.definition.comment.dart"}, "2": {"name": "punctuation.section.code-fence.begin.dartdoc"} } }] }, "comments": { "patterns": [{ "name": "comment.block.documentation.dart", "begin": "^\\s*///", "end": "^(?!\\s*///)", "patterns": [{"include": "#dartdoc"}] }] } ~~~
DanTup commented 2 months ago

seems to indicate an implementation difference in VSCode's handling of begin/while pairs, a discrepancy that—given the sketchy and under-documented nature of this particular feature—comes as no surprise.

For this reason, I strongly advise against using while when possible

While this may work, I feel like it would be better for everyone if we could try to come to some concensus about what the behaviour should be. There are many grammers and many clients, and if the "spec" is vague, it would be better to at least have some agreement between the most popular implementations than to have to avoid the feature.

If VS Code + GitHub disagree, maybe there are other common implementations we could review to see if there's a most common interpretation that the others would agree to?

Alhadis commented 2 months ago

While this may work, I feel like it would be better for everyone if we could try to come to some concensus about what the behaviour should be. There are many grammers and many clients, and if the "spec" is vague, it would be better to at least have some agreement between the most popular implementations than to have to avoid the feature.

I plan on penning an authoritative spec for the TextMate grammar format, based on past and present implementations of its highlighting system. However, I'm still waiting to hear back from @lildude regarding access to the PrettyLights repository; I'll shoot him an e-mail with a follow-up as he's probably forgotten. 😁

If VS Code + GitHub disagree

I believe it's just VSCode that deviates from other implementations regarding begin/while blocks. I'll need to check, as the truncation of unclosed begin/end blocks within begin/while rules strikes me as unusual.

RedCMD commented 2 months ago

The 1.X TextMate manual does not mention while https://macromates.com/manual/en/ The 2.0 TextMate manual mentions while once. But only states that it is related to begin language-grammars

TextMate's public repo https://github.com/textmate/textmate TextMate.md states

\G matches end of last stacked while

Which VSCode does do, and I assume PrettyLights does as well VSCode does however clear the \\G anchor upon the next rule match. idk what PrettyLights and TextMate does

Looking at TextMates example test grammar It seems while also places a \\G anchor at the beginning of the next line. Which VSCode definitely does not currently do, and I don't think PrettyLights does either. The example also seems to describe that the begin/end rule places its scopes overtop the while capture. Which PrettyLights does do, but VSCode does not.

private.h header file grammar.cc loads grammar file etc parse.cc parses document

https://github.com/textmate/textmate/blob/master/Frameworks/parse/src/parse.cc#L408-L411 seems to indicate that a failed while rule will return the stack back to its own parent, terminating any inside begin/end rules. tho I'm not sure

https://github.com/arquivolta/desktop/blob/main/lib/platform/win32/arch_to_rootfs.dart#L23-L157 https://github.com/dart-lang/markdown/blob/cc57a08156356d7fc8c06c73de972c58ddb39a87/lib/src/block_parser.dart#L422-L1106

DanTup commented 2 months ago

@Alhadis

I plan on penning an authoritative spec for the TextMate grammar format, based on past and present implementations of its highlighting system.

This sounds like a great idea - having something to use as a reference across implementations may avoid these kinds of issues.

I believe it's just VSCode that deviates from other implementations regarding begin/while blocks. I'll need to check, as the truncation of unclosed begin/end blocks within begin/while rules strikes me as unusual.

If this is the case (and you know of other implementations behaving like PrettyLights), it would be worth posting in https://github.com/microsoft/vscode/issues/189940 (I feel you're more qualified than I) as their response was that this is not a VS Code issue, and I think it's better for all if all implementers can agree on something.

That said, I don't know if @RedCMD's post changes anything here - I don't understand all the details, but on the surface it seems to make some good points (thank you @RedCMD for the input!).

lildude commented 2 months ago

I see this has been confirmed as a bug in VSCode as PrettyLights is matching TextMate 2's behaviour.

Closing as there's nothing more to do here.

DanTup commented 1 month ago

@Alhadis I tried implementing your example from above (https://github.com/github-linguist/linguist/issues/7015#issuecomment-2321028904) but I see a problem and I'm not sure whether it's because the parser I'm using is bad (it's a custom parser to run some automated tests), or whether the sample was just incomplete.

Given code like:

/// ```
/// foo
/// ```

The variable.other.source.dart scope begins with /// ``` and ends with /// ``` which means all of the line prefixes /// inside the code block are marked as variables variable.other.source.dart so the comment markers will be highlighted incorrectly.

I think I could add a nested pattern to set them back as comments, but then the scope will be something like "comment variable comment". This feels clunky so I'm wondering if it's really the best way or if I've overlooked something simpler?

Thanks!

DanTup commented 1 month ago

nm, I think I found a way, but removing contentName from that match and using a nested pattern to only apply it to the code after the comment markers :)

{
    "begin": "^\\s*(///)\\s*(```)",
    "end": "^\\s*(///)\\s*(```)|^(?!\\s*///)",
    "patterns": [
        {
            "include": "#dartdoc-codeblock"
        }
    ]
},

// ...
"dartdoc-codeblock": {
    "begin": "^\\s*(///)\\s*",
    "end": "\n",
    "contentName": "variable.other.source.dart"
},