github-linguist / linguist

Language Savant. If your repository's language is being reported incorrectly, send us a pull request!
MIT License
12.12k stars 4.2k forks source link

Outstanding Grammar Issues #3924

Open vmg opened 6 years ago

vmg commented 6 years ago

The following is a detailed list of all the outstanding issues in the grammars that GitHub.com uses for syntax highlighting the code in our website.

These issues are detected by our grammars compiler (https://github.com/github/linguist/pull/3915) and are probably causing minor rendering bugs in the website.

Help is very much welcome! If you're seeing bugs or rendering issues in your source code in GitHub, please start by taking a look at this list to make sure we're not detecting any issues in your language's grammar.

Feel free to ask any questions about any given issue and what would be the appropriate way to fix it. I'll keep the issue up-to-date as I work through grammar fixes myself.

cc @github/linguist @pchaigno @Alhadis


Last updated: 2 Sep 2024


Other

lildude commented 6 years ago

Thanks for the useful and actionable list @vmg. One question...

repository vendor/grammars/language-babel (from https://github.com/gandm/language-babel) (14 errors)

Is this the actual upstream grammar as it stands now or the old pinned copy we're shipping with Linguist at the moment?

I'm assuming the latter, but thought I'd check to be sure.

vmg commented 6 years ago

I'm assuming the latter, but thought I'd check to be sure.

Correct! I fucked up my submodules. Sorry about that, I'll update the list with the proper URL.

Alhadis commented 6 years ago

Hey, sorry about the (really) slow response. My MacBook died last week, which means I've been painfully limited in what I'm able to do on GitHub (I'm using my work's computer for the time being, when time permits).

The issues reported by my grammars are easily fixed; but the LilyPond grammar should be fixed with a PR to the upstream repository. Basically, the scope.AtLilyPond should be replaced with just scope.lilypond so it's consistent with other LilyPond grammars (and I can therefore replace the offending rule in text.roff with a single inclusion: {include: "source.lilypond"}.

Alhadis commented 6 years ago

What's the maximum token length permitted by the new compiler? I was about to start fixing the issues with Emacs Lisp's grammar, but realised I don't have the actual limit to go by.

Admittedly, I'm not really fond about the size limit, because the "fix" here is to simply break the pattern down into multiple rules that're bunched together under the same name. It feels terribly hacky, and the fact that the rules in question were compiled from an external source means that updating the list in future might be made more complicated...

vmg commented 6 years ago

@Alhadis: Sorry, you caught me on Holidays. The maximum size is enforced by PCRE, not by our parser, and it's 64kb for a single regexp. I'm aware it's a bummer, but it's the way PCRE was designed.

Alhadis commented 6 years ago

That's understandable. How is whitespace treated inside expressions which use "expanded" notation...?

m/
    abc
    (?:
        xyz
    )
    (?=\w+)
/x;

Because there are two different ways to represent that in CSON. One is with an ordinary quoted-string, which includes embedded newlines as part of the pattern...

pattern: "(?x)
    abc
    (?:
        xyz
    )
    (?=\\w+)
"

... and the other is to use triple-quoted strings ("heredocs"):

pattern: """(?x)
    abc
    (?:
        xyz
    )
    (?=\\w+)
"""

The latter will strip as much indentation as it can, leaving some (but not all) horizontal whitespace after the CSON-to-JSON conversion:

(?x)
    abc
    (?:
            xyz
    )
    (?=\\w+)

Now this won't make any difference to the regex engine, but it will to my subdivision efforts... :grinning:

vmg commented 6 years ago

@Alhadis I'm honestly not sure of how exactly does PCRE implement this -- you should be able to test it out by simply downloading libpcre and trying to compile in the regexps. Our parser has no custom behavior here.

Alhadis commented 6 years ago

Okay, that's the last of my grammars fixed. πŸ˜‰

pchaigno commented 6 years ago

@Alhadis You closed this by mistake, right? :smile_cat:

Alhadis commented 6 years ago

Yeah, sorry. I didn't even notice I'd pressed the wrong button to comment. My mistake. πŸ˜“

vmg commented 6 years ago

@Alhadis :bow::bow::bow:

lildude commented 6 years ago

I've updated the sublime-mask output in the OP as the latest grammar compiler now prefer the "compiled" .tmLanguage file over the YAML file and sublime-mask has only updated the YAML file. I have pinged the author in https://github.com/tenbits/sublime-mask/pull/1 asking them to update the mask.tmLanguage file too.

pchaigno commented 6 years ago

repository vendor/grammars/oz-tmbundle (from https://github.com/eregon/oz-tmbundle) (1 errors) Grammar conversion failed. File Originals/Oz.tmLanguage failed to parse: XML syntax error on line 23: expected element name after <

@vmg Do you have more information on that error? I don't see anything weird in the XML format...

Alhadis commented 6 years ago

I don't see anything weird in the XML format...

It's mixing tabs and spaces for indentation, but other than that, I can't see anything wrong either... :confused:

vmg commented 6 years ago

The issue is that it's loading Syntaxes/Oz.tmLanguage and Originals/Oz.tmLanguage. The grammar in Originals is not even valid XML, but has the same extension, so we attempt to load both.

This is fixable in the compiler (most likely fix: ensure that the .tmLanguage file is in a Syntaxes folder -- although that could break other grammars). I can't get to that this week, but if anybody wants to pick it up it should be a pretty straightforward fix. The Go codebase is pretty simple! πŸ˜„

pchaigno commented 6 years ago

Ok. Thanks for the explanation @vmg! Won't have time either this week, but I'll try to look into it later. I need to better understand the Go code anyway.

Alhadis commented 5 years ago

IMHO, the Missing include in grammar: errors should be downgraded to warnings; or, at the very least, passed through a whitelist of manual approvals (similar to how we're whitelisting certain license hashes).

It's currently impossible for a grammar package to support both GitHub and their respective editors if two grammars disagree on their scopeName . For example, I can't publish this or even view its Lex highlighting locally correctly because Atom's C++ grammar uses source.cpp as its scopeName, whereas Linguist's C++ grammar uses source.c++ instead.

The logical solution would be to include both:

patterns: [
    {include: "source.c++"}
    {include: "source.cpp"}
]

... which the compiler will flag as an error.

eric-wieser commented 3 years ago

vendor/grammars/Lean.tmbundle is no longer part of the repo so can be removed from this list.

lildude commented 3 years ago

The overall list needs updating, but Lean isn't in the clear. IIRC the new grammar has even more issues than the old one currently listed πŸ˜‰

I'll update the list tomorrow.

lildude commented 3 years ago

I've updated the OP with the current, as of 2 Jun 2021, grammar issues.

eric-wieser commented 3 years ago

syntaxes/lean-markdown.json is copied from a vs-code markdown grammar, and modified to exit the markdown scope early, to support things like /- a comment with unpaired `markdown syntax -/ without ` consuming the -/ and the rest of the document.

It produces the errors you see here presumably because the dependencies of the vscode markdown grammar are also missing. I don't think there's much we can do beyond maintaining an entirely separate grammar for github.

Alhadis commented 3 years ago

@lildude Could we downgrade "Missing include in grammar:… the scope cannot be found" errors to build-time warnings, please? The scenario @eric-wieser described is far from an isolated case, or even an unlikely one. Many grammars will attempt to include community grammars if they're available, and not all of them will be recognised by Linguist. It's different to include:Β #someRule or include: recognised-grammar-scope#misspelt-rule, which are obviously errors that need to be brought to upstream's attention.

Moreover, I'd ditch the Unknown keys in grammar errors too, because every grammar format we support uses general-purpose data formats that may support additional case-specific data unrelated to syntax highlighting. Several times I've wanted to attach inline metadata to my grammars, but held off knowing it'd cause issues with Linguist. As in, harmless stuff like this:

block:
    name:  "meta.block"
    begin: "(\\w+)\\s*({)"
    end:   "}"
    beginCaptures:
        1:
            name: "entity.name.block"
        2:
            name: "punctuation.definition.block.begin"
            fold: ">1"
            indent: "increase"
    endCaptures:
        0:
            name: "punctuation.definition.block.end"
            fold: "<1"
            indent: "decrease"
lildude commented 3 years ago

Could we downgrade "Missing include in grammar:… the scope cannot be found" errors to build-time warnings, please?

It's on my massive todo list 😁. I've even got a branch locally that I've started this but haven't got it into a workable state yet as I've not had the time.

Moreover, I'd ditch the Unknown keys in grammar errors too, because every grammar format we support uses general-purpose data formats that may support additional case-specific data unrelated to syntax highlighting.

I've considered this, but I'm not sure if this could accidentally let problems through without anyone noticing. I'd need to do more research.

ghost commented 3 years ago

The MATLAB "Unknown keys in grammar" error is resolved in mathworks/MATLAB-Language-grammar#38 πŸ˜„ along with many other improvements

ghost commented 3 years ago

@lildude when you get the time, please βœ… the MATLAB tasks πŸ˜„

tmillr commented 1 year ago

I'm not sure where to post this so I'm just gonna post it here. I came across a TypeScript rendering/syntax highlighting issue the other day here on GitHub. I sent in a support ticket and they redirected me to this repo.

The issue can be viewed here. Thanks

lildude commented 1 year ago

@tmillr grammar problems should be reported with the upstream maintainers. In the case of TypeScript it is this repo. As this is a tree-sitter grammar, Linguist has no control over the updates so it's possible this issue has been fixed but not yet pulled into GitHub.

DaelonSuzuka commented 1 year ago

@lildude The Godot Engine grammars should have been fixed in https://github.com/godotengine/godot-vscode-plugin/pull/416.

I attempted to validate the fixes using the grammar compiler instructions you gave us, but I don't know how to actually run linguist to make 100% sure that they're working now.

lildude commented 1 year ago

I don't know how to actually run linguist to make 100% sure that they're working now.

Running Linguist wouldn't help you as it doesn't actually do the highlighting. This is done by an internal service so you can only go based on what the validator says, or not.

dragoncoder047 commented 1 year ago

The Python grammar referenced by this repo (MagicStack/MagicPython@7d0f2b22a5ad8fccbd7341bc7b7a715169283044) includes support for the new Python match/case keywords but I'm not seeing them highlighted (see here). Is this just caused by the latest Linguist not being deployed to the Github backend servers (why would it take three months to do that??), or is there some other bug?

Alhadis commented 1 year ago

@dragoncoder047 Unfortunately, GitHub uses a specialised Tree-Sitter parser for Python; the MagicPython grammar is only used to highlight code-blocks in comments.

dragoncoder047 commented 1 year ago

@dragoncoder047 Unfortunately, GitHub uses a specialised Tree-Sitter parser for Python; the MagicPython grammar is only used to highlight code-blocks in comments.

I guess it's a bug in the grammar then. It's already been reported (tree-sitter/tree-sitter-python#141) so I won't bother re-filing it.

2colours commented 1 year ago

Hello hello,

I don't know how this issue works but we supposedly eliminated the \p problems from the Raku repository. I'm curious about the result...

lildude commented 1 year ago

@2colours things look better, but it doesn't appear all have been resolved yet:

➜ git submodule update --remote vendor/grammars/atom-language-perl6
Submodule path 'vendor/grammars/atom-language-perl6': checked out '190e4b38d53548b23263f9c399cd5172421aa057'
➜ script/grammar-compiler update -f
latest: Pulling from linguist/grammar-compiler
[...]
Status: Downloaded newer image for linguist/grammar-compiler:latest
docker.io/linguist/grammar-compiler:latest
 442 / 442  100.00% 8s
done! processed 442 grammars

- [ ] repository `vendor/grammars/atom-language-perl6` (from https://github.com/perl6/atom-language-perl6) (4 errors)
  - Invalid regex in grammar: `source.raku` (in `grammars/raku.tmLanguage.json`) contains a malformed regex (regex "`(?x) ( [\p{Digit}\pL\pM'\-_]+ ) `...": unknown property name after \P or \p (at offset 16))
  - Invalid regex in grammar: `source.raku` (in `grammars/raku.tmLanguage.json`) contains a malformed regex (regex "`[\p{Digit}\pL\pM'\-_]+`": unknown property name after \P or \p (at offset 9))
  - Invalid regex in grammar: `source.raku` (in `grammars/raku.tmLanguage.json`) contains a malformed regex (regex "`(?x)(?<!\\)(\$|@|%|&)(?!\$)(`...": unknown property name after \P or \p (at offset 141))
  - Invalid regex in grammar: `source.raku` (in `grammars/raku.tmLanguage.json`) contains a malformed regex (regex "`(?x)(\$|@|%|&)(\.|\*|:|!|\^|~|`...": unknown property name after \P or \p (at offset 131))
[...]
2colours commented 1 year ago

Not gonna lie, I'm perfectly clueless how Digit could remain when I even noted that it can/should be replaced to Nd... anyway, soon to be addressed.

EDIT: here goes nothing... should be good now 🀞

2colours commented 1 year ago

@lildude Ping?

lildude commented 1 year ago

@2colours Pong? 🀣

All looks good now. You'll see the benefit (if there's anything noticeable) when the next release is made. Thanks for addressing these issues πŸ™‡

AdamRaichu commented 8 months ago

Should I be mentioning new issues I find in this thread?

lildude commented 8 months ago

@AdamRaichu No. This is only for issues picked up by the grammar compiler.