github-linguist / linguist

Language Savant. If your repository's language is being reported incorrectly, send us a pull request!
MIT License

&& is not supported in regular expressions #5404

Closed eric-wieser closed 3 years ago

eric-wieser commented 3 years ago

Describe the bug

TextMate language grammars, as consumed by Linguist, are specified (at least, according to the VS Code docs) to use the "Oniguruma" regex engine, which supports character-class intersection in patterns like [_a-zA-Z&&[^w]].

Expected behaviour

The following language:

{
  "name": "Broken",
  "scopeName": "source.broken",
  "patterns": [
     { "match": "[_a-z&&[^bad]]+",
       "name": "entity.name.function"},
     { "match": "[_CE-Z]+",
       "name": "entity.name.function"}
  ]
}

applied to the following input

good
bad
GOOD
BAD

should highlight both good and GOOD.

In VS Code, this is exactly what it does.

In https://github-lightshow.herokuapp.com/, it highlights only GOOD.

eric-wieser commented 3 years ago

Perhaps the short-term ask here is for Linguist to document which regex engine it uses.

lildude commented 3 years ago

Perhaps the short-term ask here is for Linguist to document which regex engine it uses.

You mean like this? 😉

You can also try to fix the bug yourself and submit a pull-request. TextMate's documentation offers a good introduction on how to work with TextMate-compatible grammars. Note that Linguist uses PCRE regular expressions, while TextMate uses Oniguruma. Although they are mostly compatible there might be some differences in syntax and semantics between the two. You can test grammars using Lightshow.

This is really an implementation decision made on the GitHub.com side: PCRE is much more performant, so this isn't something we can change in Linguist.
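
To make the failure mode concrete, here is a rough sketch of how an engine without class intersection might read the pattern from the report. It uses JavaScript's RegExp (without the `v` flag) purely as a stand-in; that PCRE parses the pattern the same way is an assumption, not something confirmed in this thread.

```javascript
// Stand-in engine: JavaScript RegExp without the `v` flag, which has no && operator.
// Assumption: PCRE reads this pattern in a comparable way.
const pattern = new RegExp("[_a-z&&[^bad]]+");

// Without intersection support, the class ends at the FIRST "]".
// It then contains: _ a-z & [ ^ (b, a, d are already covered by a-z).
// The second "]" becomes a literal that must occur one or more times.
const inputs = ["good", "bad", "GOOD", "BAD", "g]"];
const results = inputs.map(text => [text, pattern.test(text)]);

console.log(results);
```

Under that reading, none of the four inputs from the report can match the first rule, because none of them contains a literal `]`, while the artificial input `g]` does match.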

Alhadis commented 3 years ago

Workaround: replace [a-z&&[^aeiou]]+ with ((?![aeiou])[a-z])+ instead. 😉

Or—if the universal set allows—write them out in full: [b-df-hj-np-tv-z]+ (this tends to be more performant than the aforementioned hack).
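
As a quick sanity check, the two workarounds can be compared letter by letter. This is only a sketch in JavaScript, whose lookahead syntax matches PCRE for this simple case; the lookahead is written with the excluded set un-negated, `(?![aeiou])`, and the variable names are made up for illustration.

```javascript
// Two ways to express "a-z except vowels" without the && operator.
const viaLookahead = /^(?:(?![aeiou])[a-z])+$/; // refuse a vowel, then consume one letter
const writtenOut = /^[b-df-hj-np-tv-z]+$/;      // the same set spelled out as ranges

// Test every lowercase ASCII letter on its own.
const letters = Array.from({ length: 26 }, (_, i) => String.fromCharCode(97 + i));
const disagreements = letters.filter(ch => viaLookahead.test(ch) !== writtenOut.test(ch));

console.log(disagreements);               // letters on which the two forms differ
console.log(viaLookahead.test("rhythm")); // consonants only
console.log(viaLookahead.test("bad"));    // contains a vowel
```

The lookahead form costs an extra assertion per character, which is presumably why the written-out class tends to be faster.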

eric-wieser commented 3 years ago

Not sure how I missed that bit of the docs, thanks. I'm using the aforementioned hack because I don't fancy trying to work out adjacent unicode codepoints in my regex [_a-zA-Zα-ωΑ-Ωϊ-ϻἀ-῾℀-⅏𝒜-𝖟0-9'ⁿ-₉ₐ-ₜᵢ-ᵪ&&[^λΠΣ]]

Alhadis commented 3 years ago
['0-8A-Y_a-yΑ-Ψα-κμ-ψϊ-Ϻᵢ-ᵩἀ-´ⁿ-₈ₐ-ₛ℀-ⅎ𝒜-𝖞]+

Generated from the following hacky JavaScript snippet:

function expandRanges(text){
    return text.replace(/([^-\\]|\\.)-([^-\\]|\\.)/gu, (text, from, to) => {
        text = "";
        from = from.codePointAt(0);
        to   = to.codePointAt(0);
        for(let i = from; i < to; ++i)
            text += String.fromCodePoint(i);
        return text;
    });
}

... using a not-so-hacky JavaScript function I wrote a few nights ago, funnily enough. It doesn't check for things like a leading ^ or escaped carets, but hey, I said it was hacky…

eric-wieser commented 3 years ago

I think you have some off-by-one errors there, you stole my 9, z, and Z!

Alhadis commented 3 years ago

I think you have some off-by-one errors there

Oh for fuck sake, did I make that mistake again?

-for(let i = from; i < to; ++i)
+for(let i = from; i <= to; ++i)

Try that. I literally wasted an hour the other night trying to figure out what happened to a missing character…
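
For anyone wanting to reproduce the whole round trip, here is a minimal sketch of the idea with the inclusive bound applied: expand the ranges, drop the excluded characters, then fold consecutive code points back into ranges. `subtractClass` is a hypothetical helper written for this illustration, not the function referred to above, and this simplified `expandRanges` ignores backslash escapes.

```javascript
// Expand "a-z"-style ranges into every code point, INCLUDING both endpoints.
// Simplification: backslash escapes and a leading ^ are not handled.
function expandRanges(text) {
    return text.replace(/([^-])-([^-])/gu, (match, from, to) => {
        let expanded = "";
        const first = from.codePointAt(0);
        const last = to.codePointAt(0);
        for (let i = first; i <= last; ++i) // "<=" keeps the upper endpoint
            expanded += String.fromCodePoint(i);
        return expanded;
    });
}

// Hypothetical helper: remove `excluded` characters from the class body `ranges`,
// then compress runs of three or more consecutive code points back into ranges.
// The result is NOT escaped, so characters like "]" or "\" would need extra care.
function subtractClass(ranges, excluded) {
    const banned = new Set(excluded); // strings iterate by code point
    const points = [...new Set(expandRanges(ranges))]
        .filter(ch => !banned.has(ch))
        .map(ch => ch.codePointAt(0))
        .sort((a, b) => a - b);

    let out = "";
    for (let i = 0; i < points.length; ) {
        let j = i;
        while (j + 1 < points.length && points[j + 1] === points[j] + 1) ++j;
        out += String.fromCodePoint(points[i]);
        if (j > i) out += (j > i + 1 ? "-" : "") + String.fromCodePoint(points[j]);
        i = j + 1;
    }
    return out;
}

console.log(subtractClass("a-z", "aeiou")); // b-df-hj-np-tv-z
console.log(subtractClass("_a-z", "bad"));  // _ce-z
```

Wrapping the returned string in `[` and `]` gives a plain character class that needs no `&&` support.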

lildude commented 3 years ago

Closing as I believe this has been addressed.