Closed eric-wieser closed 3 years ago
Perhaps the short-term ask here is for Linguist to document which regex engine it uses.
You mean like this? 😉
You can also try to fix the bug yourself and submit a pull-request. TextMate's documentation offers a good introduction on how to work with TextMate-compatible grammars. Note that Linguist uses PCRE regular expressions, while TextMate uses Oniguruma. Although they are mostly compatible there might be some differences in syntax and semantics between the two. You can test grammars using Lightshow.
This is really an implementation decision made on GitHub.com, since PCRE is much more performant, so we can't change things in Linguist.
Workaround: replace [a-z&&[^aeiou]]+ with ((?![aeiou])[a-z])+ instead. 😉
Or, if the universal set allows, write them out in full: [b-df-hj-np-tv-z]+ (this tends to be more performant than the aforementioned hack).
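For anyone who wants to sanity-check the two workarounds against each other, here is a minimal sketch in JavaScript. It assumes the lookahead form and the written-out form are both meant to match lowercase ASCII consonants; the names viaLookahead, writtenOut and agreeOnAllLetters are made up for illustration and are not part of Linguist or any grammar.

```javascript
// Two ways to emulate the intersection [a-z&&[^aeiou]] in an engine
// that lacks the && operator. Both names are illustrative only.
const viaLookahead = /^(?:(?![aeiou])[a-z])+$/; // reject vowels, then take a-z
const writtenOut = /^[b-df-hj-np-tv-z]+$/;      // consonant ranges spelled out

// Check that both patterns agree on every single letter from a to z.
function agreeOnAllLetters() {
  for (let cp = 0x61; cp <= 0x7a; ++cp) {
    const ch = String.fromCodePoint(cp);
    if (viaLookahead.test(ch) !== writtenOut.test(ch)) return false;
  }
  return true;
}

console.log(agreeOnAllLetters());
console.log(viaLookahead.test("rhythm"), viaLookahead.test("rhyme"));
```

The negative lookahead costs one extra assertion per character, which is one plausible reason the written-out class tends to be faster.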
Not sure how I missed that bit of the docs, thanks. I'm using the aforementioned hack because I don't fancy trying to work out adjacent Unicode code points in my regex: [_a-zA-Zα-ωΑ-Ωϊ-ϻἀ-῾℀-⅏𝒜-𝖟0-9'ⁿ-₉ₐ-ₜᵢ-ᵪ&&[^λΠΣ]]
['0-8A-Y_a-yΑ-Ψα-κμ-ψϊ-Ϻᵢ-ᵩἀ-´ⁿ-₈ₐ-ₛ℀-ⅎ𝒜-𝖞]+
Generated from the following hacky JavaScript snippet:
function expandRanges(text){
	return text.replace(/([^-\\]|\\.)-([^-\\]|\\.)/gu, (text, from, to) => {
		text = "";
		from = from.codePointAt(0);
		to = to.codePointAt(0);
		for(let i = from; i < to; ++i)
			text += String.fromCodePoint(i);
		return text;
	});
}
... using a not-so-hacky JavaScript function I wrote a few nights ago, funnily enough. It doesn't check for things like a leading ^ or escaped carets, but hey, I said it was hacky…
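The function that turns an expanded character list back into ranges isn't shown in this thread. As a rough idea of what such a step could look like, here is a hypothetical compressRanges helper; the name and behaviour are assumptions for illustration, not the function referred to above. It does no escaping of ], ^, - or \, so it is only a sketch.

```javascript
// Hypothetical helper (not the function mentioned in the thread):
// collapse a string of characters into character-class ranges.
// Assumes the input contains no characters that need escaping.
function compressRanges(chars) {
  // Unique code points, sorted ascending.
  const points = [...new Set(Array.from(chars, c => c.codePointAt(0)))]
    .sort((a, b) => a - b);
  let out = "";
  for (let i = 0; i < points.length; ) {
    // Extend j to the end of the current run of consecutive code points.
    let j = i;
    while (j + 1 < points.length && points[j + 1] === points[j] + 1) ++j;
    const first = String.fromCodePoint(points[i]);
    const last = String.fromCodePoint(points[j]);
    // Runs of three or more become "a-c"; shorter runs are written out.
    out += j - i >= 2 ? `${first}-${last}` : j > i ? first + last : first;
    i = j + 1;
  }
  return out;
}

console.log(compressRanges("bcdfghjklmnpqrstvwxyz"));
```

Removing the excluded characters from the expanded list before compressing is what would give the effect of the && intersection.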
I think you have some off-by-one errors there; you stole my 9, z, and Z!
Oh for fuck sake, did I make that mistake again?
-for(let i = from; i < to; ++i)
+for(let i = from; i <= to; ++i)
Try that. I literally wasted an hour the other night trying to figure out what happened to a missing character…
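For reference, a sketch of the snippet with the inclusive bound applied, so the last character of each range is kept. It is still the same hack: it assumes no leading ^ and does not treat escaped endpoints such as \- specially. The variable names are tidied for readability and are not from the original.

```javascript
// Expand every "x-y" range in a character-class body into the full
// list of characters, including the upper endpoint.
function expandRanges(text) {
  return text.replace(/([^-\\]|\\.)-([^-\\]|\\.)/gu, (match, from, to) => {
    let out = "";
    const start = from.codePointAt(0);
    const end = to.codePointAt(0);
    // <= rather than <, so that "0-9" keeps its 9.
    for (let i = start; i <= end; ++i)
      out += String.fromCodePoint(i);
    return out;
  });
}

console.log(expandRanges("0-9"));
console.log(expandRanges("_x-z"));
```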
Closing as I believe this has been addressed.
Describe the bug
TextMate language grammars, as consumed by Linguist, are specified (at least, according to the VS Code docs) to use the "Oniguruma" regex engine, which supports a pattern like [_a-zA-Z&&[^w]].
Expected behaviour
The following language:
applied to the following input
should highlight both good and GOOD.
In VS Code, this is exactly what it does.
In https://github-lightshow.herokuapp.com/, it highlights only GOOD.
Related discussion
Additional notes