github-linguist / linguist

Language Savant. If your repository's language is being reported incorrectly, send us a pull request!
MIT License
12.36k stars 4.27k forks source link

Regex messes up highlighting for most of the rest of the page #2839

Closed TWiStErRob closed 7 years ago

TWiStErRob commented 8 years ago

This regex messes up highlighting for most of the rest of the page:

state :descriptors do
# a full-form tag
rule /!<[0-9A-Za-z;\/?:@&=+$,_.!~*'()\[\]%-]+>/, Keyword::Type
# a tag in the form '!', '!suffix' or '!handle!suffix'
rule %r(
  (?:![\w-]+)? # handle
  !(?:[\w;/?:@&=+$,.!~*\'()\[\]%-]*) # suffix
)x, Keyword::Type
# an anchor
rule /&[\w-]+/, Name::Label
# an alias
rule /\*[\w-]+/, Name::Variable
end

from https://github.com/jneen/rouge/blob/master/lib/rouge/lexers/yaml.rb#L152

That apostrophe which starts the single quoted string (blue colored from there on) is in a character class (delimited by []), so it shouldn't have any special meaning. If I remove the apostrophe then the percent sign does the same. If I remove all special chars from the character class, it still breaks because of the >/ which I can't explain.

I reported it to GitHub support, which lead me here:

We use open source TextMate-style language grammars for syntax highlighting, which are available here: https://github.com/github/linguist/blob/master/grammars.yml Linguist pulls in grammar updates with each new release, which usually happens every couple of weeks.

So based on https://github.com/github/linguist/blob/master/grammars.yml#L480 The code is at https://github.com/aroben/ruby.tmbundle/tree/4636a3023153c3034eb6ffc613899ba9cf33b41f

The problem is that I see that it is a fork and is on a branch. It's about 20 commits behind and 2 minor commits ahead of the official master which is so much improvement in the offical that GitHub can't even show the diff: https://github.com/aroben/ruby.tmbundle/compare/pl...textmate:master

Syntaxes/Ruby.plist 2,663 additions, 1,710 deletions not shown because the diff is too large. Please use a local Git client to view these changes.

It's possible that this issue was already fixed by those 20 commits, would it be possible to pull those 2 minor changes to the original repo and use that so the updates are received from other contributors?

/cc @aroben

aroben commented 8 years ago

Current highlighting: https://github-lightshow.herokuapp.com/?utf8=✓&scope=from-url&grammar_url=https%3A%2F%2Fgithub.com%2Faroben%2Fruby.tmbundle%2Fblob%2Fmaster%2FSyntaxes%2FRuby.plist&grammar_text=&code_source=from-url&code_url=https%3A%2F%2Fgithub.com%2Fjneen%2Frouge%2Fblob%2Fmaster%2Flib%2Frouge%2Flexers%2Fyaml.rb&code=

If we use textmate/ruby.tmbundle instead, it looks like this: https://github-lightshow.herokuapp.com/?utf8=✓&scope=from-url&grammar_url=https%3A%2F%2Fgithub.com%2Ftextmate%2Fruby.tmbundle%2Fblob%2Fmaster%2FSyntaxes%2FRuby.plist&grammar_text=&code_source=from-url&code_url=https%3A%2F%2Fgithub.com%2Fjneen%2Frouge%2Fblob%2Fmaster%2Flib%2Frouge%2Flexers%2Fyaml.rb&code=

As far as I can tell the rendering is the same regardless of which version of the grammar is used, so moving off of our fork probably wouldn't help. (I'd still like to move off the fork though!)

TWiStErRob commented 8 years ago

Wow, nice tool! Yeah, I didn't see any particular commit related to this, but the diff was so big I hoped for it. Can I help in any way here?

aroben commented 8 years ago

If you want to fix the regex bug you reported, feel free to work the grammar and try out changes using Lightshow. Moving off our fork is waiting on https://github.com/textmate/ruby.tmbundle/pull/73; perhaps a comment in that PR could help move things along.

TWiStErRob commented 8 years ago

I tried to test the editable grammar in Lightshow but I was getting HTTP 414 Request-URI Too Long from the server. I even tried if zipping the grammar source would be a viable option, but that's too long too:

<input type="hidden" name="grammar_text">
<textarea id="grammar_text_raw" class="js-optional-field code form-control hidden">
...
<script language="text/javascript">
submitter.addEventListener("submit", function() {
    var zip = new JSZip();
    zip.file("source.txt", document.getElementById('grammar_text_raw').value);
    submitter.grammar_text.value = zip.generate({type:"base64", compression:"DEFLATE"});
});

I think the submit method should be changed to POST for long texts, it doesn't work now anyway so you wouldn't lose the linkabily, just allow a little more leeway for use cases.

Anyway, after this I just uploaded a file to somewhere I can edit, and also managed to make the example smaller and runnable, because Sublime Text has the same issue, here's the minimal code (lightshow):

puts %r([0-9'])
puts /[0-9']/ # all from here is part of a string until the next apostrophe
puts "hello"
# this is still a comment, ' but highlighted as code: def x = 1+2

The fix I found was that /.../ needs to be matched at the same time as the %r×...× (string.regexp.mod-r.ruby) versions. Sadly I don't know enough about ruby nor syntax grammars to know if this is the correct approach.

pchaigno commented 7 years ago

@TWiStErRob Thanks for working on this! Please report your findings to the grammar repository, as it is not something that can be fixed in Linguist.