github-linguist / linguist

Language Savant. If your repository's language is being reported incorrectly, send us a pull request!
MIT License

Linguist recognizes file specified in .gitattributes locally correctly but not on GitHub #4251

Closed EmilGedda closed 5 years ago

EmilGedda commented 6 years ago

Preliminary Steps

Please confirm you have...

Problem Description

GitHub does not correctly recognize a file as code (or text) even though it is specified in .gitattributes

The file type_traits is not correctly recognized as a C++ file on GitHub, even though my local Linguist install correctly recognizes it. The language percentage stats on GitHub are the same as what Linguist reports, but no syntax highlighting occurs.

The other file cstddef in the same directory is also defined in .gitattributes and is correctly recognized as C++.

I tried using a wildcard std/* in .gitattributes but it made no difference.

When listing all files by language neither file shows up, even though cstddef is correctly highlighted.

The only real difference between the files is the underscore in type_traits, but I can't imagine that is the issue.

URL of the affected repository: https://github.com/EmilGedda/kos

Last modified on: 2018-08-27

Expected language: C++

Detected language:

pchaigno commented 6 years ago

First, thanks for the report and for properly filling in the template!

This sort of issue usually has to do with the background thread that processes repositories lagging behind, and with cache issues for the search results. To check this, I 1) cloned the repository locally and pushed it to another GitHub repository and 2) cloned the repository locally, erased all git history, and pushed it to another GitHub repository. In both cases, I observed the exact behavior you did. I performed the second test only in case GitHub caches Linguist results across repositories.

Because of this, I'm thinking 1) the files don't appear in the search results because of the usual caching issues, and 2) the file may not be highlighted because of a grammar issue. We've had this second issue once before, for a very large file for which a timeout was hit. Even though the file is small here, it could still be a similar issue (not necessarily a timeout).

@lildude We're going to need your help here :bowing_man:

EmilGedda commented 6 years ago

Both of the files were pushed in the same commit, and all other new files in that commit got syntax highlighted. Does the highlighter start jobs per changed file rather than per commit?

I might be able to reduce type_traits to a more minimal example tomorrow that still breaks the highlighter, if it is not a caching issue, that is.

Sidenote: The file is a totally valid C++17 file.

lildude commented 6 years ago

I don't have much free time this week as I'm at a conference, but I've taken a quick look "behind the scenes", and this file is definitely being picked up as C++ by Linguist. I've forked your repo and reproduced the issue, and then "fixed" it - 🎉 https://github.com/lildude/kOS/blob/master/std/type_traits is highlighted.

So what did I do? I added a new line after the pragma line (I also added a new line before it with the same successful results). I've seen this issue before, though can't remember which issue or which grammar was involved (a search through the Linguist issues will probably find it) but I recall doing this same experiment then.

I'm not sure off the top of my head why this works.

pchaigno commented 6 years ago

It's kinda weird that it works fine with Lightshow without the changes... I was expecting a bug on the grammar's side.

lildude commented 6 years ago

It's kinda weird that it works fine with Lightshow without the changes... I was expecting a bug on the grammar's side.

Yeah, that's why I'm not sure what the cause is. I wish I could find the other issue in which I mentioned this as I may have left a few more clues there. If anything, it'd be nice to know if that case was C++ too... I have a sneaking suspicion it was.

lildude commented 6 years ago

Fun fun fun. Simply adding a space to the end of the pragma line solves the issue too.

lukateras commented 6 years ago

I think I hit a similar issue in https://github.com/serokell/mix2nix (see .gitattributes).

lildude commented 6 years ago

@yegortimoshenko I don't think your issue is quite the same. Your .nix.eex files are being detected as HTML as your override isn't taking effect. This isn't taking effect because you've got a typo in your .gitattributes:

*.nix.eex lignuist-language=Nix
            ^--- letter switcherooo 

Once you correct this, I'd expect the files to be correctly identified as Nix and highlighted accordingly.

One word of warning... there's currently a backlog of background jobs so it may take a while for the language stats to be updated.
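A misspelled attribute name like the one above fails silently, because Git accepts arbitrary attribute names. The following is a minimal sketch of a checker that flags near-misses. The function name `find_suspect_attributes` and the similarity cutoff are hypothetical choices made for illustration; the attribute list is taken from the overrides documented in the Linguist README, and the parsing ignores quoted path patterns.

```python
import difflib

# Override attributes documented in the Linguist README.
KNOWN_ATTRIBUTES = [
    "linguist-language",
    "linguist-vendored",
    "linguist-generated",
    "linguist-documentation",
    "linguist-detectable",
]


def find_suspect_attributes(gitattributes_text):
    """Return (line_number, attribute, suggestion) for likely misspellings.

    Hypothetical helper: it only handles simple, unquoted path patterns.
    """
    suspects = []
    for number, line in enumerate(gitattributes_text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        # The first field is the path pattern; the rest are attributes.
        for field in stripped.split()[1:]:
            name = field.lstrip("-!").split("=", 1)[0]
            if name in KNOWN_ATTRIBUTES:
                continue
            # The 0.8 cutoff is an arbitrary choice for this sketch.
            close = difflib.get_close_matches(
                name, KNOWN_ATTRIBUTES, n=1, cutoff=0.8
            )
            if close:
                suspects.append((number, name, close[0]))
    return suspects


sample = "*.nix.eex lignuist-language=Nix\nstd/* linguist-language=C++\n"
print(find_suspect_attributes(sample))
# prints [(1, 'lignuist-language', 'linguist-language')]
```

Unrelated attributes such as `text` or `eol=lf` fall well below the cutoff and are not reported.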

lildude commented 6 years ago

@yegortimoshenko I don't think your issue is quite the same.

I've forked your repo, corrected the .gitattributes and now the files are correctly classified in the language stats, but the syntax highlighting isn't being applied unless I make a change to one of those files. This leads me to believe we've got a caching issue somewhere on the GitHub side of things, somehow related to files initially being incorrectly identified and then overruled by .gitattributes.

I'm not familiar with the caching side of things so it may take me a while to track this down. I'll only be able to really dig into this late next week.

Pinging @vmg and @kivikakk just in case you're more familiar with the caching and if so can point me in the right direction.

EmilGedda commented 6 years ago

This leads me to believe we've got a caching issue somewhere on the GitHub side of things, somehow related to files initially being incorrectly identified and then overruled by .gitattributes.

Yes, this seems to be exactly it. I created a 2-commit repo where this caching issue manifests itself.

Repro:

  1. Commit a source file which Linguist is not able to classify
  2. git push
  3. override the linguist classification in .gitattributes
  4. git push

After the last push, the stats for the repo update correctly, but the highlighting does not. The file is successfully highlighted the next time the file itself is updated.

lildude commented 6 years ago

A quick update on this: this is definitely due to the way caching currently takes place on GitHub.com. The caching is based on the blob, so if the blob/file doesn't change, the cached content isn't updated and the syntax highlighting doesn't appear. This also explains why making a simple change renders things correctly.
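To illustrate why a .gitattributes change alone leaves the blob untouched, here is a small sketch that computes the object ID Git assigns to a blob (in the default SHA-1 object format). The ID depends only on the file's bytes, not on its path or on any attributes. The file contents below are made up for the example, and treating this ID as the cache key is an assumption; the comment above only says the caching is "based on the blob".

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the ID Git gives a blob: SHA-1 of 'blob <size>\\0' + content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Hypothetical file contents, not the real type_traits file.
original = b"#pragma once\nnamespace std {}\n"
touched = b"#pragma once \nnamespace std {}\n"  # trailing space added

# Editing .gitattributes does not alter these bytes, so the ID is unchanged.
print(git_blob_id(original) == git_blob_id(original))  # prints True
# Any edit to the file itself, even one space, yields a new blob ID.
print(git_blob_id(original) == git_blob_id(touched))  # prints False
```

Under that assumption, a cache keyed on the blob would keep serving the old (unhighlighted) rendering until the file's own bytes change, which matches the behavior reported in this thread.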

wisn commented 6 years ago

Hi! I'm encountering the same problem. The language stats for my repository (wisn/cppds) don't show up. Is there any way to make it work? Do I need to wait until this problem is fixed by @lildude? Thanks!

lildude commented 6 years ago

Do I need to wait until this problem is fixed by @lildude?

@wisn Nope. You're hitting a different feature of Linguist. As all your files are under /examples, they're being treated as documentation thanks to:

https://github.com/github/linguist/blob/8cd9d744caa7bd3920c0cb8f9ca494ce7d8dc206/lib/linguist/documentation.yml#L16

You'll need to add an override to exclude that directory from documentation detection as per the details at https://github.com/github/linguist#documentation
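As a rough illustration of the path-based documentation detection described above, the sketch below matches repository paths against a couple of anchored regular expressions. The pattern list is an assumed excerpt in the style of documentation.yml, not a copy of the real file, and `looks_like_documentation` is a hypothetical helper rather than part of Linguist.

```python
import re

# Assumed excerpt in the style of Linguist's documentation.yml;
# the real file contains more patterns.
DOCUMENTATION_PATTERNS = [
    r"^[Dd]ocs?/",
    r"^[Ee]xamples/",
]


def looks_like_documentation(path):
    """Return True if a repository-relative path matches any pattern."""
    return any(re.search(pattern, path) for pattern in DOCUMENTATION_PATTERNS)


print(looks_like_documentation("examples/stack.cpp"))  # prints True
print(looks_like_documentation("src/stack.cpp"))  # prints False
```

Files that match are left out of the language stats unless an override is set with the `linguist-documentation` attribute, as described in the README section linked above.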

wisn commented 6 years ago

That's new to me. It works now. Thanks, @lildude!

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had activity in a long time. If this issue is still relevant and should remain open, please reply with a short explanation (e.g. "I have checked the code and this issue is still relevant because ___."). Thank you for your contributions.

stale[bot] commented 5 years ago

This issue has been automatically closed because it has not had activity in a long time. Please feel free to reopen it or create a new issue.

pchaigno commented 5 years ago

@lildude Does this need to be reopened? Or has the issue been fixed?

lildude commented 5 years ago

@lildude Does this need to be reopened? Or has the issue been fixed?

I don't think this should be re-opened as this is not a problem that Linguist has any influence over. The root cause is the way we cache syntax highlighting on GitHub.com. From https://github.com/github/linguist/issues/4251#issuecomment-417735924 above:

A quick update on this: this is definitely due to the way caching currently takes place on GitHub.com. The caching is based on the blob, so if the blob/file doesn't change, the cached content isn't updated and the syntax highlighting doesn't appear. This also explains why making a simple change renders things correctly.

I've opened an issue for this on the GitHub side of things and referenced this issue when I made that comment above (I actually opened the issue two days before 😄).

This is now up to my colleagues on that side of things to resolve. We can't influence this from Linguist.

pchaigno commented 5 years ago

I've opened an issue for this on the GitHub side of things and referenced this issue when I made that comment above

:+1: