I started working on this because I don't feel the current documentation fits the users' expectations. They arrive here with a specific problem and must devise a solution by reading a complete description of how Linguist works (overview, overrides, troubleshooting, etc.). I'm guessing this is part of the reason why so many users don't read the documentation before filling in the issue template.
An FAQ seems more fitting. We can address the most common issues with concise explanations and adapt it as users file new issues. I think it should be extensive without aiming to be comprehensive. I'm imagining a single Markdown document with a summary of questions at the top and a 1-4 line answer to each.
I'm opening this issue to compile a complete list of questions before we try to answer them. I established an initial list based on 1) past, closed issues and 2) questions found on stackoverflow.com. Some of these are "incorrect" (e.g., How does Linguist highlight files?) because the idea is to match the questions users actually ask and then explain why, in some cases, the premise is incorrect and things don't work as they expected.
Without further ado, here's my initial list of questions/entries:
How can I change the language of my repository?
The first language from the language statistics is sometimes shown next to the repository's name. If you believe the language statistics for your repository are incorrect, please see [The language statistics in my repository are wrong](). If you believe we missed a language or one of its extensions, please consider submitting a pull request if said extensions meet [the requirements](). If everything looks correct, but you'd still like another language to appear first, please consider using Linguist overrides.
The language detected for some files in my repository is incorrect.
Using Linguist overrides, you can tell Linguist what's wrong.
It's also likely you can help us improve things. If you believe we don't support a language we should, and if that language is [widespread enough](), you can send us a pull request. It is also possible to improve classification for existing languages by adding new sample files or by adding/improving heuristic rules.
I found a syntax highlighting error.
Linguist detects the language of a file, but the actual syntax highlighting is powered by a set of language grammars which are included in this project as a set of submodules as listed here.
If you experience an issue with the syntax highlighting on GitHub, please report the issue to the upstream grammar repository, not here. Grammars are updated automatically with every new release.
When I click on a language in the statistics bar, no corresponding files are found in the search page.
This is a known bug that (unfortunately) doesn't fall under the purview of Linguist. Please contact GitHub support.
No language is detected in my repository.
Only [programming and markup languages]() are counted in the statistics. Vendored, documentation, and generated files are also excluded. Please also consider that Linguist runs as a low priority background job, so it may take some time for the languages to appear after you push to the repository.
Considering all this, if you still believe the repository should display a language, you can try to run Linguist locally on your repository or you can open an issue.
What are markup or programming languages?
In Linguist, each language has a type. The types are documented in languages.yml.
How does Linguist detect the language of a file?
Linguist relies on the following strategies, in order, and returns the language as soon as it finds a perfect match (that is, as soon as a strategy returns a single language).
Known filename. Some filenames are associated with specific languages (think Makefile).
Look for a shebang. A file with a #!/bin/bash shebang will be classified as Shell.
Known file extension. Languages have a set of extensions associated with them. There are, however, lots of conflicts with this strategy. The conflicting results (think C++, C and Objective-C for .h) are refined by the subsequent strategies.
A set of heuristic rules. They usually rely on regular expressions over the content of files to try and identify the language (e.g., ^[^#]+:- for Prolog).
A naive Bayesian classifier trained on sample files. Last strategy, lowest accuracy. The Bayesian classifier always takes a subset of languages as input; it is not meant to classify among all languages. The best match found by the classifier is returned.
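The cascade above can be sketched roughly as follows. This is an illustration only, not Linguist's actual implementation (Linguist is written in Ruby). The lookup tables, the function names, and the stand-in for the classifier are all hypothetical; the real data lives in languages.yml and in the heuristic rules.

```python
import os
import re

# Hypothetical lookup tables for illustration; not copied from languages.yml.
FILENAMES = {"Makefile": ["Makefile"]}
INTERPRETERS = {"bash": ["Shell"], "sh": ["Shell"]}
EXTENSIONS = {".h": ["C", "C++", "Objective-C"], ".pl": ["Perl", "Prolog"]}
# Hypothetical heuristic rules, keyed by extension: (language, pattern).
# re.MULTILINE lets ^ match at the start of any line in the file.
HEURISTICS = {".pl": [("Prolog", re.compile(r"^[^#]+:-", re.MULTILINE))]}


def by_filename(name, content, candidates):
    return FILENAMES.get(os.path.basename(name), [])


def by_shebang(name, content, candidates):
    first_line = content.split("\n", 1)[0]
    match = re.match(r"#!\s*(\S+)", first_line)
    if not match:
        return []
    # "/bin/bash" -> "bash"
    interpreter = match.group(1).rsplit("/", 1)[-1]
    return INTERPRETERS.get(interpreter, [])


def by_extension(name, content, candidates):
    return EXTENSIONS.get(os.path.splitext(name)[1], [])


def by_heuristics(name, content, candidates):
    ext = os.path.splitext(name)[1]
    for language, pattern in HEURISTICS.get(ext, []):
        if language in candidates and pattern.search(content):
            return [language]
    return []


def by_classifier(name, content, candidates):
    # Placeholder only: a real Bayesian classifier would score each candidate
    # against sample files. Here we just pick the first remaining candidate.
    return candidates[:1]


STRATEGIES = [by_filename, by_shebang, by_extension, by_heuristics, by_classifier]


def detect(name, content):
    """Run the strategies in order; stop as soon as one returns one language."""
    candidates = []
    for strategy in STRATEGIES:
        result = strategy(name, content, candidates)
        if len(result) == 1:
            return result[0]
        if result:
            # Several languages matched: let later strategies refine them.
            candidates = result
    return None
```

With these toy tables, `detect("run", "#!/bin/bash\necho hi\n")` stops at the shebang strategy, while a `.pl` file containing a Prolog clause falls through to the heuristic rule. Note that the classifier only ever sees the candidates left over from earlier strategies, which mirrors the point that it is not meant to classify among all languages.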
When will my pull request for Linguist take effect on github.com?
Changes to Linguist take effect on github.com with each new release, usually once a month.
When will changes in a syntax highlighting grammar take effect on github.com?
Changes to any syntax highlighting grammar will take effect with the next release of Linguist, usually once a month.
I changed a syntax highlighting grammar. Do I need to open a pull request on Linguist for it to take effect on github.com?
No. Grammars are updated automatically with every new release.
The language statistics in my repository are wrong.
The percentages in the statistics bar are calculated based on the total bytes of code for each [programming or markup language](), after excluding vendored, generated, and documentation files. Considering this, if you believe the statistics are incorrect, it is likely that some files were incorrectly classified. Please read [The language detected for some files in my repository is incorrect]() to fix it.
If you believe Linguist should already recognize these files as generated, you can submit a pull request to improve our identification of generated files.
What are the requirements to associate a new extension to a language?
We prefer that each new file extension be in use in hundreds of repositories before we support it in Linguist. In particular, we are wary of adding new languages for common extensions as they may conflict with other languages and cause misclassifications.
What are the requirements to add support for a new language?
We prefer that each new file extension be in use in hundreds of repositories before we support it in Linguist. In particular, we are wary of adding new languages with common extensions as they may conflict with other languages and cause misclassifications.
How can I search for repositories that are using the language I want to add to Linguist?
As you may have noticed, GitHub doesn't offer a way to search for repositories containing files with a particular file extension. Instead, we recommend you use the Code search. We will then use the Harvester tool to deduce the number of repositories from the number of files. You may need to add NOT randomstring to your search query for GitHub to allow you to search files only by their extension. If several languages use that extension, you will need to add keywords to your search query to obtain a conservative estimate of the number of files for your particular language.
Why aren't my files syntax highlighted?
There can be three reasons. Either the language for that file is not supported by Linguist, or Linguist doesn't have a grammar to highlight files from that language, or Linguist was unable to properly detect the language.
If [the language is not supported by Linguist](), and you believe it meets [the requirements for support](), please consider submitting a pull request.
To check if Linguist has a grammar for the language, you can check the list of grammars. If it doesn't and you know a Sublime Text, Atom, or TextMate grammar that would work, please consider submitting a pull request.
If Linguist supports the language and it has a grammar, the lack of syntax highlighting is probably the result of a misclassification. Please read [The language detected for some files in my repository is incorrect]() to fix it.
How can I check if Linguist supports a given language?
Supported languages are listed in languages.yml, with the associated extensions, shebangs, and filenames.
How do I disable syntax highlighting for a file?
You can disable syntax highlighting by telling Linguist the file is a Text file:
How are the language statistics computed?
The percentages in the statistics bar are calculated based on the total bytes of code for each [programming or markup language](), after excluding vendored, generated, and documentation files.
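As a rough sketch of that computation, assuming each file has already been classified: the record fields (`language`, `type`, `size`, and the three boolean flags) and the function name are hypothetical and only serve the example; they are not Linguist's data model.

```python
from collections import defaultdict

# Only these language types count towards the statistics.
COUNTED_TYPES = {"programming", "markup"}


def language_percentages(files):
    """Return {language: percentage of counted bytes}, rounded to one decimal."""
    totals = defaultdict(int)
    for f in files:
        if f["type"] not in COUNTED_TYPES:
            continue
        if f["vendored"] or f["generated"] or f["documentation"]:
            continue
        totals[f["language"]] += f["size"]

    grand_total = sum(totals.values())
    if grand_total == 0:
        # Nothing countable: no language is shown for the repository.
        return {}
    return {lang: round(100 * size / grand_total, 1) for lang, size in totals.items()}


def make_file(language, type_, size, vendored=False, generated=False, documentation=False):
    # Hypothetical helper that builds one already-classified file record.
    return {
        "language": language,
        "type": type_,
        "size": size,
        "vendored": vendored,
        "generated": generated,
        "documentation": documentation,
    }


example = [
    make_file("Ruby", "programming", 3000),
    make_file("JavaScript", "programming", 1000),
    make_file("JavaScript", "programming", 9000, vendored=True),  # excluded
    make_file("Markdown", "prose", 500),  # excluded: not programming or markup
]
print(language_percentages(example))  # {'Ruby': 75.0, 'JavaScript': 25.0}
```

The vendored JavaScript file is nine times larger than anything else in the example, yet it does not move the percentages at all, which is usually what surprises people.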
Why are some of my files not counted in language statistics?
Only files with a [markup or programming language]() are counted in statistics. In addition, generated, documentation, and vendored files are excluded from statistics.
What keywords can I use to highlight a code snippet in Markdown?
For each language in the languages.yml file, you can use as specifiers:
the language name;
any of the language aliases;
any of the language interpreters;
any of the file extensions, with or without a leading dot.
Whitespace must be replaced by dashes (e.g., emacs-lisp is one specifier for Emacs Lisp). Languages with a tm_scope: none entry don't have a grammar defined and won't be highlighted on github.com.
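The rules above could be sketched like this, for one language entry. The dictionary mirrors the kind of fields found in languages.yml, but its values are illustrative rather than copied from the file, and `fence_specifiers` is a hypothetical helper. Lowercasing is an assumption inferred from the emacs-lisp example.

```python
import re


def fence_specifiers(entry):
    """Collect the specifiers usable after the opening code fence for one language."""
    raw = [entry["name"]]
    raw += entry.get("aliases", [])
    raw += entry.get("interpreters", [])
    for ext in entry.get("extensions", []):
        raw.append(ext)              # with the leading dot
        raw.append(ext.lstrip("."))  # without the leading dot

    # Replace whitespace with dashes; lowercasing is an assumption.
    return {re.sub(r"\s+", "-", spec.strip()).lower() for spec in raw}


# Illustrative entry only; check languages.yml for the real values.
emacs_lisp = {
    "name": "Emacs Lisp",
    "aliases": ["elisp", "emacs"],
    "interpreters": [],
    "extensions": [".el", ".emacs"],
}

print(sorted(fence_specifiers(emacs_lisp)))
```

For this entry the set contains `emacs-lisp`, `elisp`, `emacs`, `.el`, `el`, and `.emacs`.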
How can I trigger an update of language detection in my repository?
Linguist runs as a low priority background job. It may therefore take a while, particularly during busy periods, for your language statistics bar to reflect your changes. To trigger a new analysis, you can push to your repository.
How does Linguist highlight files?
Linguist detects the language of a file, but the actual syntax-highlighting is powered by a set of language grammars which are included in this project as a set of submodules as listed here.
Can I define my own syntax highlighter for files in my repository?
If you wrote a syntax highlighter (Sublime Text, Atom, or TextMate grammar) for a language Linguist already supports and it performs better than the syntax highlighter Linguist currently uses, please submit a pull request! If Linguist doesn't support said language, you will first need to add support for it (please see [the requirements]() first). GitHub doesn't currently offer a way to define a custom syntax highlighter for unsupported languages.
Any additions? Questions you think should be broken down into several questions? Questions you don't think are frequent enough to warrant an entry?
EDIT: I also plan to write down new issue templates for the few very common cases once we've established the list of possible issues/questions.
Template removed as it doesn't apply.