github-linguist / linguist

Language Savant. If your repository's language is being reported incorrectly, send us a pull request!
MIT License

Not all BUILD files are Starlark (Starlark overrides shell) #5008

Open v4hn opened 3 years ago

v4hn commented 3 years ago

Problem Description

I had never heard of Starlark before, but apparently it overrides regular shell scripts in many contexts. Over 50% of our package build instructions, which are mostly bash scripts, are currently recognized as Starlark. Here is one rather clear example, but the search gives you many others.

Adding shebang lines to each file is not an option because it simply bloats all descriptions. We might add overrides in the future because many of the files are ambiguous with Python as well, but seeing a rather unusual variant of Python detected so prominently seems like an issue that should be resolved.

URL of the affected repository:

https://github.com/lunar-linux/moonbase-core

Last modified on:

2020/09/14

Expected language:

Shell

Detected language:

Starlark

smola commented 3 years ago

@v4hn Starlark is the language used by Bazel. Linguist currently classifies BUILD files as Starlark without running any further heuristic.

You can override this with a single .gitattributes file in your repos: https://github.com/github/linguist#overrides

v4hn commented 3 years ago

Thanks for the pointer @smola ! The better solution would be to look into files instead of deciding based on a plain file name without extension. :-)

lildude commented 3 years ago

> The better solution would be to look into files instead of deciding based on a plain file name without extension. :-)

Linguist does look at file content, but only when the filename is associated with more than one language. As has already been pointed out, BUILD is associated with only one language, so Linguist returns that language straight away without looking at the content. To get it to move on to the heuristics and later steps that look at the content, support would need to be added for another language that uses the BUILD filename.
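
To illustrate the behaviour described above, here is a minimal Python sketch of the short-circuit. It is not Linguist's actual code (Linguist is written in Ruby); the table `FILENAME_LANGUAGES`, the function `detect`, and the `heuristic` callback are all hypothetical names chosen for this example.

```python
# Hypothetical filename table. Today BUILD maps to a single language.
FILENAME_LANGUAGES = {
    "BUILD": ["Starlark"],
}


def detect(filename, content, table=FILENAME_LANGUAGES, heuristic=None):
    """Return a language name, or None if the filename is unknown."""
    candidates = table.get(filename, [])
    if len(candidates) == 1:
        # Only one language claims this filename: return it immediately,
        # without ever looking at the content.
        return candidates[0]
    if len(candidates) > 1 and heuristic is not None:
        # Several languages claim the filename: only now is the content used.
        return heuristic(content, candidates)
    return None


# With the single-entry table, a shell script named BUILD is still Starlark.
shell_script = "./configure --prefix=/usr &&\nmake &&\nmake install\n"
single = detect("BUILD", shell_script)

# Once a second language is registered, the content heuristic gets a say.
two_languages = {"BUILD": ["Starlark", "Shell"]}
multi = detect("BUILD", shell_script, table=two_languages,
               heuristic=lambda content, candidates: "Shell")
```

The point of the sketch is that the content is only consulted in the second branch, which is why registering a second language for BUILD is a prerequisite for any heuristic.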

Please feel free to submit a PR to add support and a heuristic. Given these are shell scripts and, from the few I've looked at in your repo, have no shebangs, it might be easier to write the heuristic to identify the Skylark files and fall back to shell.
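
A rough sketch of that idea follows, in Python for illustration only. The patterns are assumptions about what typical Bazel BUILD files look like (a `load("...")` statement, or a rule call whose first argument is `name = "..."`); they are not taken from Linguist, and `STARLARK_HINT` and `classify_build` are hypothetical names.

```python
import re

# Assumed Starlark markers: a load("...") statement, or a call such as
# cc_library( followed by name = "..." as its first argument.
STARLARK_HINT = re.compile(
    r"""^\s*(?:
            load\s*\(\s*["']                           # load("//path:file.bzl", ...)
          | [A-Za-z_]\w*\s*\(\s*name\s*=\s*["']        # some_rule(name = "...")
        )""",
    re.MULTILINE | re.VERBOSE,
)


def classify_build(content):
    """Return "Starlark" if the content looks like Bazel, else "Shell"."""
    if STARLARK_HINT.search(content):
        return "Starlark"
    # Fallback: no Starlark marker found, so treat the file as a shell script.
    return "Shell"


bazel_example = 'cc_library(\n    name = "hello",\n    srcs = ["hello.cc"],\n)\n'
shell_example = "./configure --prefix=/usr &&\nmake &&\nmake install\n"
```

Here `classify_build(bazel_example)` yields `"Starlark"` and `classify_build(shell_example)` yields `"Shell"`. Matching the Starlark side and defaulting to shell avoids having to recognise shebang-less shell scripts, which have few reliable markers of their own.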