github-linguist / linguist

Language Savant. If your repository's language is being reported incorrectly, send us a pull request!
MIT License

Erlang escript bundle is treated as JavaScript #236

Closed ztmr closed 11 years ago

ztmr commented 12 years ago

An escript bundle is a compressed Erlang script. Linguist incorrectly detects it as JavaScript:

$ file ./rebar
./rebar: a escript script text executable
$ linguist ./rebar
./rebar: 0 lines (0 sloc)
  type:      Binary
  mime type: text/plain
  language:  JavaScript
$

...so many Erlang projects that ship with the rebar build tool script may be detected as JavaScript projects although they are pure Erlang!

asabil commented 11 years ago

I am experiencing the same issue, but I also noticed recently that it will mis-detect Erlang as Perl...

stuartpb commented 11 years ago

See https://github.com/basho/luke for an example of this behavior.

ztmr commented 11 years ago

Hm, it's strange that the problem still exists on some repositories, because a few weeks ago it was fixed, at least on the repository (http://github.com/ztmr/egtm) where I first discovered the issue. That's why I thought somebody had silently fixed it in the meantime...

stuartpb commented 11 years ago

Well, if the file doesn't have an extension, Linguist classifies it using a Bayesian analysis based on the tokens in lib/linguist/samples.json, so which language it (mistakenly) decides the blob is depends on the frequency of the tokens in that particular file.
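To illustrate the idea, here is a toy sketch in Python. It is not Linguist's actual Ruby code, and the sample counts are made up rather than taken from samples.json. A naive Bayes classifier over tokens could look roughly like this:

```python
import math
from collections import Counter

# Hypothetical per-language token counts (NOT the real contents of samples.json).
SAMPLES = {
    "JavaScript": Counter({"function": 5, "var": 4, "{": 6, "}": 6}),
    "Erlang": Counter({"-module": 3, "->": 6, "fun": 2, "end": 4}),
}


def classify(tokens, samples=SAMPLES):
    """Return (language, log_score) pairs, best first, via naive Bayes."""
    # Shared vocabulary, used for add-one (Laplace) smoothing.
    vocab = set()
    for counts in samples.values():
        vocab.update(counts)

    scores = {}
    for lang, counts in samples.items():
        total = sum(counts.values())
        # Uniform prior over languages, so only token likelihoods matter.
        score = 0.0
        for tok in tokens:
            score += math.log((counts[tok] + 1) / (total + len(vocab)))
        scores[lang] = score

    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

With these made-up counts, classify(["->", "end", "fun"]) ranks Erlang first. A blob whose tokens appear in none of the samples is ranked purely by the smoothing terms, which is one way a guess can end up fairly arbitrary.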

There are a few things that can be done to fix this bug and others of its type:

  1. Add rebar to lib/linguist/vendor.yml - Proposed as #443.
  2. Add a "SHEBANG#!escript" token to lib/linguist/samples.json. This should fix any future instances of other Escript bundles being recognized as something else. I'm not familiar with Erlang or Linguist, so I don't know how much you want escript files with extensionless names other than rebar recognized in a project, or if they should always be treated as vendor, and if so how that would be done in Linguist.
  3. Don't count any file whose type is determined with less than, say, 90% certainty in the language breakdown. This would fix any other misclassified files Linguist doesn't recognize.
  4. Maybe use file(1) or one of the at least 4 gems that bind to magic(4) before resorting to Bayesian classification(?!).
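Options 2 and 3 could be combined roughly as follows. This is a hypothetical Python sketch, not Linguist code: the INTERPRETERS table, the function names, and the 0.9 default threshold are assumptions for illustration, and the classifier's guess and confidence are simply passed in as arguments.

```python
# Hypothetical interpreter-to-language table; not Linguist's real data.
INTERPRETERS = {"escript": "Erlang", "perl": "Perl", "node": "JavaScript"}


def language_from_shebang(first_line):
    """Return a language for a '#!' line, or None if nothing is recognised."""
    if not first_line.startswith("#!"):
        return None
    parts = first_line[2:].split()
    if not parts:
        return None
    # Keep only the program name, e.g. '/usr/bin/perl' -> 'perl'.
    name = parts[0].rsplit("/", 1)[-1]
    # For '#!/usr/bin/env escript', the interpreter is the next word.
    if name == "env" and len(parts) > 1:
        name = parts[1]
    return INTERPRETERS.get(name)


def detect(first_line, classifier_guess, confidence, threshold=0.9):
    """Prefer the shebang; otherwise accept the guess only above the threshold."""
    lang = language_from_shebang(first_line)
    if lang is not None:
        return lang
    # Below the threshold the file is left unclassified (None), so it would
    # not count toward the language breakdown.
    return classifier_guess if confidence >= threshold else None
```

For example, detect("#!/usr/bin/env escript", "JavaScript", 0.55) returns "Erlang" regardless of the classifier's guess, while detect("", "JavaScript", 0.55) returns None.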

tnm commented 11 years ago

This is fixed with #443, and the fix will be out on the website soon.