github-linguist / linguist

Language Savant. If your repository's language is being reported incorrectly, send us a pull request!
MIT License

Improve detection of Tcl scripts #254

Closed helium-software closed 11 years ago

helium-software commented 12 years ago

In my Tcl-based projects, some Tcl scripts have no .tcl filename extension, namely those that are directly executable (in contrast to other Tcl files that are sourced, etc.). Unfortunately, Linguist seems to recognize them as Perl scripts, although they are easily distinguishable by their shebang line, which reads #!/usr/bin/tclsh8.5 or #!/usr/bin/wish8.5. I would recommend treating any file as a Tcl script if its shebang line contains /tclsh or /wish. (However, I'm not sure whether the current Linguist inspects shebang lines, as this is only mentioned in the README of the fork at mleinart.)
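The rule proposed above (a shebang line containing /tclsh or /wish, optionally followed by a version number) could be sketched roughly as follows. The constant and method names are hypothetical and are not part of Linguist; the sketch deliberately follows the proposal literally, so a line such as `#!/usr/bin/env tclsh` is not matched.

```ruby
# Hypothetical helper, not part of Linguist: decide from the first line of a
# file whether it is run by tclsh or wish (with or without a version suffix).
TCL_SHEBANG = %r{\A#!.*/(?:tclsh|wish)[\d.]*(?:\s|\z)}

def tcl_shebang?(first_line)
  TCL_SHEBANG.match?(first_line.to_s)
end

tcl_shebang?("#!/usr/bin/tclsh8.5")  # => true
tcl_shebang?("#!/usr/bin/perl")      # => false
```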

Another class of files being wrongly classified are the Tcl modules (http://wiki.tcl.tk/12999), which have a mandatory filename ending of .tm. They are normal Tcl scripts, but may contain binary data after an end-of-file (^Z) character. (In this case, they should of course be treated as "generated files".) Is it possible to assign the ending .tm to Tcl?
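As a rough illustration of the .tm case, a check along these lines could flag Tcl modules that carry a payload after the end-of-file character. The method name is hypothetical, and treating any ^Z (byte 0x1A) in the file as the payload marker is a simplifying assumption of this sketch.

```ruby
# Hypothetical helper, not part of Linguist: a .tm file that contains the
# ^Z byte (0x1A) is assumed to carry binary data after the script part.
def tcl_module_with_payload?(path, data)
  File.extname(path) == ".tm" && data.b.include?("\x1A".b)
end

tcl_module_with_payload?("foo-1.0.tm", "package provide foo 1.0\n")  # => false
```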

Thanks, helium-software

EDIT: forgot to mention tclIndex and pkgIndex files; they are special Tcl scripts that instruct Tcl about files available for "auto-loading" and should always be considered as "generated files". (Honestly, I didn't check if this is already implemented, but my tclIndex file shows up as plaintext.)
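A name-based check for these index files might look like the sketch below. The method and constant names are hypothetical; the file names assume the usual spellings, tclIndex and pkgIndex.tcl.

```ruby
# Hypothetical helper, not part of Linguist: treat Tcl auto-loading index
# files as generated, based on their basename only.
GENERATED_TCL_INDEXES = %w[tclIndex pkgIndex.tcl].freeze

def generated_tcl_index?(path)
  GENERATED_TCL_INDEXES.include?(File.basename(path))
end

generated_tcl_index?("lib/mypkg/pkgIndex.tcl")  # => true
generated_tcl_index?("lib/mypkg/main.tcl")      # => false
```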

tnm commented 11 years ago

Hi, I can't really take action on this issue myself, but I'm open to a PR that improves Tcl detection!

helium-software commented 11 years ago

After a ~30min study of the linguist code (involving a grep -r shebang * on the downloaded project tree), I first came to believe that language recognition by shebang line is not really implemented. I found only a fragment in lib/linguist/language.rb that reads:

      # A bit of an elegant hack. If the file is executable but extensionless,
      # append a "magic" extension so it can be classified with other
      # languages that have shebang scripts.
      if File.extname(name).empty? && mode && (mode.to_i(8) & 05) == 05
        name += ".script!"
      end

So (part of) the mechanism needed seems to be there, but a grep -r script! * gave me only a few results, one of them being lib/linguist/samples.json.
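To see what the quoted condition does, it can be tried in isolation. The wrapper method `magic_name` is a hypothetical name used only for this illustration, and the mode strings are assumed to be octal file modes such as "100755". The mask 05 tests the read and execute bits in the last octal digit, i.e. the permissions for "others".

```ruby
# Standalone restatement of the quoted check; `magic_name` is a hypothetical
# name and does not exist in Linguist.
def magic_name(name, mode)
  if File.extname(name).empty? && mode && (mode.to_i(8) & 05) == 05
    name + ".script!"
  else
    name
  end
end

magic_name("build", "100755")      # => "build.script!"
magic_name("build", "100644")      # => "build"
magic_name("build.tcl", "100755")  # => "build.tcl"
```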

This brings me to the most important observation: There are no Tcl samples! – neither in samples.json nor anywhere in the samples/ directory. Would adding Tcl samples to those places perhaps fix my problems?

tnm commented 11 years ago

@helium-software Yes! Please add samples to the samples directory. That will help a lot.

helium-software commented 11 years ago

General question: Linguist appears to be using the samples for a "Bayesian Classifier" machine-learning algorithm (i.e. they are not merely test cases). So, does the choice of samples influence recognition quality? Well, Tcl seems to be very easily distinguishable (no use of round brackets; special keywords like set, proc, lindex, lappend, ...). But I'm not sure whether taking a few things out of my tcl-misc repository is enough for this purpose. Otherwise, I'm going to search for typical examples in the Tcl Wiki.
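Why the choice of samples matters can be seen in a toy naive Bayes classifier over whitespace-separated tokens. This is an illustration only and not Linguist's classifier: the class name, the tokenization, the uniform prior, and the sample strings are all assumptions made for the sketch.

```ruby
# Toy naive Bayes over whitespace-separated tokens. Illustration only; the
# training strings below are made up.
class ToyClassifier
  def initialize
    @counts = Hash.new { |h, lang| h[lang] = Hash.new(0) }
    @totals = Hash.new(0)
  end

  def train(lang, text)
    text.split.each do |tok|
      @counts[lang][tok] += 1
      @totals[lang] += 1
    end
  end

  def classify(text)
    vocab = @counts.values.flat_map(&:keys).uniq.size
    @counts.keys.max_by do |lang|
      text.split.sum do |tok|
        # Add-one smoothing so unseen tokens do not zero out a language.
        Math.log((@counts[lang][tok] + 1.0) / (@totals[lang] + vocab))
      end
    end
  end
end

c = ToyClassifier.new
c.train("Tcl",  'proc greet {name} { puts $name } set x [lindex $list 0]')
c.train("Perl", 'sub greet { my $name = shift; print $name; } my @list = (1);')
c.classify('set y [lindex $items 1]')  # => "Tcl"
```

Tokens such as set and [lindex occur only in the Tcl sample, so they pull the score toward Tcl; with samples that lacked them, the same input would be scored on smoothing alone.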

Anyway, expect a pull request from me in a few days...

EDIT: A reminder to myself: the samples should include scripts with all the interpreter names (wish, tclsh, wish8.4, tclsh8.6, etc.) in the shebang line.

helium-software commented 11 years ago

Hi

I haven't yet taken the time to search for good examples in the Tcl Wiki. Anyway, in my opinion the current handling of shebang lines in Linguist is conceptually insufficient. Having nice Tcl samples around (with all sorts of shebang lines) might or might not be enough for a script to be recognized as Tcl, depending on how the classifier has calculated its probabilistic model. So I'd prefer shebangs to be listed in languages.yml and handled the way extensions are. See my post at Issue https://github.com/github/linguist/issues/264#issuecomment-15121196.
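A deterministic lookup of that kind could be sketched as below. The `interpreters` key, the YAML excerpt, and the method name are assumptions made for this illustration and do not describe the current languages.yml; the sketch looks only at the program named directly on the shebang line, so `#!/usr/bin/env tclsh` would need extra handling.

```ruby
require "yaml"

# Hypothetical languages.yml excerpt: the `interpreters` key is an assumption
# made for this sketch.
LANGUAGES = YAML.safe_load(<<~YML)
  Tcl:
    extensions: [".tcl", ".tm"]
    interpreters: ["tclsh", "wish"]
  Python:
    extensions: [".py"]
    interpreters: ["python"]
YML

def language_for_shebang(first_line)
  line = first_line.to_s
  return nil unless line.start_with?("#!")
  # Take the interpreter's basename and drop a trailing version such as 8.5.
  program = line[2..-1].split.first.to_s
  interpreter = File.basename(program).sub(/[\d.]+\z/, "")
  LANGUAGES.each do |lang, attrs|
    return lang if attrs["interpreters"].include?(interpreter)
  end
  nil
end

language_for_shebang("#!/usr/bin/tclsh8.5")  # => "Tcl"
language_for_shebang("#!/usr/bin/perl")      # => nil
```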

I think I should open this as a new Issue (a feature request), since the feature is relevant not only to the recognition of Tcl (this Issue) but also to that of Python (Issue #264).