Open k----n opened 3 years ago
I recently checked: the linguist library is not very comprehensive (and also had some glaring issues). I also use ctokens and file to derive this list empirically. Linguist may capture some specific types more accurately, however, so adding it as well would be good (though I recall it may be written in Java, eww...)
It looks like linguist is a Ruby gem: https://github.com/github/linguist#usage with a CLI: https://github.com/github/linguist#command-line-usage and it can run against single files:
$ github-linguist grammars.yml
grammars.yml: 884 lines (884 sloc)
type: Text
mime type: text/x-yaml
language: YAML
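If WoC wanted to consume this per-file report, the fields could be pulled out with a few lines of Python. This is only a sketch: it assumes the report keeps the `key: value` shape shown above, and the function name `parse_linguist_output` is made up for illustration.

```python
def parse_linguist_output(text):
    """Parse `key: value` lines from a single-file github-linguist report."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that have no colon at all
        fields[key.strip().lower()] = value.strip()
    return fields


# The report quoted above, used as sample input.
sample = """grammars.yml: 884 lines (884 sloc)
type: Text
mime type: text/x-yaml
language: YAML
"""

info = parse_linguist_output(sample)
print(info["language"], info["type"], info["mime type"])
```

Assuming the gem is installed, the text could come from `subprocess.run(["github-linguist", path], capture_output=True, text=True).stdout` instead of the hard-coded sample.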
In terms of programming language extensions, there appear to be 395 here: https://github.com/github/linguist/blob/master/lib/linguist/languages.yml
Other types of files can be classified too, such as data, markup, and prose, e.g.:
CSV:
  type: data
  ...
  extensions:
  - ".csv"
HTML:
  type: markup
  ...
  aliases:
  - xhtml
  extensions:
  - ".html"
  - ".htm"
  - ".html.hl"
  - ".inc"
  - ".st"
  - ".xht"
  - ".xhtml"
Markdown:
  type: prose
  ...
  aliases:
  - pandoc
  ...
  extensions:
  - ".md"
  - ".markdown"
  - ".mdown"
  - ".mdwn"
  - ".mdx"
  - ".mkd"
  - ".mkdn"
  - ".mkdown"
  - ".ronn"
  - ".workbook"
It also looks like the .coffee.md extension is classified as its own programming language (even though it compiles into JavaScript, https://coffeescript.org/):
Literate CoffeeScript:
  type: programming
  ...
  aliases:
  - litcoffee
  extensions:
  - ".litcoffee"
  - ".coffee.md"
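One wrinkle with compound extensions like ".coffee.md" is that a plain "last suffix" lookup would file them under Markdown. A rough sketch of a longest-extension-first lookup follows. The `LANGUAGES` table is a hand-copied subset of the entries quoted in this thread, and `build_index` and `classify` are hypothetical helper names. It does extension matching only, nothing more.

```python
# Hand-copied subset of the languages.yml entries quoted in this thread;
# a real tool would load the full file instead.
LANGUAGES = {
    "CSV": {"type": "data", "extensions": [".csv"]},
    "HTML": {"type": "markup", "extensions": [".html", ".htm", ".xhtml"]},
    "Markdown": {"type": "prose", "extensions": [".md", ".markdown"]},
    "Literate CoffeeScript": {
        "type": "programming",
        "extensions": [".litcoffee", ".coffee.md"],
    },
}


def build_index(languages):
    """Map each extension to a (language, type) pair; first entry wins."""
    index = {}
    for name, entry in languages.items():
        for ext in entry["extensions"]:
            index.setdefault(ext.lower(), (name, entry["type"]))
    return index


def classify(filename, index):
    """Return (language, type) for the longest matching extension, or None."""
    lowered = filename.lower()
    # Try longer extensions first so ".coffee.md" beats ".md".
    for ext in sorted(index, key=len, reverse=True):
        if lowered.endswith(ext):
            return index[ext]
    return None


INDEX = build_index(LANGUAGES)
print(classify("README.md", INDEX))
print(classify("parser.coffee.md", INDEX))
print(classify("Makefile", INDEX))
```

Here README.md resolves to Markdown, parser.coffee.md resolves to Literate CoffeeScript, and Makefile, which has no extension, stays unclassified.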
I'd argue that CoffeeScript is like TypeScript and should be its own language; they both compile to JavaScript.
I guess right now only programming language types are covered in WoC, but it would be good to cover other file types like data or markup to better characterize what a project actually does.
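To make the idea concrete, here is a small sketch of a per-project profile that tallies files by linguist-style type. The `EXT_TYPE` table and the file list are illustrative assumptions (the table only holds extensions quoted in this thread), and `type_profile` is a hypothetical name, not anything that exists in WoC.

```python
import os
from collections import Counter

# Illustrative extension-to-type table built from entries quoted in this thread.
EXT_TYPE = {
    ".csv": "data",
    ".html": "markup",
    ".md": "prose",
    ".litcoffee": "programming",
}


def type_profile(paths, ext_type=EXT_TYPE):
    """Count files per type; unmapped extensions count as 'unknown'."""
    counts = Counter()
    for path in paths:
        ext = os.path.splitext(path)[1].lower()
        counts[ext_type.get(ext, "unknown")] += 1
    return counts


# Made-up file list standing in for one project's tree.
files = [
    "data/a.csv",
    "data/b.csv",
    "docs/index.html",
    "README.md",
    "src/app.litcoffee",
    "LICENSE",
]
profile = type_profile(files)
print(dict(profile))
```

A project whose profile is mostly data and prose would look quite different from one dominated by programming files, even when both contain the same languages.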
I'd be happy to support other types in WoC.
What are your thoughts? A project for another hackathon?
Some of the problems may be inaccuracies: it misses or misidentifies the type, and parsing of dependencies does not work. Another is that it seems to have a very large number of languages, many of which will not be used much, if at all.
It appears to me that file extensions are used to determine the file types here: https://github.com/ssc-oscar/lookup/blob/5e78bbfe322a83f425c2cf8d7982d2be4e82d79b/prjSummary.perl#L114-L147
Would it be better to use the github linguist library instead? https://github.com/github/linguist