Using Linguist to Determine Project Filetypes?

k----n commented 3 years ago

It appears to me that file types are determined by looking at file extensions here: https://github.com/ssc-oscar/lookup/blob/5e78bbfe322a83f425c2cf8d7982d2be4e82d79b/prjSummary.perl#L114-L147

Would it be better to use the GitHub Linguist library instead? https://github.com/github/linguist

audrism commented 3 years ago

I recently checked: the linguist library is not very comprehensive (and also had some glaring issues). I also use ctokens and file to derive this list empirically. Linguist may capture some specific types more accurately, however, so adding it as well would be good (though I recall it may be written in java eww..)

k----n commented 3 years ago

It looks like linguist is a Ruby gem: https://github.com/github/linguist#usage It has a CLI: https://github.com/github/linguist#command-line-usage and it can run against single files:

$ github-linguist grammars.yml
grammars.yml: 884 lines (884 sloc)
  type:      Text
  mime type: text/x-yaml
  language:  YAML
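
If that per-file report were consumed from a script, the `key: value` lines could be parsed roughly as below. This is only a sketch: it assumes the output keeps exactly the shape shown above, and `parse_linguist_report` is a hypothetical helper name, not something linguist provides.

```python
def parse_linguist_report(text):
    """Parse the per-file report shown above into a dict.

    Assumes the first line is '<path>: <summary>' and the following
    indented lines are 'key: value' pairs; other layouts are not handled.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    path, _, summary = lines[0].partition(": ")
    report = {"path": path, "summary": summary}
    for line in lines[1:]:
        # Split on the first colon only, so values may contain colons.
        key, _, value = line.strip().partition(":")
        report[key.strip()] = value.strip()
    return report


# The sample output quoted above, used as test input.
SAMPLE = """\
grammars.yml: 884 lines (884 sloc)
  type:      Text
  mime type: text/x-yaml
  language:  YAML
"""

report = parse_linguist_report(SAMPLE)
print(report["language"], report["type"])
```

For the sample above, `report["language"]` is `YAML` and `report["mime type"]` is `text/x-yaml`.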

In terms of programming language extensions, there appear to be 395 here: https://github.com/github/linguist/blob/master/lib/linguist/languages.yml

Other types of files can be classified too, such as data, markup, and prose, e.g.

CSV:
  type: data
  ...
  extensions:
  - ".csv"

HTML:
  type: markup
  ...
  aliases:
  - xhtml
  extensions:
  - ".html"
  - ".htm"
  - ".html.hl"
  - ".inc"
  - ".st"
  - ".xht"
  - ".xhtml"

Markdown:
  type: prose
  ...
  aliases:
  - pandoc
  ...
  extensions:
  - ".md"
  - ".markdown"
  - ".mdown"
  - ".mdwn"
  - ".mdx"
  - ".mkd"
  - ".mkdn"
  - ".mkdown"
  - ".ronn"
  - ".workbook"

It also looks like coffee.md is classified as its own programming language (even though it compiles into JavaScript https://coffeescript.org/):

Literate CoffeeScript:
  type: programming
  ...
  aliases:
  - litcoffee
  extensions:
  - ".litcoffee"
  - ".coffee.md"

I'd argue that CoffeeScript is like TypeScript and should be its own language -- they both compile to JavaScript.

I guess right now only programming language types are covered in WoC, but it would be good to cover other file types like data or markup to better characterize what a project actually does.
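
To make that concrete, here is a rough sketch of how a project's files could be counted by type using only the entries quoted above. The table is hand-copied from those excerpts (name, type, and extensions only), and `classify` and `summarize` are hypothetical helper names. It picks the longest matching extension so that `.coffee.md` wins over `.md`. In the full languages.yml an extension may be listed under more than one language, which a plain dict like this would silently overwrite, so treat it as an illustration rather than a drop-in replacement.

```python
from collections import Counter

# Entries copied from the languages.yml excerpts quoted above
# (only name, type, and extensions; everything else is omitted).
LANGUAGES = {
    "CSV": ("data", [".csv"]),
    "HTML": ("markup", [".html", ".htm", ".html.hl", ".inc", ".st",
                        ".xht", ".xhtml"]),
    "Markdown": ("prose", [".md", ".markdown", ".mdown", ".mdwn", ".mdx",
                           ".mkd", ".mkdn", ".mkdown", ".ronn", ".workbook"]),
    "Literate CoffeeScript": ("programming", [".litcoffee", ".coffee.md"]),
}

# Flatten to extension -> (language, type).
BY_EXTENSION = {
    ext: (name, kind)
    for name, (kind, extensions) in LANGUAGES.items()
    for ext in extensions
}


def classify(filename):
    """Return (language, type) for the longest matching extension, or None."""
    lower = filename.lower()
    matches = [ext for ext in BY_EXTENSION if lower.endswith(ext)]
    if not matches:
        return None
    return BY_EXTENSION[max(matches, key=len)]


def summarize(filenames):
    """Count files per type; unmatched files are counted as 'unknown'."""
    counts = Counter()
    for name in filenames:
        hit = classify(name)
        counts[hit[1] if hit else "unknown"] += 1
    return counts


files = ["README.md", "data.csv", "index.html", "build.coffee.md", "Makefile"]
print(summarize(files))
```

With these five made-up file names, each of data, markup, prose, programming, and unknown is counted once, and `build.coffee.md` is reported as Literate CoffeeScript rather than Markdown.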

audrism commented 3 years ago

I'd be happy to support other types in WoC.

How to define what counts as a language is a bit of a problem, though. TypeScript appears not to be separated.
How to determine what language a file contains is another problem. For example, linguist has its own set of heuristics, different from file or ctags.
Why we want languages is also important. One goal is to parse dependencies, another is to assess popularity.
Efficiency is important: linguist can be deployed as a container, but it may take a while to go over 8B blobs.

One way to proceed is to take the file extensions from https://github.com/github/linguist/blob/master/lib/linguist/languages.yml and use them for WoC. If Linguist is supported in the future, it can be used as a standard.
Another is to customize it to run the way I run ctags and measure the time. ctags takes more than a month to run using 64 very fast CPUs. ctags is written in C, so it is hundreds of times faster, but it does much more processing than just language recognition.
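
For a sense of scale, the back-of-the-envelope arithmetic behind that timing concern can be sketched as below. The 30 days stands in for "more than a month", perfect parallelism is assumed, and the 0.1 s per blob figure is a made-up placeholder, not a measurement of linguist.

```python
SECONDS_PER_DAY = 86_400


def cpu_seconds_per_blob(wall_days, n_cpus, n_blobs):
    """Average CPU-seconds spent per blob for a run of the given size."""
    return wall_days * SECONDS_PER_DAY * n_cpus / n_blobs


def estimate_wall_days(seconds_per_blob, n_cpus, n_blobs):
    """Wall-clock days to process n_blobs, assuming perfect parallelism."""
    return seconds_per_blob * n_blobs / n_cpus / SECONDS_PER_DAY


# ctags figures from above; 30 days is an assumption for "more than a month".
ctags_cost = cpu_seconds_per_blob(wall_days=30, n_cpus=64, n_blobs=8e9)
print(f"ctags: ~{ctags_cost:.3f} CPU-seconds per blob")

# Hypothetical tool needing 0.1 s per blob (placeholder, not a benchmark).
slow_days = estimate_wall_days(seconds_per_blob=0.1, n_cpus=64, n_blobs=8e9)
print(f"0.1 s/blob on 64 CPUs: ~{slow_days:.0f} days")
```

Under those assumptions ctags comes out at about 0.021 CPU-seconds per blob, and anything costing 0.1 s per blob would need roughly 145 days on the same 64 CPUs, so a real per-blob measurement would be the first thing to get.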

What are your thoughts? A project for another hackathon?

Some of the problems may be inaccuracies that miss or misidentify the type, so that parsing of dependencies does not work. Another is that it seems to have a very large number of languages, many of which will not be used much, if at all.

Using Linguist to Determine Project Filetypes? #20