ssc-oscar / lookup

A mirror of bitbucket.org/swcs/lookup

Using Linguist to Determine Project Filetypes? #20

Open k----n opened 3 years ago

k----n commented 3 years ago

It appears that file types are determined here by looking at file extensions: https://github.com/ssc-oscar/lookup/blob/5e78bbfe322a83f425c2cf8d7982d2be4e82d79b/prjSummary.perl#L114-L147

Would it be better to use the GitHub Linguist library instead? https://github.com/github/linguist

audrism commented 3 years ago

I recently checked, and the linguist library is not very comprehensive (it also had some glaring issues). I also use ctokens and file to derive this list empirically. Linguist may capture some specific types more accurately, however, so adding it as well would be good (though I recall it may be written in Java, eww...).

k----n commented 3 years ago

It looks like linguist is a Ruby gem: https://github.com/github/linguist#usage. It also has a CLI: https://github.com/github/linguist#command-line-usage, which can run against single files:

$ github-linguist grammars.yml
grammars.yml: 884 lines (884 sloc)
  type:      Text
  mime type: text/x-yaml
  language:  YAML
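
If the CLI were used from a script, its report would need to be turned into fields. The following is a minimal sketch, assuming the output always looks like the sample above (a header line followed by indented `key: value` lines) and that `github-linguist` is on the PATH. The helper names `parse_linguist_output` and `classify_file` are hypothetical, and the exact output layout may differ between linguist versions.

```python
import subprocess


def parse_linguist_output(text):
    """Parse the per-file report, assuming the layout of the sample above."""
    info = {}
    for line in text.splitlines():
        # Only the indented "key: value" lines carry fields; skip the header.
        if not line.startswith(" ") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        info[key.strip()] = value.strip()
    return info


def classify_file(path):
    """Run the CLI on one file (assumes `github-linguist` is installed)."""
    result = subprocess.run(
        ["github-linguist", path],
        capture_output=True, text=True, check=True,
    )
    return parse_linguist_output(result.stdout)


# The sample report quoted in this thread.
sample = """grammars.yml: 884 lines (884 sloc)
  type:      Text
  mime type: text/x-yaml
  language:  YAML
"""
print(parse_linguist_output(sample))
```

Spawning one process per file is likely too slow for a large corpus, so this is only meant for spot checks against the extension-based list.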

In terms of programming language entries, there appear to be 395 here: https://github.com/github/linguist/blob/master/lib/linguist/languages.yml

Other types of files can be classified too, such as data, markup, and prose, e.g.:

CSV:
  type: data
  ...
  extensions:
  - ".csv"

HTML:
  type: markup
  ...
  aliases:
  - xhtml
  extensions:
  - ".html"
  - ".htm"
  - ".html.hl"
  - ".inc"
  - ".st"
  - ".xht"
  - ".xhtml"

Markdown:
  type: prose
  ...
  aliases:
  - pandoc
  ...
  extensions:
  - ".md"
  - ".markdown"
  - ".mdown"
  - ".mdwn"
  - ".mdx"
  - ".mkd"
  - ".mkdn"
  - ".mkdown"
  - ".ronn"
  - ".workbook"

It also looks like .coffee.md is classified as its own programming language (even though it compiles into JavaScript, https://coffeescript.org/):

Literate CoffeeScript:
  type: programming
  ...
  aliases:
  - litcoffee
  extensions:
  - ".litcoffee"
  - ".coffee.md"

I'd argue that CoffeeScript is like TypeScript and should be its own language: they both compile to JavaScript.

I guess right now only programming language types are covered in WoC, but it would be good to cover other file types like data or markup to better characterize what a project actually does.
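
To make the extension lists above concrete, here is a rough sketch of an extension lookup built from them. The `LANGUAGES` dict is a hand-copied subset of the excerpts in this comment, not the real languages.yml, and `build_index` and `lookup` are hypothetical helpers. Longest-suffix matching is an assumption made here so that `.coffee.md` wins over `.md`; it is not necessarily how linguist itself resolves such cases.

```python
# Hand-copied subset of the languages.yml excerpts quoted above.
LANGUAGES = {
    "CSV": {"type": "data", "extensions": [".csv"]},
    "HTML": {
        "type": "markup",
        "extensions": [".html", ".htm", ".html.hl", ".inc", ".st", ".xht", ".xhtml"],
    },
    "Markdown": {
        "type": "prose",
        "extensions": [".md", ".markdown", ".mdown", ".mdwn", ".mdx",
                       ".mkd", ".mkdn", ".mkdown", ".ronn", ".workbook"],
    },
    "Literate CoffeeScript": {
        "type": "programming",
        "extensions": [".litcoffee", ".coffee.md"],
    },
}


def build_index(languages):
    """Map each extension to (language, type); a later entry overwrites an earlier one."""
    index = {}
    for name, spec in languages.items():
        for ext in spec["extensions"]:
            index[ext.lower()] = (name, spec["type"])
    return index


def lookup(filename, index):
    """Return (language, type) for the longest matching extension, or None."""
    name = filename.lower()
    for ext in sorted(index, key=len, reverse=True):
        if name.endswith(ext):
            return index[ext]
    return None


index = build_index(LANGUAGES)
print(lookup("README.md", index))
print(lookup("parser.coffee.md", index))
print(lookup("Makefile", index))
```

An extension listed under more than one language would need extra disambiguation, which this sketch does not attempt.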

audrism commented 3 years ago

I'd be happy to support other types in WoC.

  1. Defining what counts as a language is a bit of a problem, though. TypeScript appears not to be separated.
  2. Determining what language a file contains is another problem. For example, linguist has its own set of heuristics, different from file or ctags.
  3. Why we want languages is also important. One of the goals is to parse dependencies; another is to assess popularity.
  4. Efficiency is important: while it can be deployed as a container, it may take a while to go over 8B blobs.
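
One possible way to approach point 4, sketched under assumptions: resolve blobs by extension first and fall back to a slower content-based classifier only when the extension is unknown. The `known` table, the `classify_blobs` function, and the `slow_classifier` callback (standing in for linguist, file, or ctags) are all hypothetical and not part of WoC.

```python
import os
from collections import Counter


def classify_blobs(paths, known, slow_classifier):
    """Count languages, calling the slow classifier only for unknown extensions."""
    counts = Counter()
    slow_calls = 0
    for path in paths:
        ext = os.path.splitext(path)[1].lower()
        lang = known.get(ext)
        if lang is None:
            # Fallback path: in practice this would inspect the blob content.
            lang = slow_classifier(path)
            slow_calls += 1
        counts[lang] += 1
    return counts, slow_calls


# Hypothetical extension table and a stub in place of a real classifier.
known = {".py": "Python", ".md": "Markdown", ".csv": "CSV"}
paths = ["a.py", "b.PY", "notes.md", "data.csv", "Makefile", "x.zz"]
counts, slow_calls = classify_blobs(paths, known, lambda p: "Unknown")
print(dict(counts), slow_calls)
```

The resulting counts would also serve the popularity goal in point 3, while the number of slow calls gives a rough sense of how much of the corpus the extension table leaves uncovered.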

What are your thoughts? A project for another hackathon?

Some of the problems may be inaccuracies: if the type is missed or misidentified, parsing of dependencies does not work. Another is that it seems to have a very large number of languages, many of which will not be used much, if at all.