Extract programming language and files information from repository

broeder-j commented 2 years ago

Since the harvester basically requires a git clone, it could gather information about the programming language and other statistics of the files and repository from the local repo.

@proycon Do you know of any harvesting tool that already extracts these things nicely?

https://livablesoftware.com/tools-mine-analyze-github-git-software-data/

proycon commented 2 years ago

I agree, getting programming languages from the git source would be a good feature. Surely there must be tools available that we could use out-of-the-box to do this, but I'm not aware of them yet. The GitHub API does provide this information, but I'd rather get it at the lower git level.

broeder-j commented 2 years ago

Here is what GitHub and GitLab use for language detection: https://github.com/github/linguist

For a single file one gets:

$ github-linguist grammars.yml
grammars.yml: 884 lines (884 sloc)
  type:      Text
  mime type: text/x-yaml
  language:  YAML

For a full repo, the stats are:

$ github-linguist jekyll

70.64%  709959     Ruby
23.04%  231555     Gherkin
3.80%   38178      JavaScript
1.19%   11943      HTML
0.79%   7900       Shell
0.23%   2279       Dockerfile
0.13%   1344       Earthly
0.10%   1019       CSS
0.06%   606        SCSS
0.02%   234        CoffeeScript
0.01%   90         Hack
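
To make use of this in a harvester, the repository summary would have to be turned into structured data. A minimal sketch in Python, assuming the three-column layout shown above (percentage, size, language) and assuming the middle column is a byte count; the function name parse_linguist_stats is hypothetical and not part of linguist or any codemeta tool:

```python
import re

# Matches rows such as "70.64%  709959     Ruby" from the summary above.
LINE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)%\s+(\d+)\s+(.+?)\s*$")


def parse_linguist_stats(output):
    """Return a list of (language, percentage, size) tuples.

    Hypothetical helper: it only assumes the plain-text layout quoted
    in this thread and ignores every line that is not a stats row.
    """
    stats = []
    for line in output.splitlines():
        match = LINE_RE.match(line)
        if match:
            percent, size, language = match.groups()
            stats.append((language, float(percent), int(size)))
    return stats


sample = """\
70.64%  709959     Ruby
23.04%  231555     Gherkin
0.01%   90         Hack
"""

for language, percent, size in parse_linguist_stats(sample):
    print(language, percent, size)
```

Other linguist versions may format this output differently, so the pattern would need checking against the version actually installed.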

In my view it would be OK as a codemeta-harvester dependency, though not for codemetapy. Maybe one could also recycle something from what they use there.

broeder-j commented 2 years ago

For me it is still unclear what should go where for codemetapy, codemetar and codemeta-harvester. Ideally I would imagine a structure like this:

Use codemeta-harvester, which relies on the others to get more detailed information in case it is an R or Python project. So anything that is language-independent should be put into codemeta-harvester, or not?

But on the other hand one wants to use these on their own, and the resulting codemeta.json should be complete, or not?

I am asking, because I could implement this here to fill the codemeta keys:

programmingLanguage, fileFormat, fileSize
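
As a rough illustration of what filling these keys from a local clone might involve, here is a minimal Python sketch. The function names summarize_files and to_codemeta_fragment are hypothetical, using guessed MIME types as a stand-in for fileFormat is an assumption, and how the values should finally be serialized in codemeta.json is left open:

```python
import mimetypes
import os


def summarize_files(repo_path):
    """Return (total size in bytes, sorted list of guessed MIME types).

    Hypothetical helper: walks a local clone and skips git's own
    metadata directory as well as symbolic links.
    """
    total_size = 0
    formats = set()
    for root, dirs, files in os.walk(repo_path):
        # Prune .git in place so os.walk does not descend into it.
        dirs[:] = [d for d in dirs if d != ".git"]
        for name in files:
            path = os.path.join(root, name)
            if os.path.islink(path):
                continue
            total_size += os.path.getsize(path)
            mime, _encoding = mimetypes.guess_type(name)
            if mime:
                formats.add(mime)
    return total_size, sorted(formats)


def to_codemeta_fragment(languages, total_size, formats):
    """Build a plain dict with the three keys named above.

    The value shapes (list of names, list of MIME types, integer byte
    count) are assumptions made for illustration only.
    """
    return {
        "programmingLanguage": languages,
        "fileFormat": formats,
        "fileSize": total_size,
    }
```

The list of languages would come from a separate detection step, such as the linguist output shown earlier in this thread.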

proycon commented 2 years ago

Here is what GitHub and GitLab use for language detection: https://github.com/github/linguist

Great, that looks like a good option to integrate. This would be called from codemeta-harvester indeed.

For me it is still unclear what should go where for codemetapy, codemetar and codemeta-harvester.

I can imagine the confusion, yes: codemeta-harvester is a more high-level script that invokes various other tools to extract things and do conversions to codemeta; it's basically the glue that ties multiple components together. It relies heavily on codemetapy to do most of the actual work when it comes to parsing and reconciling multiple codemeta graphs. Codemeta-harvester does most of the preparatory work before delegating control to codemetapy. Codemetapy itself does not rely on other tools (aside from some libraries like rdflib) and implements all of the lower-level logic to parse and serialize codemeta from/to different formats. Codemetapy is not limited to Python but also handles various other formats from the codemeta crosswalks (NPM package.json, Java pom.xml, etc.), as well as some that are not even in the codemeta crosswalk (e.g. GitHub API, GitLab API, AUTHORS/CONTRIBUTORS, etc.). Whilst these could also be conceived as separate tools, they share a lot of code, so it made most sense to integrate them into a single codebase. Codemetapy could easily be extended with support for other formats (either from the crosswalk or not); the zenodo.json from proycon/codemetapy#29 would be a good candidate to add at this level.

Codemetar is by the original codemeta authors, handles only conversion from R, and is more limited in scope than codemetapy.

But on the other hand one wants to use these on their own, and the resulting codemeta.json should be complete, or not?

Codemetapy should be usable standalone, yes; whether the output is 'complete' depends on how exactly it is invoked and on what you consider 'complete'. Codemeta-harvester by definition relies on codemetapy.

The separation is mainly intended to prevent codemetapy from becoming bloated with all kinds of dependencies (most of which would be optional anyway). If it's an external tool we want to use, then it goes into codemeta-harvester.
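
To illustrate that split, a sketch of how a glue layer might treat an external tool as optional. This is written in Python purely for illustration and does not reflect how codemeta-harvester is actually implemented; run_external_tool is a hypothetical name:

```python
import shutil
import subprocess


def run_external_tool(command, repo_path):
    """Run an optional external tool on a local clone.

    Returns the tool's stdout, or None when the tool is not installed
    or exits with a non-zero status, so the caller can skip the step.
    """
    executable = shutil.which(command)
    if executable is None:
        return None  # tool not available: nothing to harvest here
    result = subprocess.run(
        [executable, repo_path],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout if result.returncode == 0 else None
```

With this shape, a call such as run_external_tool("github-linguist", "jekyll") would mirror the command shown earlier in the thread, while the lower-level library stays free of that dependency.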

I am asking, because I could implement this here to fill the codemeta keys:

programmingLanguage, fileFormat, fileSize

That would be very welcome!

proycon / codemeta-harvester

Extract programming language and files information from repository #6