Open broeder-j opened 2 years ago
I agree, getting programming languages from the git source would be a good feature. Surely there must be tools available that we could use out-of-the-box to do this, but I'm not aware of them yet. The Github API does provide this information, but I'd rather get it on the lower git level.
Here is what GitHub and GitLab use for the language detection: https://github.com/github/linguist
For a single file one gets:
$ github-linguist grammars.yml
grammars.yml: 884 lines (884 sloc)
type: Text
mime type: text/x-yaml
language: YAML
For a full repo, the stats are:
$ github-linguist jekyll
70.64% 709959 Ruby
23.04% 231555 Gherkin
3.80% 38178 JavaScript
1.19% 11943 HTML
0.79% 7900 Shell
0.23% 2279 Dockerfile
0.13% 1344 Earthly
0.10% 1019 CSS
0.06% 606 SCSS
0.02% 234 CoffeeScript
0.01% 90 Hack
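As a rough illustration of how these repo stats could feed the programmingLanguage key, here is a minimal Python sketch. It assumes the github-linguist executable is on the PATH and that its repo output has the `percent bytes language` layout shown above. The function names and the 1% cut-off are hypothetical, not part of linguist, codemetapy or codemeta-harvester.

```python
import shutil
import subprocess


def parse_linguist_stats(text):
    """Parse rows like '70.64% 709959 Ruby' into a list of dicts."""
    stats = []
    for line in text.splitlines():
        parts = line.split(None, 2)  # language names may contain spaces
        if len(parts) != 3 or not parts[0].endswith("%"):
            continue  # skip anything that is not a stats row
        try:
            percent = float(parts[0].rstrip("%"))
            size = int(parts[1])
        except ValueError:
            continue
        stats.append({"name": parts[2].strip(), "percent": percent, "bytes": size})
    return stats


def to_programming_language(stats, min_percent=1.0):
    """Keep language names at or above a (hypothetical) percentage threshold."""
    return [s["name"] for s in stats if s["percent"] >= min_percent]


def run_linguist(repo_path):
    """Run github-linguist on a local clone; return '' if it is not installed."""
    if shutil.which("github-linguist") is None:
        return ""
    result = subprocess.run(
        ["github-linguist", repo_path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Calling `to_programming_language(parse_linguist_stats(run_linguist("jekyll")))` would then give a list of plain language names. Whether these should be emitted as plain strings or as richer objects in codemeta.json is left open here.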
In my view it would be ok as a codemeta-harvester dependency, though not for codemetapy. Maybe one could also recycle something from what they use there.
For me it is still unclear what should go where for codemetapy, codemetar and codemeta-harvester. Ideally I would imagine a structure like this:
But on the other hand one wants to use these on their own, and the resulting codemeta.json should be complete, or not?
I am asking, because I could implement this here to fill the codemeta keys:
programmingLanguage, fileFormat, fileSize
Here is what GitHub and GitLab use for the language detection: https://github.com/github/linguist
Great, that looks like a good option to integrate. This would be called from codemeta-harvester indeed.
For me it is still unclear what should go where for codemetapy, codemetar and codemeta-harvester.
I can imagine the confusion, yes. Codemeta-harvester is a higher-level script that invokes various other tools to extract metadata and convert it to codemeta; it is basically the glue that ties multiple components together. It does most of the preparatory work before delegating control to codemetapy, on which it relies heavily for the actual work of parsing and reconciling multiple codemeta graphs. Codemetapy itself does not rely on other tools (aside from some libraries like rdflib) and implements all of the lower-level logic to parse and serialize codemeta from/to different formats. Codemetapy is not limited to Python: it also handles various other formats from the codemeta crosswalks (NPM package.json, Java pom.xml, etc.), as well as some that are not in the codemeta crosswalk at all (e.g. GitHub API, GitLab API, AUTHORS/CONTRIBUTORS, etc.). Whilst these could conceivably be separate tools, they share a lot of code, so it made most sense to integrate them into a single codebase. Codemetapy could easily be extended with support for other formats (either from the crosswalk or not); the zenodo.json from proycon/codemetapy#29 would be a good candidate to add at this level.
Codemetar is by the original codemeta authors, handles only conversion from R, and is more limited in scope than codemetapy.
But on the other hand one wants to use these on their own, and the resulting codemeta.json should be complete, or not?
Codemetapy should be usable standalone, yes; whether the output is 'complete' depends on how exactly it is invoked and on what you consider 'complete'. Codemeta-harvester by definition relies on codemetapy.
The separation is mainly intended to prevent codemetapy from becoming bloated with all kinds of dependencies (most of which would be optional anyway). If it's an external tool we want to use, then it goes into codemeta-harvester.
I am asking, because I could implement this here to fill the codemeta keys:
programmingLanguage, fileFormat, fileSize
That would be very welcome!
Since the harvester basically requires a git clone, it could gather information about the programming language, and other statistics on the files and the repository, from the local repo.
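To make the idea of gathering statistics from the local clone concrete, here is a minimal Python sketch using only git and the standard library. The function names and the returned keys (`total_bytes`, `mime_types`) are hypothetical. How these values would map onto the codemeta fileSize and fileFormat keys (units, one value per repo or per file) is an open question that this sketch does not settle.

```python
import mimetypes
import os
import subprocess


def tracked_files(repo_path):
    """List files tracked by git in a local clone (paths relative to the repo root)."""
    out = subprocess.run(
        ["git", "-C", repo_path, "ls-files", "-z"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [p for p in out.split("\0") if p]


def file_stats(repo_path, paths):
    """Return the total size in bytes and the sorted MIME types guessed from file names."""
    total = 0
    formats = set()
    for rel in paths:
        full = os.path.join(repo_path, rel)
        if not os.path.isfile(full):
            continue  # e.g. submodule entries or files missing from the worktree
        total += os.path.getsize(full)
        mime, _ = mimetypes.guess_type(rel)  # guess based on the extension only
        if mime:
            formats.add(mime)
    return {"total_bytes": total, "mime_types": sorted(formats)}
```

Usage would be `file_stats(repo, tracked_files(repo))` on the cloned repository. Guessing MIME types from file extensions is a simplification; linguist inspects file contents as well.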
@proycon Do you know of any harvesting tool that already gets these things nicely?
https://livablesoftware.com/tools-mine-analyze-github-git-software-data/