proycon / codemeta-harvester

Harvest and aggregate codemeta/schema.org software metadata from source repositories and service endpoints, automatically converting from known metadata schemes in the process
GNU General Public License v3.0

Codemeta Harvester

Project Status: Active -- The project has reached a stable, usable state and is being actively developed.

This is a harvester for software metadata. It attempts to detect software metadata in source code repositories and converts what it finds to a unified codemeta representation.

The tool is implemented as a simple POSIX shell script that in turn invokes a number of tools, codemetapy among them, to do the actual work.

A few additional metadata extraction methods have been implemented alongside the main script, also as simple shell scripts.

This harvester can be used for two purposes:

  1. to harvest a possibly large number of software projects, for instance to make them available in a search portal;
  2. to produce a codemeta.json file for your own project.

Installation

A Docker container image can be built as follows:

make docker

A pre-built container image can also be pulled from Docker Hub:

docker pull proycon/codemeta-harvester

Alternatively, if you prefer not to use containers, you can also install the software as follows:

You can use make devenv if you want to rely on the latest development release of codemetapy rather than the latest stable version. This creates a devenv/ directory instead of env/.

Usage: producing codemeta for your project

In your project directory, which ideally should be a git clone, you can just run codemeta-harvester to create a codemeta.json file based on the files in your repository:

codemeta-harvester

If you use the Docker container, the syntax is as follows:

docker run -v $(pwd):/data proycon/codemeta-harvester

The -v argument mounts your current working directory in the container; adapt it to your needs.

If you want to regenerate an existing codemeta.json rather than use it as input, which is the default behaviour, add the --regen parameter. This overwrites any existing codemeta.json.

The harvester can use the GitHub/GitLab API to query metadata from GitHub/GitLab, but these APIs allow only a limited number of anonymous requests. Please set the environment variable $GITHUB_TOKEN or $GITLAB_TOKEN to a GitHub or GitLab personal access token. If you use Docker, pass it to the container using --env GITHUB_TOKEN=$GITHUB_TOKEN or --env GITLAB_TOKEN=$GITLAB_TOKEN.
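As a minimal sketch of the token handling described above, the following POSIX shell function runs the container on the current directory and forwards only those tokens that are actually set, using docker run's --env option. The function name run_harvester and the DOCKER override variable are hypothetical and exist only for this illustration; they are not part of codemeta-harvester.

```shell
#!/bin/sh
# Hypothetical wrapper, not part of codemeta-harvester.
# Runs the container on the current directory and forwards the API tokens
# that are set. Extra arguments are passed through to the harvester.
run_harvester() {
    # Build the docker argument list from right to left in the positional
    # parameters, so that values containing spaces stay intact.
    set -- proycon/codemeta-harvester "$@"
    if [ -n "${GITLAB_TOKEN:-}" ]; then
        set -- --env "GITLAB_TOKEN=$GITLAB_TOKEN" "$@"
    fi
    if [ -n "${GITHUB_TOKEN:-}" ]; then
        set -- --env "GITHUB_TOKEN=$GITHUB_TOKEN" "$@"
    fi
    set -- run -v "$(pwd):/data" "$@"
    # DOCKER is an assumed override for this sketch: setting DOCKER=echo
    # prints the arguments instead of starting a container.
    "${DOCKER:-docker}" "$@"
}
```

Calling run_harvester with no arguments should behave like the plain docker run line shown earlier, plus the token options. Setting DOCKER=echo first gives a dry run that only prints the arguments, which is a convenient way to check what would be passed before a token is sent anywhere.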

Usage: harvesting metadata for various projects

To harvest and collect metadata from various projects, you need to create configuration files that tell the harvester where to look. These are simple YAML configuration files, one for each tool to harvest. They are put into a directory of your choice and take the following format:

source: https://github.com/user/repo
services:
    - https://example.org

The source property specifies a single source code repository where the source code of the tool lives. This must be a git repository that is publicly accessible. Note that you can specify only one repository here; choose the one that is representative of the software as a whole.

The services property lists zero or more URLs where the tool can be accessed as a service. This may be a web application, a simple webpage, or some other form of webservice. For webservices, rather than enumerating all service endpoints individually, point this to a URL that itself provides a specification of the endpoints, for example a URL serving an OpenAPI specification. The information provided here will be expressed in the resulting codemeta.json through the targetProduct schema.org property as described in issue codemeta/codemeta#271. This links the source code to specific instantiations of the software.
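When many tools need a configuration file, writing them by hand gets tedious. The sketch below generates one file per tool in the format shown above. The helper name write_config and the .yml file extension are assumptions made for this illustration, so check which extension your version of the harvester expects.

```shell
#!/bin/sh
# Hypothetical helper, not part of codemeta-harvester.
# Usage: write_config CONFIGDIR NAME SOURCE_URL [SERVICE_URL...]
# Writes CONFIGDIR/NAME.yml (the .yml extension is an assumption).
write_config() {
    configdir=$1
    name=$2
    source_url=$3
    shift 3
    mkdir -p "$configdir"
    {
        # Exactly one source repository per tool
        printf 'source: %s\n' "$source_url"
        # The services list is optional: zero or more URLs
        if [ "$#" -gt 0 ]; then
            printf 'services:\n'
            for url in "$@"; do
                printf '    - %s\n' "$url"
            done
        fi
    } > "$configdir/$name.yml"
}
```

For example, write_config ./configdir mytool https://github.com/user/repo https://example.org reproduces the configuration shown above; leaving out the last argument produces a file with only the source line.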

Additional properties you may specify:

Pass the directory where you put your configurations (or a single configuration file) to codemeta-harvester as follows:

codemeta-harvester /path/to/your/configdir/

Or for Docker:

docker run -v /path/to/your/configdir/:/config -v $(pwd):/data proycon/codemeta-harvester /config
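Before starting a long harvest it can help to check the configuration directory for obvious mistakes. The following sketch checks that every configuration file contains exactly one source entry, since only one repository may be specified per tool. The function name check_configs and the .yml extension are assumptions for illustration; the check is a plain text match, not a real YAML parser.

```shell
#!/bin/sh
# Hypothetical pre-flight check, not part of codemeta-harvester.
# Reports configuration files that do not contain exactly one "source:" line.
check_configs() {
    configdir=$1
    status=0
    for f in "$configdir"/*.yml; do
        # Skip the unexpanded pattern when no files match
        [ -e "$f" ] || continue
        # grep -c exits non-zero when it counts zero matches, hence "|| true"
        count=$(grep -c '^source:' "$f" || true)
        if [ "$count" -ne 1 ]; then
            printf '%s: expected one source entry, found %s\n' "$f" "$count" >&2
            status=1
        fi
    done
    return "$status"
}
```

A possible use is check_configs /path/to/your/configdir/ && codemeta-harvester /path/to/your/configdir/, so that the harvest only starts when every file passes.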

Composition and precedence

Codemeta-harvester relies on codemetapy to combine different input sources into one codemeta.json. We call this composition.

When an input source defines a property (on schema:SoftwareSourceCode), it overwrites any value that was set by a previous source. This means there is an order of precedence: codemeta-harvester considers some sources more important than others. The priority is roughly the following:

  1. codemeta.json: if this file is provided, the harvester won't look at anything else (aside from the three exceptions mentioned at the end).
  2. codemeta-harvest.json
  3. CITATION.cff
  4. Language-specific metadata from setup.py, pyproject.toml, pom.xml, package.json and similar.
  5. Files such as LICENSE, CONTRIBUTORS, AUTHORS, README
  6. Information from git (e.g. contributors, git tag for version, date of first/last commit)
  7. Information from the GitHub or GitLab API (e.g. project name/description)

Three notable exceptions are:

  1. For development status, the repostatus badge in the README.md in the git master/main branch takes precedence over all else (overriding whatever is in codemeta.json!)
  2. For maintainers, the parsing of MAINTAINERS in the git master/main branch is always taken into account (merged with anything in codemeta.json)
  3. If the harvester finds a version-specific DOI at Zenodo for your software, it will always use that (overriding whatever is in codemeta.json)
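To see which file-based sources in a checkout would take part in composition, the priority list above can be sketched as a small shell function. This is an illustration only and not how codemeta-harvester is implemented: the function name list_metadata_sources is hypothetical, it checks only the exact file names from the list (so variants such as README.md are not matched), and it ignores the git and API sources as well as the three exceptions.

```shell
#!/bin/sh
# Illustrative only, not the harvester's own logic.
# Prints the file-based metadata sources present in a directory,
# highest priority first, following the list above.
list_metadata_sources() {
    dir=${1:-.}
    # Rule 1: an existing codemeta.json takes the place of all other sources
    if [ -f "$dir/codemeta.json" ]; then
        printf '%s\n' codemeta.json
        return 0
    fi
    for f in codemeta-harvest.json CITATION.cff \
             setup.py pyproject.toml pom.xml package.json \
             LICENSE CONTRIBUTORS AUTHORS README; do
        if [ -f "$dir/$f" ]; then
            printf '%s\n' "$f"
        fi
    done
}
```

Running list_metadata_sources in a project directory prints one file name per line. If codemeta.json is the only line printed, the other files are skipped by default, which is the situation the --regen parameter is meant for.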

Acknowledgement

This software was funded in the scope of the CLARIAH-PLUS project.