proycon / codemeta-harvester

Harvest and aggregate codemeta/schema.org software metadata from source repositories and service endpoints, automatically converting from known metadata schemes in the process
GNU General Public License v3.0

Harvest DOI #3

Closed proycon closed 2 years ago

proycon commented 2 years ago

Add as extra schema:identifier

broeder-j commented 2 years ago

I have not looked at your bash code in detail, but from what I understand it looks at badges in the README, and codemetapy also looks at badges in the HTML. Does codemetapy check for DOIs there, or mainly for repo status?

For GitLab, the badges are not in the repo but are displayed in the GitLab HTML and retrievable via the API (https://docs.gitlab.com/ee/api/project_badges.html). One could look for DOIs in these.

proycon commented 2 years ago

It does not check for DOIs yet (hence this issue), but this is indeed something that is on my radar. Harvesting from README badges should be fairly easy. Good point about gitlab though, that might need some extra work in the codemeta.parsers.gitapi module.
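As a rough illustration of the README badge idea, here is a minimal sketch in Python. It is not the code used by codemeta-harvester or codemetapy; the function name `find_doi_candidates` and both regular expressions are assumptions made for this example, covering only Zenodo DOIs and Zenodo /latestdoi/ badge links.

```python
import re

# Zenodo-registered DOIs, e.g. 10.5281/zenodo.6882966 (pattern is an assumption)
ZENODO_DOI = re.compile(r"10\.5281/zenodo\.\d+")
# Zenodo "latest DOI" badge targets, e.g. https://zenodo.org/badge/latestdoi/20526435
LATESTDOI_URL = re.compile(r"https?://zenodo\.org/badge/latestdoi/\d+")


def find_doi_candidates(readme_text):
    """Return (dois, latestdoi_urls) found in README text.

    Both lists are deduplicated and keep the order of first appearance.
    Hypothetical helper for illustration only.
    """
    dois = list(dict.fromkeys(ZENODO_DOI.findall(readme_text)))
    urls = list(dict.fromkeys(LATESTDOI_URL.findall(readme_text)))
    return dois, urls


if __name__ == "__main__":
    readme = (
        "[![DOI](https://zenodo.org/badge/20526435.svg)]"
        "(https://zenodo.org/badge/latestdoi/20526435)\n"
    )
    print(find_doi_candidates(readme))
```

A literal DOI found this way can be used directly; a /latestdoi/ link still has to be resolved against Zenodo.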

proycon commented 2 years ago

Harvesting Zenodo DOI badges from READMEs is not as straightforward as I thought. Ideally each software version gets its own DOI; that is also what Zenodo does by default, and it works nicely with the default GitHub-Zenodo integration. But there is a catch-22 / chicken-and-egg problem if the DOI is to be included in the release itself (for example as a badge in the README.md, which we can harvest):

The version-specific DOI does not exist until the version is released and archived by Zenodo. At that point it can no longer be included in the README belonging to that same release (as it's already released). Zenodo has a solution for this: one can include a /latestdoi/ badge that automatically resolves to the latest DOI for the software. Such a URL can be queried to get JSON-LD from the Zenodo API, which returns the actual DOI:

$ curl --header "Accept: application/ld+json" -L https://zenodo.org/badge/latestdoi/20526435
{
  "@context": "http://schema.org",
  "@type": "SoftwareSourceCode",
  "@id": "https://doi.org/10.5281/zenodo.6882966",
  "identifier": {
    "@type": "PropertyValue",
    "propertyID": "URL",
    "value": "https://zenodo.org/record/6882966"
  },
  "url": "https://zenodo.org/record/6882966",
  "name": "LanguageMachines/frog: v0.25",

Retrieving a DOI this way only works for the latest release; DOIs of earlier releases can't be retrieved via the badge. If we harvest automatically, we will therefore always assign the DOI of the latest release, even if the harvested metadata describes an earlier release. This is sub-optimal and violates the important principle of distinct identifiers for distinct versions.
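For reference, the curl lookup above could be sketched in Python roughly as follows. The function names are made up for this example, and it assumes the badge URL still redirects to a record that honours the `Accept: application/ld+json` header, as in the curl output shown; the result remains subject to the latest-release limitation just described.

```python
import json
import urllib.request

DOI_PREFIX = "https://doi.org/"


def doi_from_jsonld(document):
    """Return the bare DOI from the "@id" of a JSON-LD document, or None.

    Accepts either a JSON string or an already parsed dict.
    """
    data = json.loads(document) if isinstance(document, str) else document
    identifier = data.get("@id", "")
    if identifier.startswith(DOI_PREFIX):
        return identifier[len(DOI_PREFIX):]
    return None


def fetch_latest_doi(badge_url, timeout=30):
    """Follow a /latestdoi/ badge URL and return the DOI of the latest release.

    Hypothetical helper: urlopen follows redirects, like `curl -L`.
    """
    request = urllib.request.Request(
        badge_url, headers={"Accept": "application/ld+json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return doi_from_jsonld(json.load(response))
```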

It looks like what gitlab does is a better solution that circumvents this problem. But the general problem remains: it seems impossible to include a codemeta.json in a source repository that references a version-specific DOI, and then git tag the version. We can at most add a DOI at a later point (which is a valid use-case that codemeta-harvester also supports, like when used with https://github.com/CLARIAH/tool-discovery and https://github.com/proycon/codemeta-server), but even then automatically harvesting the version-specific one is a challenge.

proycon commented 2 years ago

Zenodo itself has some info on this here: https://help.zenodo.org/#versioning

broeder-j commented 2 years ago

But what most scientists want is a base DOI / concept DOI (like 10.5281/zenodo.594126 in the example you posted), which stays the same and collects the credit of all release citations. This base DOI is often in the README and in the badge. So if people say that is how it should be cited, one can use it in my opinion. You can get it from the Zenodo API using the current DOI:

https://zenodo.org/api/records/6882966 

If one can additionally get the DOI for the current/latest release somehow, that would also be fine.

Also, maybe one can query Zenodo for the 'right' DOI for the version saved in codemeta.json, since the software version is stored as related_identifier: "https://github.com/LanguageMachines/frog/tree/v0.25" and in the version: "v0.25" metadata entry.
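A minimal sketch of that lookup, in Python. The helper names are hypothetical, and the `doi` and `conceptdoi` keys are assumptions about the shape of the record JSON returned by the records API; check them against an actual response before relying on them.

```python
import re


def record_api_url(doi):
    """Map a Zenodo DOI such as 10.5281/zenodo.6882966 to its records API URL."""
    match = re.fullmatch(r"10\.5281/zenodo\.(\d+)", doi)
    if match is None:
        raise ValueError(f"not a Zenodo DOI: {doi!r}")
    return "https://zenodo.org/api/records/" + match.group(1)


def dois_from_record(record):
    """Return (version_doi, concept_doi) from a parsed record dict.

    Either value is None when the assumed key is absent.
    """
    return record.get("doi"), record.get("conceptdoi")


if __name__ == "__main__":
    print(record_api_url("10.5281/zenodo.6882966"))
```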

proycon commented 2 years ago

So if people say that is how it should be cited, one can use it in my opinion.

Zenodo itself recommends version DOIs over concept DOIs in the last link I gave, and I agree that's better: versions are important when considering things like scientific reproducibility. It is therefore also one of the principles in the paper FAIR Principles for Research Software:

F1.2. Different versions of the software are assigned distinct identifiers. To make different versions of the same software (or component) findable, each version needs to be assigned a different identifier. The relationship between versions is embodied in the associated metadata. What is considered a “version” is defined by the owner of the software: in many cases this will be something that the owner wants to specifically identify and use and/or “release” or “publish” so that others can use and reference/cite. There are existing software engineering practices (e.g., version control, semantic versioning) around the management and versioning of software that may form part of the implementation of these relationships. Capturing the relationships between different versions of software will lead to greater understanding of the evolution of code, its authorship, ownership, description and purpose,

Also, maybe one can query Zenodo for the 'right' DOI for the version saved in codemeta.json, since the software version is stored as related_identifier: "https://github.com/LanguageMachines/frog/tree/v0.25" and in the version: "v0.25" metadata entry.

That is a good idea! If zenodo allows querying on those keys and values then that would work.

Edit: querying works:

$ curl -i "https://zenodo.org/api/records/?q=related.identifier:\"https://github.com/LanguageMachines/frog/tree/v0.25\""

broeder-j commented 2 years ago

Some context: there is a project, https://github.com/hermes-hmc, which works on a generalized 'push' approach for software publications (like GitHub to Zenodo, but from a CI pipeline). Maybe one could think of a way to solve the chicken-and-egg DOI problem in the publishing process in this context. For example the DOI could also go into the release page, description and/or even git history for the tag, or maybe the CI could first 'reserve the doi' and then apply the last changes before uploading all files and finalizing the publications...

proycon commented 2 years ago

Interesting project.

For example the DOI could also go into the release page, description and/or even git history for the tag, or maybe the CI could first 'reserve the doi' and then apply the last changes before uploading all files and finalizing the publications...

Right, the release notes are a place where I have seen some people add a DOI; that would work (though as long as it's a manual effort I doubt it'll catch on enough).

I'm now implementing the Zenodo query approach you suggested, using related.identifier. That would at least solve it for the combination GitHub + Zenodo.
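The query approach might look roughly like the sketch below, in Python. This is not the code that ended up in codemeta-harvester. The function names are invented for this example, and the response layout (`hits.hits[]`, each with `doi` and `metadata.version`) is an assumption about the Zenodo search API that should be checked against a real response.

```python
from urllib.parse import urlencode

ZENODO_SEARCH = "https://zenodo.org/api/records/"


def search_url(repo_url, version):
    """Build a search URL for records related to a tagged GitHub tree.

    Mirrors the curl query above: related.identifier:"<repo>/tree/<version>".
    """
    tag_url = f"{repo_url.rstrip('/')}/tree/{version}"
    query = f'related.identifier:"{tag_url}"'
    return ZENODO_SEARCH + "?" + urlencode({"q": query})


def version_doi(search_response, version):
    """Return the DOI of the first hit whose metadata version matches, else None.

    `search_response` is the parsed JSON of a search result (assumed layout).
    """
    for hit in search_response.get("hits", {}).get("hits", []):
        if hit.get("metadata", {}).get("version") == version:
            return hit.get("doi")
    return None


if __name__ == "__main__":
    print(search_url("https://github.com/LanguageMachines/frog", "v0.25"))
```

Matching on the version as well as on the related identifier is a defensive choice in this sketch: it avoids picking up a record that merely references the tagged tree.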

proycon commented 2 years ago

This is now implemented and released (v0.3.0)