katrinleinweber opened this issue 6 years ago
We're asking ourselves the same question! cc @noamross @maelle , who are talking about how this would work as part of the onboarding requirements / checks at rOpenSci (see https://github.com/ropensci/onboarding).
In practice, we've found that it's often better if more metadata can be added to the DESCRIPTION files than most authors currently provide (in particular, it's really nice for codemeta.json to have ORCID iDs, and though DESCRIPTION files now support them, thanks in part to codemeta, few authors have adopted this so far).
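For reference, that support looks roughly like this in a DESCRIPTION file's Authors@R field (the person, email address, and ORCID iD below are made-up examples):

Authors@R: person("Jane", "Doe",
                  email = "jane.doe@example.org",
                  role = c("aut", "cre"),
                  comment = c(ORCID = "0000-0002-1825-0097"))

codemetar can then carry the iD over into the author entries of codemeta.json.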
Automated PRs across GH is an interesting but more provocative question; I know this is something @arfon has thought about wrt codemeta.json, and I'd be curious to hear his latest thinking.
Anything automated is considered spam by GitHub and you'd quickly be banned. Something that makes it super-simple for a human to open a pull request to someone else's repository is likely more acceptable.
It's the latter ;-)
related to #20
I now think that a bottom-up approach would not be successful, because it essentially means pushing "speculative complexity" towards projects (and into their repos). "Speculative", because one hopes that a repository will make use of the codemeta.json. Complexity, because it adds an auto-generated file to the Git repo and one step to the build/release process.
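To make that step concrete, it would amount to something like the following in an R package's release checklist or CI script (just a sketch, not a prescribed workflow):

# from the package root: regenerate codemeta.json before each release
codemetar::write_codemeta()
# ...and then commit the refreshed, auto-generated file, e.g.
# git add codemeta.json && git commit -m "Update codemeta.json"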
I understood CodeMeta's main goal as "improv[ing] how [repositories] talk to each other" in order to close gaps in software "preservation, discovery, reuse, and attribution" at the "infrastructure" level. Thus, the strongest argument for a bottom-up approach would IMHO be prototyping the metadata aggregation (currently in codemeta.json), no? But with codemetar, is that happening at the "infrastructure" level? Currently, the infrastructure providers seem to be reducing complexity on their end (by reading in a file, rather than aggregating the metadata themselves), but is that aligned with CodeMeta's goals?
If not, maybe a long-term winning strategy (and more elegant handling of the complexity) on their level would be to roll out a "2nd system" for repositories to integrate into their import procedures. Metaphorically speaking: The metadata camel needs to go through the eye of the repo-needle at some point, so it should be lubricated then and there ;-)
Good questions, with no simple answers, but maybe I can add some perspective.
Personally, I agree with you that getting major repositories on board, in more of a top-down approach, is most efficient in realizing systematic change. That was reflected from basically the start of codemeta, where most of the initial workshop participants represented major repositories: https://codemeta.github.io/workshop/
Second, it is worth noting that Zenodo already does something quite like this with its GitHub integration, and has for some time. That is, it parses information, not from DESCRIPTION files, but from GitHub metadata, and constructs a metadata record which it can provide in JSON-LD form. However, despite their central importance, repositories like Zenodo have very limited capacity for additional developer support, so I don't think they would ever take it upon themselves to implement a direct parser for R DESCRIPTION files and then similarly for all possible languages. Zenodo and other repositories that participated in the codemeta workshop are still interested in tackling the 'easier' case of parsing a standardized format like codemeta.json that can be shared across languages, but even adopting that is a big lift on top of their existing infrastructure and limited capacity, so we're still waiting for it to happen.
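To illustrate what the 'easier' case buys: once a codemeta.json exists, a repository can read the same standardized fields regardless of the source language. A minimal sketch in R with jsonlite, purely for illustration (the field names are the usual schema.org / codemeta terms; this is not Zenodo's actual pipeline):

library(jsonlite)
meta <- read_json("codemeta.json")
meta$name                      # software title
meta$description               # abstract / description
# author names (ORCID iDs, where present, sit in each author's "@id")
vapply(meta$author, function(a) a$familyName, character(1))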
There's also a flip side, in which not all of this is easily automated. codemetar now gives "opinions" back to the user to encourage best practices for where and how authors can add additional metadata, but without such interaction with the user, the metadata will be more limited and not always perfectly correct. The R package model is also already somewhat richer in metadata than many other languages; in some (possibly many) cases, what might be much more useful is a simple generic tool for authoring codemeta.json manually, agnostic of the programming language, rather than populating it 'automatically'. @arfon has some nice Ruby-based CLI tools for this; I'm meaning to add a web-based UI when I get a chance as well.
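Roughly, the codemetar side of that interaction looks like this (a sketch; give_opinions() is my recollection of the helper's name, and the exact interface may differ):

library(codemetar)
write_codemeta(".")   # generate codemeta.json from DESCRIPTION, README, etc.
give_opinions(".")    # print suggestions ("opinions") for improving the package metadata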
So in the long run, I agree that we are likely to get mostly only the extreme tail of users to adopt a 'codemeta.json' file in a purely bottom-up manner, but I also think there's a key role for organizations like ours to play in coordinating / facilitating the effort of the repositories on one hand and field-testing the approaches on the other.
Thanks for continuing the discussion :-)
[…] despite their central importance, repositories like Zenodo have very limited capacity for additional developer support, so I don't think they would ever take it upon themselves to implement a direct parser for R DESCRIPTION files and then similarly for all possible languages
Agreed, they should never have to. So, the next/real question may be how easy or difficult it is for them to integrate (Caution: pseudo-code!):
if [ "$import" = "R package" ]; then
  # R packages: let codemetar generate the file
  Rscript -e 'codemetar::write_codemeta()'
elif [ "$import" = "Python module" ]; then
  # Python modules: e.g. via codemetapy
  pip show -v somepackage | codemetapy > codemeta.json
fi
# ... analogous branches for other languages, then, as before:
metadata=$(cat codemeta.json)
into their import pipelines, isn't it?
About the "opinions": I know (#174) ;-) There are interesting use-cases for that, both offline and during the repo-import. I'm not sure where to write about them, but here is not the best place, IMHO.
@katrinleinweber
So, the next/real question may be how easy or difficult it is for them to integrate ...
Ah, right, I see what you mean now. Still, running something like R on incoming packages at that scale is a whole 'nother ball game from parsing some JSON data...
In terms of the complexity of the necessary pipeline (CI, VMs, etc.), or in terms of the computational load on the repository providers' servers?
I imagine it as a step before reading a codemeta.json, as before; just one where the file is self-generated as an intermediate step, instead of already being part of the ingested set of files.
Sorry, I was really just speculating above, which isn't that helpful. Really, this is a discussion we should have with the individual providers.
https://github.com/zenodo/zenodo/issues/1504 is thinking about this as well :-)
:eyes: nice!
Should we close this issue, since it's more of a general discussion, and take the convo to https://discuss.ropensci.org/?
I'd prefer to keep it in the one place where it started.
Also related to https://github.com/force11/force11-sciwg/issues/36, https://github.com/ropensci/codemetar/issues/3 and https://github.com/ropensci/codemetar/issues/20.
To practice shell programming, I had created a gist to clone an R package repo, generate a codemeta.json, and prepare a pull/merge request (EDIT: since deleted, for the reasons given above; a rough sketch of the idea follows below). I was wondering whether that kind of approach would be OK? Productive procrastination, but also kind of cold-calling.
Or is the consensus rather that codemeta.json generation should happen within the workflows that people already use, and as automatically as possible?
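For context, this is a rough R sketch of the same idea, not the actual gist (the repository URL and branch name are hypothetical, and the final push / pull request would still be opened by a human):

# clone an R package repo, generate codemeta.json, and stage a PR branch
repo <- "https://github.com/example/somepackage"   # hypothetical target repo
dir  <- tempfile("codemeta-")
system2("git", c("clone", repo, dir))
codemetar::write_codemeta(dir, path = file.path(dir, "codemeta.json"))
system2("git", c("-C", dir, "checkout", "-b", "add-codemeta"))
system2("git", c("-C", dir, "add", "codemeta.json"))
system2("git", c("-C", dir, "commit", "-m", "Add codemeta.json"))
# a human would then fork/push and open the pull/merge request themselves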