ropensci / codemetar

an R package for generating and working with codemeta
https://docs.ropensci.org/codemetar

Bottom-up / crowdsourcing approach? #57

Open katrinleinweber opened 6 years ago

katrinleinweber commented 6 years ago

Also related to https://github.com/force11/force11-sciwg/issues/36, https://github.com/ropensci/codemetar/issues/3 and https://github.com/ropensci/codemetar/issues/20.

To practice shell programming, I created this gist to clone an R package repo, generate a codemeta.json, and prepare a pull/merge request. EDIT: deleted for the reasons given below.

I was wondering whether that kind of approach would be OK? Productive procrastination, but also kind of cold-calling.

Or is the consensus rather that codemeta.json generation should happen within the workflows that people already use, and as automatically as possible?

cboettig commented 6 years ago

We're asking ourselves the same question! cc @noamross @maelle , who are talking about how this would work as part of the onboarding requirements / checks at rOpenSci (see https://github.com/ropensci/onboarding).

Automated PRs across GH raise an interesting but more provocative question; I know this is something @arfon has thought about wrt codemeta.json, and I'd be curious to hear his latest thinking.

In practice, we've found that it often helps if authors add a bit more metadata to their DESCRIPTION files than most currently do (in particular, it's really nice for codemeta.json to have ORCID iDs, and though DESCRIPTION files now support them, thanks in part to codemeta, few authors have adopted this so far).
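
For illustration (the author name and ORCID iD below are made up), this is the kind of value for the Authors@R field in DESCRIPTION that codemetar can then carry over into codemeta.json:

# value of the Authors@R field in DESCRIPTION; person() is base R (utils)
person("Jane", "Doe", email = "jane.doe@example.org",
       role = c("aut", "cre"),
       comment = c(ORCID = "0000-0002-1825-0097"))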

arfon commented 6 years ago

Automated PRs across GH raise an interesting but more provocative question; I know this is something @arfon has thought about wrt codemeta.json, and I'd be curious to hear his latest thinking.

Anything automated is considered spam by GitHub and you'd quickly be banned. Something that makes it super-simple for a human to open a pull request to someone else's repository is likely more acceptable.

katrinleinweber commented 6 years ago

It's the latter ;-)

maelle commented 6 years ago

related to #20

katrinleinweber commented 6 years ago

I now think that a bottom-up approach would not be successful, because it essentially means pushing "speculative complexity" towards projects (and into their repos). "Speculative", because one hopes that a repository will make use of the codemeta.json. Complexity, because it adds an auto-generated file to the Git repo and one step to the build/release process.
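
To make that extra step concrete: for an R package it boils down to one regeneration call somewhere in the maintainer's release routine, roughly like this (a sketch, not a prescribed workflow):

# re-generate codemeta.json before each release (or from a pre-commit hook / CI job)
codemetar::write_codemeta()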

I understood CodeMeta's main goal as "improv[ing] how [repositories] talk to each other" in order to close gaps in software "preservation, discovery, reuse, and attribution" at the "infrastructure" level. Thus, IMHO the strongest argument for a bottom-up approach would be prototyping the metadata aggregation (currently in codemeta.json), no? But with codemetar, is that happening at the "infrastructure" level? Currently, the infrastructure providers seem to be reducing complexity on their end (by reading in a file rather than aggregating the metadata themselves), but is that aligned with CodeMeta's goals?

If not, maybe a long-term winning strategy (and a more elegant way of handling the complexity) at their level would be to roll out a "2nd system" for repositories to integrate into their import procedures. Metaphorically speaking: the metadata camel needs to go through the eye of the repo-needle at some point, so it should be lubricated then and there ;-)

cboettig commented 6 years ago

Good questions, with no simple answers, but maybe I can add some perspective.

Personally, I agree with you that getting major repositories on board, in more of a top-down approach, is the most efficient way to realize systematic change. That has been reflected basically from the start of codemeta, where most of the initial workshop participants represented major repositories: https://codemeta.github.io/workshop/

Second, it is worth noting that Zenodo already does something quite like this with its GitHub integration, and has for some time. That is, it parses information, not from DESCRIPTION files, but from GitHub metadata, and constructs a metadata record which it can provide in JSON-LD form. However, despite their central importance, repositories like Zenodo have very limited capacity for additional developer support, so I don't think they would ever take it upon themselves to implement a direct parser for R DESCRIPTION files and then similarly for all possible languages. Zenodo and other repositories that participated in the codemeta workshop are still interested in tackling the 'easier' case of parsing a standardized format like codemeta.json that can be shared across languages, but even adopting something like that is a big lift on top of their existing infrastructure and limited capacity, so we're still waiting for that to happen.
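
Just to make that "easier case" tangible: once a codemeta.json exists, consuming it takes the same few lines no matter which language produced the package. A rough sketch in R (the field names come from the codemeta/schema.org vocabulary; which fields a repository actually needs would vary):

# read codemeta.json and pull out a few standard fields (illustrative only)
meta <- jsonlite::read_json("codemeta.json")
meta$name            # software name
meta$codeRepository  # e.g. the GitHub URL
meta$license         # license as an SPDX URL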

There's also a flip side in which not all of this is easily automated. codemetar now gives "opinions" back to the user to encourage best practices of where and how authors can add additional metadata, but without such interaction with the user, the metadata will be more limited and not always perfectly correct. The R package model is also already somewhat richer in metadata than that of many other languages. In some (possibly many) cases, what might be much more useful is a simple, generic tool for authoring codemeta.json manually, agnostic of the programming language, rather than 'automatically' populating it (@arfon has some nice Ruby-based CLI tools for this; I'm meaning to add a web-based UI when I get a chance as well).
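
(For anyone curious what those "opinions" look like: if I remember the interface correctly, they can also be requested directly, e.g.

# ask codemetar for suggestions on enriching a package's metadata
# (function name from memory; see the codemetar documentation)
codemetar::give_opinions(".")

which reports concrete ways to improve the metadata in DESCRIPTION and friends.)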

So in the long run, I agree that we are likely to get mostly only the extreme tail of users to adopt a 'codemeta.json' file in a purely bottom-up manner, but I also think there's a key role for organizations like ours to play in coordinating / facilitating the effort of the repositories on one hand and field-testing the approaches on the other.

katrinleinweber commented 6 years ago

Thanks for continuing the discussion :-)

[…] despite their central importance, repositories like Zenodo have very limited capacity for additional developer support, so I don't think they would ever take it upon themselves to implement a direct parser for R DESCRIPTION files and then similarly for all possible languages

Agreed, they should never have to. So the next/real question may be how easy or difficult it would be for them to integrate something like the following (caution: only a rough, R-flavoured sketch!)

if (import_type == "R package") {
  system('Rscript -e "codemetar::write_codemeta()"')              # let codemetar generate codemeta.json
} else if (import_type == "Python module") {
  system("pip show -v somepackage | codemetapy > codemeta.json")  # e.g. via codemetapy
} # ... one branch per language / packaging ecosystem

metadata <- jsonlite::read_json("codemeta.json")  # from here on, a single standard format

into their import pipelines, isn't it?

About the "opinions": I know (#174) ;-) There are interesting use-cases for that, both offline and during the repo-import. I'm not sure where to write about them, but here is not the best place, IMHO.

cboettig commented 6 years ago

@katrinleinweber

So the next/real question may be how easy or difficult it would be for them to integrate ...

Ah, right, I see what you mean now. Still, running something like R on incoming packages at that scale is a whole 'nother ball game from parsing some JSON data...

katrinleinweber commented 6 years ago

In terms of the complexity of the necessary pipeline (CI, VMs, etc.), or in terms of the computational load on the repository providers' servers?

I imagine it as a step before reading a codemeta.json, just as before; only that the file would be generated on the fly as an intermediate artifact, instead of already being part of the ingested set of files.

cboettig commented 6 years ago

Sorry, I was really just speculating above, which isn't that helpful. Really, this is a discussion we should have with the individual providers.

katrinleinweber commented 5 years ago

https://github.com/zenodo/zenodo/issues/1504 is thinking about this as well :-)

cboettig commented 5 years ago

:eyes: nice!

maelle commented 5 years ago

Should we close this issue, since it's more of a general discussion, and take the convo to https://discuss.ropensci.org/?

katrinleinweber commented 5 years ago

I'd prefer to keep it in the one place where it started.