ropensci / roregistry

ropensci registry

spec.json - definition of metadata for each pkg #5

Closed sckott closed 5 years ago

sckott commented 8 years ago

started, working on now https://github.com/ropensci/roregistry/blob/master/spec.json

cboettig commented 7 years ago

Consider codemeta.json instead, https://codemeta.github.io/terms/, https://github.com/codemeta/codemetar

sckott commented 7 years ago

right, can we add arbitrary terms though ?

cboettig commented 7 years ago

I don't have an official word/test-case from CRAN about adding arbitrary terms in DESCRIPTION, but I'll try to find that out very soon. Of course roxygen/devtools add such terms, but maybe they are special exceptions.

maelle commented 6 years ago

I had forgotten about this issue and have written a schema.json in order to be able to use jsonvalidate. https://github.com/maelle/roregistry

If the schema/spec is instead a codemeta thing, even with arbitrary properties (e.g. contributer which is either community-contributed or staff-contributed), could the registry be validated? If so how?

maelle commented 6 years ago

Is the idea of using codemeta that all info contained in the registry should be in the DESCRIPTION of the packages, and then the registry would be created by using this info?

maelle commented 6 years ago

Sorry for commenting so many times instead of a single well-thought comment... Is this wrong:

sckott commented 6 years ago

Is the idea of using codemeta that all info contained in the registry should be in the DESCRIPTION of the packages

Not sure yet. I think we will likely add other information that does not reside in each individual package - e.g., adding URLs and such as needed (can't think of any examples right now).

cboettig commented 6 years ago

Is the idea of using codemeta that all info contained in the registry should be in the DESCRIPTION of the packages

Good question. Already codemeta.json includes some information it finds elsewhere: e.g. by reading CITATION files, README files, etc. The goal is simply to let the user write the data in the most logical / obvious spot and extract it, rather than telling a user to manually edit some metadata file which they will invariably forget to do or forget to update.

On our end, we might find it easier to maintain separate files, outside of the package repo, that have the additional information -- e.g. a JSON-LD file of affiliations of authors, or a list of categories we have added to a particular package, etc. If we are generating and maintaining that additional metadata, it might not make sense to keep that in the package repo at all, but rather in our own separate database. As long as it describes the same identifier (author id, package id), roregistry can ingest both sets and seamlessly merge them. This is the whole idea of the "Linked" in Linked Data.
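To make the merge idea concrete, here is a minimal sketch in plain Python. It is not how roregistry or codemetar actually does it: the record shapes, the `identifier` key, and the `merge_by_id` helper are all assumptions for illustration, and a real JSON-LD merge works on a graph of IRIs rather than on flat dictionaries.

```python
from collections import defaultdict


def merge_by_id(*sources):
    """Merge lists of records that describe the same "identifier".

    Hypothetical helper: later sources overwrite earlier ones when
    both define the same field for the same identifier.
    """
    merged = defaultdict(dict)
    for records in sources:
        for record in records:
            merged[record["identifier"]].update(record)
    return dict(merged)


# Made-up metadata extracted from a package repo
package_metadata = [
    {"identifier": "pkgA", "name": "pkgA", "version": "1.0.0"},
]

# Made-up extra metadata maintained outside the package repo
registry_metadata = [
    {"identifier": "pkgA", "category": "data-access"},
]

registry = merge_by_id(package_metadata, registry_metadata)
```

Because both sets describe the same identifier, the result holds one entry for `pkgA` carrying the fields from both files.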

On the json-schema thing: I'm not sure I completely follow the goals here, but maybe a little background on the design goals of JSON Schema vs JSON-LD, and on how to make them interoperate, might be helpful.

The short version of this is that an application (like roregistry) which consumes JSON-LD (e.g. codemeta.json) should ideally use jsonld_compact() (and maybe jsonld_frame()) on the incoming file to get it in the desired format. This takes care of things like having different names for the same concept or wanting to exclude fields that are in the codemeta.json which you don't want to have in the roregistry. You can then verify that the compacted JSON is valid (e.g., it is not missing any "required" fields and has not mistyped a boolean as a string) by applying JSON-schema.
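As a rough illustration of the validation step only, here is a hand-rolled check in plain Python. It stands in for a real JSON Schema validator (such as the jsonvalidate package mentioned earlier in the thread) and is not one; the `validate` helper, the `SCHEMA` layout, and the `on_cran` field are all hypothetical.

```python
def validate(record, schema):
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    # Every field listed as required must be present.
    for field in schema["required"]:
        if field not in record:
            problems.append(f"missing required field: {field}")
    # Fields that are present must have the declared Python type.
    for field, expected in schema["types"].items():
        if field in record and not isinstance(record[field], expected):
            problems.append(f"{field} should be {expected.__name__}")
    return problems


# Hypothetical schema for one registry entry
SCHEMA = {
    "required": ["name", "identifier"],
    "types": {"name": str, "on_cran": bool},
}

good = {"name": "pkgA", "identifier": "pkgA", "on_cran": True}
bad = {"name": "pkgA", "on_cran": "true"}  # boolean typed as a string

good_problems = validate(good, SCHEMA)
bad_problems = validate(bad, SCHEMA)
```

The second record fails on both counts named above: a required field is missing and a boolean arrived as a string.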

Okay, backing up a bit... often we write applications skipping all of this; we just assume the JSON we get is formatted in a way we can interpret, and we don't use json-schema or the json-ld operations (e.g. see just about every rOpenSci package that consumes JSON data from an API).

If the JSON data is being written "by hand" rather than by an API, it is of course helpful to make sure the fields have been defined correctly, no fields are missing and the data has the correct type. This is where JSON schema is super useful.

JSON-LD tries to solve a somewhat different problem, where the same "generic" JSON data is being consumed by different applications which may have different requirements for what fields they use (and also what they call those fields or how they are nested). If two different use cases are covered by two different applications (say, Zenodo vs roregistry), each may have its own JSON schema and codemeta.json might not be able to satisfy both -- e.g. some terms might be unique to roregistry, others to Zenodo; or the same term might be called "id" in Zenodo data but "identifier" in the registry etc. JSON-LD solves this by telling both app developers to write separate "Context" files. (Different nesting can be handled by the addition of a frame command, but that's less common). Okay, I'm not sure if that was at all useful or just a rant.
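A toy version of the "separate contexts" idea, in plain Python. Real JSON-LD contexts map short names to IRIs and are applied by a JSON-LD processor, so this is only a simplified sketch; the `apply_context` helper, the two context dictionaries, and the "title" name on the Zenodo side are assumptions for illustration.

```python
def apply_context(record, context):
    """Keep only the terms the context knows, renamed to the app's own names."""
    return {
        context[term]: value
        for term, value in record.items()
        if term in context
    }


# One shared, "generic" record (made-up values)
shared = {"identifier": "pkgA", "name": "pkgA", "license": "MIT"}

# Hypothetical per-application contexts: shared term -> local name
ZENODO_CONTEXT = {"identifier": "id", "name": "title"}
ROREGISTRY_CONTEXT = {
    "identifier": "identifier",
    "name": "name",
    "license": "license",
}

zenodo_view = apply_context(shared, ZENODO_CONTEXT)
roregistry_view = apply_context(shared, ROREGISTRY_CONTEXT)
```

Each application gets the same data under its own field names, and terms it does not list (here, `license` for the Zenodo view) are simply dropped.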

maelle commented 6 years ago

Super useful, thanks a ton! 😺

maelle commented 6 years ago

So now that things are clearer for me here is a summary and a list of the remaining questions I have before actually updating the registry. The registry transformation will include 1) a recycling and improvement of the old (current) registry json file; coupled with an update of the ropkgs package 2) regular updates when packages are added to the suite.

I think step 1) should include the validation I've started working on by transforming the old spec into a schema (which was easy). This way, we're sure of what the old registry brings.

Then I have several questions

maelle commented 6 years ago

Attaching my drawings.

[Images: codemeta, roregistry]

cboettig commented 6 years ago

How to define the wishlist/spec for the new registry? Should it be a schema or a context? It will probably not be used for validation but rather as a todo-list when writing scripts to populate the registry, and as documentation.

Hmm, good question, but probably a schema (or maybe just a 'template'; fwiw, json-ld frames can basically look like templates too).

How to update the new registry? I think it should be done by a script.

Agreed. We should be able to script this (draft here, could be much improved: https://ropensci.github.io/codemetar/articles/D-codemeta-parsing.html ) and run it by hand until we feel things are going smoothly; then we can set up a cron job for it.

Where should/does information live, and when it lives in several places which source should we use?

Good question! My instinct is that if it's metadata that can/should be maintained by the package maintainer, it should live in the most obvious / official spot in the package repo (e.g. DESCRIPTION, NEWS, etc). Additional data we create and maintain should probably just live in roregistry repo?
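One possible shape for that split, sketched in plain Python: read the maintainer-owned fields from DESCRIPTION, then layer the registry-owned extras underneath. The precedence rule (the package repo wins on conflict) is an assumption made for this sketch, not something decided in this thread, and `parse_description`, `build_entry`, and the sample values are hypothetical.

```python
def parse_description(text):
    """Parse "Key: value" lines; indented lines continue the previous field."""
    fields = {}
    key = None
    for line in text.splitlines():
        if line[:1] in (" ", "\t") and key:
            # Continuation line: append to the field currently being read.
            fields[key] += " " + line.strip()
        elif ":" in line:
            key, value = line.split(":", 1)
            fields[key] = value.strip()
    return fields


def build_entry(description_text, registry_extras):
    """Start from registry-maintained extras, then let DESCRIPTION override."""
    entry = dict(registry_extras)
    entry.update(parse_description(description_text))
    return entry


# Made-up DESCRIPTION content
description = """Package: pkgA
Title: An Example
    Package
Version: 1.0.0
"""

# Made-up registry-side data, including a stale Version
extras = {"category": "data-access", "Version": "0.9.0"}

entry = build_entry(description, extras)
```

Here the stale `Version` kept on the registry side is overridden by the value in DESCRIPTION, while `category`, which only the registry knows about, is kept.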