spec.json - definition of metadata for each pkg

sckott commented 8 years ago

started, working on now https://github.com/ropensci/roregistry/blob/master/spec.json

cboettig commented 7 years ago

Consider codemeta.json instead, https://codemeta.github.io/terms/, https://github.com/codemeta/codemetar

sckott commented 7 years ago

Right, can we add arbitrary terms though?

cboettig commented 7 years ago

I don't have an official word/test-case from CRAN about adding arbitrary terms in DESCRIPTION, but I'll try to find that out very soon. Of course roxygen/devtools add such terms but maybe they are special exceptions

maelle commented 6 years ago

I had forgotten about this issue and have written a schema.json in order to be able to use jsonvalidate. https://github.com/maelle/roregistry

If the schema/spec is instead a codemeta thing, even with arbitrary properties (e.g. contributer which is either community-contributed or staff-contributed), could the registry be validated? If so how?

maelle commented 6 years ago

Is the idea of using codemeta that all info contained in the registry should be in the DESCRIPTION of the packages, and then the registry would be created by using this info?

maelle commented 6 years ago

Sorry for commenting so many times instead of a single well-thought comment... Is this wrong:

we write a schema.json that includes all information we want in the registry, with their types. We validate the registry against it, independently from how the registry is updated.
In a while all rOpenSci packages will have a codemeta.json that will have to include information for the registry (and for/from DESCRIPTION). This codemeta.json will have to be compliant with the registry schema.json, but not only that? And in that case, how would the validation step for the codemeta.json of each package work?

sckott commented 6 years ago

Is the idea of using codemeta that all info contained in the registry should be in the DESCRIPTION of the packages

Not sure yet. I think we will likely add other information that does not reside in each individual package - e.g., adding URLs and such as needed (can't think of any examples right now).

cboettig commented 6 years ago

Is the idea of using codemeta that all info contained in the registry should be in the DESCRIPTION of the packages

Good question. Already codemeta.json includes some information it finds elsewhere: e.g. by reading CITATION files, README files, etc. The goal is simply to let the user write the data in the most logical / obvious spot and extract it, rather than telling a user to manually edit some metadata file which they will invariably forget to do or forget to update.

On our end, we might find it easier to maintain separate files, outside of the package repo, that have the additional information -- e.g. a JSON-LD file of affiliations of authors, or a list of categories we have added to a particular package, etc. If we are generating and maintaining that additional metadata, it might not make sense to keep that in the package repo at all, but rather in our own separate database. As long as it describes the same identifier (author id, package id), roregistry can ingest both sets and seamlessly merge them. This is the whole idea of the "Linked" in Linked Data.
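To illustrate the merge-by-identifier idea, here is a rough Python sketch using plain dictionaries. It is not real JSON-LD processing, and every record, field name, and identifier in it is invented for illustration; the only point is that two independently maintained sources can be combined as long as they agree on the identifier.

```python
# Hypothetical records: metadata extracted from package repos ...
package_metadata = [
    {"identifier": "pkgA", "name": "pkgA", "license": "MIT"},
    {"identifier": "pkgB", "name": "pkgB", "license": "GPL-3"},
]
# ... and extra metadata maintained separately by the registry (also hypothetical).
registry_extras = [
    {"identifier": "pkgA", "category": "data-access"},
    {"identifier": "pkgC", "category": "geospatial"},
]


def merge_by_identifier(*sources):
    """Combine records from several sources that share an 'identifier' key.

    When two sources define the same field for the same identifier,
    the later source overwrites the earlier one.
    """
    merged = {}
    for source in sources:
        for record in source:
            merged.setdefault(record["identifier"], {}).update(record)
    return merged


registry = merge_by_identifier(package_metadata, registry_extras)
print(registry["pkgA"])
```

In this sketch pkgA ends up with both its license and its category, while pkgB and pkgC keep only what their single source provided.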

On the json-schema thing: I'm not sure I completely follow the goals here, but a little background on the design goals of JSON schema vs JSON-LD, and on how to make them interoperate, might be helpful.

The short version of this is that an application (like roregistry) which consumes JSON-LD (e.g. codemeta.json) should ideally use jsonld_compact() (and maybe jsonld_frame()) on the incoming file to get it in the desired format. This takes care of things like having different names for the same concept or wanting to exclude fields that are in the codemeta.json which you don't want to have in the roregistry. You can then verify that the compacted JSON is valid (e.g., no missing "required" fields, no boolean mistyped as a string) by applying JSON schema.

Okay, backing up a bit... often we write applications skipping all of this; we just assume the JSON we get is formatted in a way we can interpret, and we don't use json-schema or the json-ld operations (e.g. see just about every rOpenSci package that consumes JSON data from an API).

If the JSON data is being written "by hand" rather than an API, it is of course helpful to make sure the fields have been defined correctly, no fields are missing and the data has the correct type. This is where JSON schema is super useful.
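As a rough picture of what that check buys you, here is a hand-rolled Python sketch that looks for missing required fields and wrong types. It is a stand-in for a real JSON schema validator (such as the jsonvalidate package mentioned above), not an implementation of JSON schema, and the field names in `SPEC` are hypothetical.

```python
# Hypothetical spec: field name -> (expected Python type, is the field required?)
SPEC = {
    "name": (str, True),
    "on_cran": (bool, True),
    "description": (str, False),
}


def check_entry(entry, spec=SPEC):
    """Return a list of problems found in one entry; an empty list means it passed."""
    problems = []
    for field, (expected_type, required) in spec.items():
        if field not in entry:
            if required:
                problems.append(f"missing required field: {field}")
            continue
        if not isinstance(entry[field], expected_type):
            problems.append(
                f"{field} should be {expected_type.__name__}, "
                f"got {type(entry[field]).__name__}"
            )
    return problems


good = {"name": "pkgA", "on_cran": True}
bad = {"name": "pkgB", "on_cran": "true"}  # boolean written as a string

print(check_entry(good))
print(check_entry(bad))
```

The second entry is the "boolean mistyped as a string" case: it is flagged, while the first entry passes even though the optional description is absent.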

JSON-LD tries to solve a somewhat different problem, where the same "generic" JSON data is being consumed by different applications which may have different requirements for what fields they use (and also what they call those fields or how they are nested). If two different use cases are covered by two different applications (say, Zenodo vs roregistry), each may have its own JSON schema and codemeta.json might not be able to satisfy both -- e.g. some terms might be unique to roregistry, others to Zenodo; or the same term might be called "id" in Zenodo data but "identifier" in the registry etc. JSON-LD solves this by telling both app developers to write separate "Context" files. (Different nesting can be handled by the addition of a frame command, but that's less common). Okay, I'm not sure if that was at all useful or just a rant.
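To make the "separate context files" idea a bit more concrete, here is a toy Python sketch. It only mimics the renaming part of compaction with a dictionary lookup; it is not the JSON-LD algorithm, and the two contexts (including the "title" name on the Zenodo side) are assumptions made up for the example. The shared record uses schema.org term IRIs as its keys.

```python
# Shared data, keyed by full term IRIs so every consumer agrees on the meaning.
record = {
    "http://schema.org/identifier": "pkgA",
    "http://schema.org/name": "pkgA",
}

# Hypothetical per-application "contexts": the app's short name -> full term.
zenodo_context = {
    "id": "http://schema.org/identifier",
    "title": "http://schema.org/name",
}
registry_context = {
    "identifier": "http://schema.org/identifier",
    "name": "http://schema.org/name",
}


def toy_compact(data, context):
    """Rename full terms to the short names a context defines.

    Terms the context does not mention keep their full name.
    """
    short_names = {iri: short for short, iri in context.items()}
    return {short_names.get(term, term): value for term, value in data.items()}


print(toy_compact(record, zenodo_context))
print(toy_compact(record, registry_context))
```

Both applications read the same record, and each sees it under its own field names, so neither has to change its schema to suit the other.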

maelle commented 6 years ago

Super useful, thanks a ton! 😺

maelle commented 6 years ago

So now that things are clearer for me here is a summary and a list of the remaining questions I have before actually updating the registry. The registry transformation will include 1) a recycling and improvement of the old (current) registry json file; coupled with an update of the ropkgs package 2) regular updates when packages are added to the suite.

I think step 1) should include the validation I've started working on by transforming the old spec into a schema (which was easy). This way, we're sure of what the old registry brings.

Then I have several questions

How to define the wishlist/spec for the new registry? Should it be a schema or a context? It will probably not be used for validation but rather as a todo-list when writing scripts to populate the registry, and as documentation.
How to update the new registry? I think it should be done by a script. Questions are technical details that can be figured out when actually writing it, e.g. using framing or not when digesting the codemeta.json; and also whether it should be updated by triggering the update by hand or automatically. I guess running a script by hand every once in a while makes sense.
Where should/does information live, and when it lives in several places which source should we use? In some cases we can actually influence where the info will be stored, e.g. development status by either maintaining our own list or by encouraging the use of status badges in README. Other examples are review information (in DESCRIPTION? or using our own data?) and categories (our own table/JSON-LD or codemeta keywords in DESCRIPTION). The discussion for each of the properties can happen in their dedicated issues here I guess, e.g. this one for categories.

maelle commented 6 years ago

Attaching my drawings.

[Attached images: codemeta, roregistry]

cboettig commented 6 years ago

How to define the wishlist/spec for the new registry? Should it be a schema or a context? It will probably not be used for validation but rather as a todo-list when writing scripts to populate the registry, and as documentation.

Hmm, good question, but probably a schema (or maybe just a 'template'; fwiw, JSON-LD frames can basically look like templates too).

How to update the new registry? I think it should be done by a script.

Agree. We should be able to script this (draft here, could be much improved: https://ropensci.github.io/codemetar/articles/D-codemeta-parsing.html) and run it by hand until we feel things are going smoothly; then we can set up a cron job for it.

Where should/does information live, and when it lives in several places which source should we use?

Good question! My instinct is that if it's metadata that can/should be maintained by the package maintainer, it should live in the most obvious / official spot in the package repo (e.g. DESCRIPTION, NEWS, etc). Additional data we create and maintain should probably just live in the roregistry repo?

spec.json - definition of metadata for each pkg #5