Elements of a non-datalad dataset schema

jsheunis commented 4 months ago

What is this about?

We have recently discussed two main goals or next steps given the current state of this effort:

1. develop a schema to “export/serialize” any datalad dataset (for archiving, for reporting, for later recreation). This has been taking shape with the "components schema", see https://github.com/psychoinformatics-de/datalad-concepts/pull/44
2. develop a schema to describe any non-datalad dataset (with the same ontology concepts), as a basis for application-specific schemas (e.g. SFB1451, ABCD-J, INM-7), and for auto-converting to DataLad datasets.

This issue serves as a starting point to define the elements of goal 2. Separate efforts have been made previously that could be considered related:

The schemas developed for datalad-catalog, see: https://github.com/datalad/datalad-catalog/tree/main/datalad_catalog/catalog/schema (this one is in jsonschema, while the rest are linkml)
The SFB1451 schema draft by @mslw, see https://github.com/sfb1451/crc-schema-draft/blob/main/src/sfb1451_schema.yaml
The ABCD-J schema (basically a copy of the SFB one), see https://github.com/abcd-j/schema/blob/main/src/abcdj_schema.yaml
The attempt to merge the two above into a more general "research dataset schema", see https://github.com/psychoinformatics-de/datalad-concepts/pull/12

None of these schemas take the recent developments of ontology concepts and schemas, nor the need to deal with LinkML limitations, into account.

So, next steps?

We need to somehow narrow the schema down to a set of elements that we agree should be represented in such a "non-datalad dataset", and we should then somehow define such elements with reference to existing (or to be developed) ontology concepts.

I don't know what would be a good way to collaborate on this. My intuition is to just take elements from the existing efforts and group them into some conceptual hierarchy (not necessarily a linkml schema yet), and to start discussing that.

I'll start with the following very simplistic approach:

A general dataset and its properties

Consider a single version of a dataset, published in and retrievable from a single public location. Such a dataset could have these properties:

Property	Single/Many	Comment
`identifier`	single	likely something unique within the context of where the dataset is hosted (although this could mean the identifier is not a property of the dataset, rather only helpful within the context of something like a data catalog)
`name`	single	shorthand for referring to the dataset or alternatively identifying it, distinct from `title`
`title`	single	for human consumption
`url`	single	main point of access for the dataset; could be alias of something like a `homepage`
`doi`	single	persistent digital object identifier; distinct from `url` which could change
`description`	single	free text to describe the dataset for human understanding
`keyword`	many	keywords that describe the dataset and help with findability
`date_published`	single	the date on which the dataset (or its metadata) was published
`date_modified`	single	the last modified date of the represented version of the dataset
`author`	many	persons or organizations that helped create/author the dataset; likely related to our existing provenance concepts e.g. `was_generated_by`
`data-controller`	many	persons or organizations in control of the data; can be contacted for access/information requests
`publication`	many	any publication, in any format, that references the dataset in some way or that the dataset has some definable relationship with
`funding`	many	any monetary grant or form of support that played a role in the dataset coming into existence
`part`	many	any entity that forms a logical part of the dataset; such as a `file` or another `dataset` (some discussion about the concept of a file here: https://github.com/psychoinformatics-de/datalad-concepts/issues/14)
`provenance_activity`	many	any and all activities that were conducted by some agent (person or organization or code) that led to the generation of a certain state or part of the dataset
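To make the table easier to discuss, here is a minimal Python sketch of such a record. It is only an illustration, not a proposed schema: the class name `Dataset`, the choice of plain strings for all values, the choice of which fields are required, and the example values are assumptions made here. The single/many column is mirrored as scalar versus list fields.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Dataset:
    """One version of a dataset, retrievable from one public location."""

    # "single" properties; which ones are required is an assumption
    identifier: str
    name: str
    title: str
    url: Optional[str] = None
    doi: Optional[str] = None
    description: Optional[str] = None
    date_published: Optional[str] = None  # ISO 8601 string (assumption)
    date_modified: Optional[str] = None   # ISO 8601 string (assumption)

    # "many" properties; plain strings stand in for richer objects
    keyword: List[str] = field(default_factory=list)
    author: List[str] = field(default_factory=list)
    data_controller: List[str] = field(default_factory=list)  # `data-controller`
    publication: List[str] = field(default_factory=list)
    funding: List[str] = field(default_factory=list)
    part: List[str] = field(default_factory=list)
    provenance_activity: List[str] = field(default_factory=list)


# Hypothetical example values
example = Dataset(
    identifier="ds-001",
    name="example-dataset",
    title="An example dataset",
    keyword=["example", "metadata"],
    author=["Jane Doe"],
)
```

In a real schema, `author`, `part` and `provenance_activity` would presumably reference other classes rather than strings.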

That is all I can muster now. Although, some additional notes came to mind while summarising this:

The `provenance_activity` (or whatever an improved term for that may be) could be a nice catch-all for several of the other properties. I'm thinking specifically of `author`, `funding`, and any properties associated with a date (`date_published` or `date_modified`, for example). All of these constitute some activity that was done at some time by some agent on some entity (of the dataset) and that led to some new state. See the provenance concepts in datalad-concepts for reference: https://github.com/psychoinformatics-de/datalad-concepts/blob/main/src/linkml/ontology/provenance.yaml.
I somewhat intentionally ignored a dataset version, because I am still unsure about whether we want to represent multi-versioned datasets or not in this general schema that we are aiming for.
The relationship between `name` and a dataset `identifier` is still undetermined for me.
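As a rough illustration of the catch-all idea in the first note, the sketch below derives `author` and the date properties from a list of activities instead of storing them separately. Everything in it is hypothetical: the `Activity` class, its field names, and the activity kinds ("create", "modify", "publish") are assumptions for this example and are not taken from the provenance concepts in datalad-concepts.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Activity:
    kind: str     # hypothetical vocabulary: "create", "modify", "publish"
    agent: str    # person, organization, or code that acted
    entity: str   # the dataset or one of its parts
    at_time: str  # ISO 8601 date, so string comparison orders by time


def authors(activities: List[Activity]) -> List[str]:
    """Agents of create/modify activities, in first-seen order."""
    seen: List[str] = []
    for activity in activities:
        if activity.kind in ("create", "modify") and activity.agent not in seen:
            seen.append(activity.agent)
    return seen


def latest_date(activities: List[Activity], kind: str) -> Optional[str]:
    """Most recent date among activities of the given kind, if any."""
    dates = [a.at_time for a in activities if a.kind == kind]
    return max(dates) if dates else None


# Hypothetical activity log
activities = [
    Activity("create", "Jane Doe", "dataset", "2023-01-10"),
    Activity("modify", "John Roe", "file-a", "2023-03-02"),
    Activity("modify", "Jane Doe", "file-b", "2023-02-01"),
    Activity("publish", "Jane Doe", "dataset", "2023-03-05"),
]

derived = {
    "author": authors(activities),
    "date_modified": latest_date(activities, "modify"),
    "date_published": latest_date(activities, "publish"),
}
```

With this framing the dedicated properties become views on the activity list, at the cost of requiring every record to carry activities.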

mih commented 4 months ago

Thanks for writing this up.

My high-level comments are:

> I somewhat intentionally ignored a dataset version, because I am still unsure about whether we want to represent multi-versioned datasets

I don't think this is about multi-version support. Rather, any dataset we describe would effectively be a concrete version of a dataset, because the description would cover the full content, which is only meaningfully defined for a single version.

On top of that, and for everything else, I'd say there is no known alternative to sticking closely to the Data Catalog Vocabulary (DCAT; version 3). It is the de facto standard, widely adopted, and also the conceptual basis for https://schema.org/Dataset

This vocabulary conveniently dictates how essential properties must be framed. I started the modeling in linkml in https://github.com/psychoinformatics-de/datalad-concepts/blob/main/src/linkml/ontology/datasets.yaml
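To make the DCAT framing concrete, here is a small sketch that renames some of the properties listed above to DCAT 3 / Dublin Core terms. The pairing is an assumption made for illustration and is not taken from `datasets.yaml`; properties without an obvious counterpart (for example `funding` or `data-controller`) are deliberately left unmapped and reported separately.

```python
from typing import Any, Dict, List, Tuple

# Assumed pairing of the proposed property names with DCAT 3 / DCTERMS terms
DCAT_TERMS: Dict[str, str] = {
    "identifier": "dcterms:identifier",
    "title": "dcterms:title",
    "description": "dcterms:description",
    "keyword": "dcat:keyword",
    "date_published": "dcterms:issued",
    "date_modified": "dcterms:modified",
    "author": "dcterms:creator",
    "url": "dcat:landingPage",
    "part": "dcterms:hasPart",
}


def to_dcat(record: Dict[str, Any]) -> Tuple[Dict[str, Any], List[str]]:
    """Rename known keys to DCAT terms; return the unmapped keys as well."""
    mapped = {DCAT_TERMS[k]: v for k, v in record.items() if k in DCAT_TERMS}
    unmapped = sorted(k for k in record if k not in DCAT_TERMS)
    return mapped, unmapped


# Hypothetical record
mapped, unmapped = to_dcat(
    {
        "title": "An example dataset",
        "keyword": ["example"],
        "date_published": "2023-03-05",
        "funding": ["Grant X"],
    }
)
```

The unmapped list is where terms from other vocabularies, or new concepts, would be needed.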

I would caution against the "normative" declarations in the second column above. Except for the `date_*` fields (DCAT equivalent https://www.w3.org/TR/vocab-dcat-3/#Property:distribution_release_date) it seems a needless restriction to refuse multi-value specifications. It may be uncommon to have more than one title, but it is certainly not impossible for a single dataset to be described slightly differently in two systems. An aggregate record (which would be useful to capture here) would need to be able to carry them all.
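A minimal sketch of what such an aggregate record could look like, assuming each source system delivers a plain dictionary and every property is treated as multi-valued. The function name `aggregate` and the sample records are hypothetical.

```python
from typing import Any, Dict, List


def aggregate(records: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Merge records of one dataset, keeping every distinct value per property."""
    merged: Dict[str, List[Any]] = {}
    for record in records:
        for key, value in record.items():
            values = value if isinstance(value, list) else [value]
            bucket = merged.setdefault(key, [])
            for item in values:
                if item not in bucket:  # keep first-seen order, drop duplicates
                    bucket.append(item)
    return merged


# The same dataset as described by two hypothetical systems
record_a = {"title": "Study A data", "keyword": ["mri"]}
record_b = {"title": "Study A: raw data", "keyword": ["mri", "fmri"]}

combined = aggregate([record_a, record_b])
```

Here `combined["title"]` carries both titles, which a single-valued `title` could not represent.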

mih commented 3 months ago

I think we can close this now. The current schema has all that was listed here, and a lot more.
