psychoinformatics-de / datalad-concepts


Elements of a non-datalad dataset schema #46

Closed jsheunis closed 3 months ago

jsheunis commented 4 months ago

What is this about?

We have recently discussed two main goals or next steps given the current state of this effort:

  1. develop a schema to “export/serialize” any datalad dataset (for archiving, for reporting, for later recreation). This has been taking shape with the "components schema", see https://github.com/psychoinformatics-de/datalad-concepts/pull/44
  2. develop a schema to describe any non-datalad dataset (with the same ontology concepts), as a basis for application-specific schemas (e.g. SFB1451, ABCD-J, INM-7), and for auto-conversion to DataLad datasets.

This issue serves as a starting point to define the elements of goal 2. Separate efforts have been made previously that could be considered related:

None of these schemas take the recent developments of ontology concepts and schemas, nor the need to deal with LinkML limitations, into account.

So, next steps?

We need to somehow narrow the schema down to a set of elements that we agree should be represented in such a "non-datalad dataset", and we should then somehow define such elements with reference to existing (or to be developed) ontology concepts.

I don't know what would be a good way to collaborate on this. My intuition is to just take elements from the existing efforts and group them into some conceptual hierarchy (not necessarily a linkml schema yet), and to start discussing that.

I'll start with the following very simplistic approach:

A general dataset and its properties

Consider a single version of a dataset, published in and retrievable from a single public location. Such a dataset could have these properties:

| Property | Single/Many | Comment |
| --- | --- | --- |
| identifier | single | likely something unique within the context of where the dataset is hosted (although this could mean the identifier is not a property of the dataset, rather only helpful within the context of something like a data catalog) |
| name | single | shorthand for referring to the dataset or alternatively identifying it, distinct from title |
| title | single | for human consumption |
| url | single | main point of access for the dataset; could be alias of something like a homepage |
| doi | single | persistent digital object identifier; distinct from url which could change |
| description | single | free text to describe the dataset for human understanding |
| keyword | many | keywords that describe the dataset and help with findability |
| date_published | single | the date on which the dataset (or its metadata) was published |
| date_modified | single | the last modified date of the represented version of the dataset |
| author | many | persons or organizations that helped create/author the dataset; likely related to our existing provenance concepts e.g. was_generated_by |
| data-controller | many | persons or organizations in control of the data; can be contacted for access/information requests |
| publication | many | any publication existing in any of various formats that reference the dataset in some way or that the dataset has some definable relationship with |
| funding | many | any monetary grant or form of support that played a role in the dataset coming into existence |
| part | many | any entity that forms a logical part of the dataset; such as a file or another dataset (some discussion about the concept of a file here: https://github.com/psychoinformatics-de/datalad-concepts/issues/14) |
| provenance_activity | many | any and all activities that were conducted by some agent (person or organization or code) that led to the generation of a certain state or part of the dataset |
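
To make the proposal easier to discuss, the property list can be sketched as a plain Python data structure. This is only an illustration under assumptions, not the project's schema: the class names `Dataset` and `Agent`, the use of strings for dates, publications, funding, parts, and activities, and the spelling `data_controller` (Python identifiers cannot contain a hyphen) are all hypothetical choices.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Agent:
    """A person or organization (hypothetical helper type)."""
    name: str


@dataclass
class Dataset:
    """One version of a dataset, retrievable from one location."""
    # Properties marked "single" in the list above.
    identifier: Optional[str] = None
    name: Optional[str] = None
    title: Optional[str] = None
    url: Optional[str] = None
    doi: Optional[str] = None
    description: Optional[str] = None
    date_published: Optional[str] = None  # assumed: ISO 8601 date string
    date_modified: Optional[str] = None   # assumed: ISO 8601 date string
    # Properties marked "many" in the list above.
    keyword: list[str] = field(default_factory=list)
    author: list[Agent] = field(default_factory=list)
    data_controller: list[Agent] = field(default_factory=list)
    publication: list[str] = field(default_factory=list)
    funding: list[str] = field(default_factory=list)
    part: list[str] = field(default_factory=list)
    provenance_activity: list[str] = field(default_factory=list)


# A made-up record, purely for illustration.
example = Dataset(
    name="demo",
    title="Demo dataset",
    keyword=["neuroimaging", "demo"],
    author=[Agent(name="Jane Doe")],
)
```

Every field is optional here because the list above does not yet say which properties are required.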

That is all I can muster now. Although, some additional notes came to mind while summarising this:

mih commented 4 months ago

Thanks for writing this up.

My high-level comments are:

> I somewhat intentionally ignored a dataset version, because I am still unsure about whether we want to represent multi-versioned datasets

I think this is not about multi-version. However, any dataset we would be describing would effectively be a concrete version of a dataset, because it would describe the full content, which is only meaningfully defined for a single version.

On top of that, and for everything else, I'd say there is no (known) alternative to sticking closely to the Data Catalog Vocabulary (DCAT; version 3). This is the de facto standard. It is widely adopted, and also the conceptual basis for https://schema.org/Dataset

This vocabulary conveniently dictates how essential properties must be framed. I started the modeling in linkml in https://github.com/psychoinformatics-de/datalad-concepts/blob/main/src/linkml/ontology/datasets.yaml

I would caution against the "normative" declarations in the second column above. Except for the date_* fields (DCAT equivalent https://www.w3.org/TR/vocab-dcat-3/#Property:distribution_release_date) it seems a needless restriction to refuse multi-value specifications. It may be uncommon to have more than one title, but it is certainly not impossible for a single dataset to be described slightly differently in two systems. An aggregate record (which would be useful to capture here) would need to be able to carry them all.
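
The point about aggregate records can be illustrated with a small sketch. It assumes records are plain dictionaries and that an aggregate keeps every distinct value per property; the function name `aggregate` and the two sample records are hypothetical and not part of any existing schema or tool.

```python
from collections import defaultdict


def aggregate(records):
    """Merge records about one dataset, keeping every distinct value."""
    merged = defaultdict(list)
    for record in records:
        for prop, value in record.items():
            # Treat a scalar as a one-element list so all properties
            # are handled uniformly as multi-valued.
            values = value if isinstance(value, list) else [value]
            for v in values:
                if v not in merged[prop]:
                    merged[prop].append(v)
    return dict(merged)


# Two made-up descriptions of the same dataset from different systems.
record_a = {"title": "Demo dataset", "keyword": ["mri"]}
record_b = {"title": "Demo data set (v1)", "keyword": ["mri", "fmri"]}

combined = aggregate([record_a, record_b])
```

With a single-valued `title`, one of the two titles would have to be dropped during the merge; treating it as multi-valued lets the aggregate carry both.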

mih commented 3 months ago

I think we can close this now. The current schema has all that was listed here, and a lot more.