Closed jsheunis closed 3 months ago
Thanks for writing this up.
My high-level comments are:
I somewhat intentionally ignored a dataset version, because I am still unsure about whether we want to represent multi-versioned datasets
I think this is not about multi-version. However, any dataset we describe would effectively be a concrete version of a dataset, because the record would describe the full content, which is only meaningfully defined for a single version.
On top of that, and for everything else, I'd say there is no known alternative to sticking closely to the Data Catalog Vocabulary (DCAT; version 3). It is the de facto standard, it is widely adopted, and it is also the conceptual basis for https://schema.org/Dataset
This vocabulary conveniently dictates how essential properties must be framed. I started the modeling in linkml in https://github.com/psychoinformatics-de/datalad-concepts/blob/main/src/linkml/ontology/datasets.yaml
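As a rough illustration of what "framing properties in DCAT terms" could look like, here is a minimal Python sketch. The pairing of property names to terms is an assumption made for this example and is not the content of datasets.yaml; `DCAT_TERMS`, `expand`, and `to_dcat` are hypothetical names. The terms themselves (dcterms:title, dcat:landingPage, prov:wasGeneratedBy, and so on) are ones that DCAT 3 uses for datasets.

```python
# Hypothetical mapping from property names discussed in this issue to
# DCAT 3 terms (as CURIEs). The pairing is an assumption for illustration.
DCAT_TERMS = {
    "identifier": "dcterms:identifier",
    "title": "dcterms:title",
    "description": "dcterms:description",
    "homepage": "dcat:landingPage",
    "was_generated_by": "prov:wasGeneratedBy",
}

# Namespace IRIs for the prefixes used above.
PREFIXES = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dcterms": "http://purl.org/dc/terms/",
    "prov": "http://www.w3.org/ns/prov#",
}


def expand(curie: str) -> str:
    """Expand a CURIE such as 'dcterms:title' into a full IRI."""
    prefix, local = curie.split(":", 1)
    return PREFIXES[prefix] + local


def to_dcat(record: dict) -> dict:
    """Re-key a plain record by DCAT term IRIs; an unmapped key raises KeyError."""
    return {expand(DCAT_TERMS[key]): value for key, value in record.items()}
```

Raising on unmapped keys is a deliberate choice in this sketch: it surfaces any property that has no agreed DCAT equivalent yet, rather than dropping it silently.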
I would caution against the "normative" declarations in the second column above. Except for the date_* fields (DCAT equivalent https://www.w3.org/TR/vocab-dcat-3/#Property:distribution_release_date) it seems a needless restriction to refuse multi-value specifications. It may be uncommon to have more than one title, but it is certainly not impossible for a single dataset to be described slightly differently in two systems. An aggregate record (which would be useful to capture here) would need to be able to carry them all.
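To make the multi-value point concrete, here is a minimal Python sketch. It is not taken from any of the schemas under discussion: `DatasetRecord`, its field names, and `merge_records` are hypothetical. The only assumption carried over from the comment above is that descriptive fields such as title may hold several values, while a date field stays single-valued.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DatasetRecord:
    """Hypothetical record describing one concrete version of a dataset."""
    identifier: str
    # Multi-valued: two systems may describe the same dataset differently.
    title: List[str] = field(default_factory=list)
    description: List[str] = field(default_factory=list)
    # Single-valued: illustrative stand-in for one of the date_* fields.
    date_published: Optional[str] = None


def merge_records(a: DatasetRecord, b: DatasetRecord) -> DatasetRecord:
    """Build an aggregate record carrying every distinct value of both inputs."""
    if a.identifier != b.identifier:
        raise ValueError("records describe different datasets")

    def union(xs: List[str], ys: List[str]) -> List[str]:
        # dict.fromkeys drops duplicates while keeping first-seen order
        return list(dict.fromkeys(xs + ys))

    return DatasetRecord(
        identifier=a.identifier,
        title=union(a.title, b.title),
        description=union(a.description, b.description),
        # Assumption for this sketch: the first record's date wins if it is set.
        date_published=a.date_published or b.date_published,
    )
```

With single-valued titles, the merge would have to discard one of the two descriptions; with lists, the aggregate simply keeps both.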
I think we can close this now. The current schema has all that was listed here, and a lot more.
What is this about?
We have recently discussed two main goals or next steps given the current state of this effort:
This issue serves as a starting point to define the elements of goal 2. Separate efforts have been made previously that could be considered related:
datalad-catalog, see: https://github.com/datalad/datalad-catalog/tree/main/datalad_catalog/catalog/schema (this one is in jsonschema, while the rest are linkml)
None of these schemas take the recent developments of ontology concepts and schemas, nor the need to deal with LinkML limitations, into account.
So, next steps?
We need to narrow the schema down to a set of elements that we agree should be represented in such a "non-datalad dataset", and then define those elements with reference to existing (or to be developed) ontology concepts.
I don't know what would be a good way to collaborate on this. My intuition is to just take elements from the existing efforts and group them into some conceptual hierarchy (not necessarily a linkml schema yet), and to start discussing that.
I'll start with the following very simplistic approach:
A general dataset and its properties
Consider a single version of a dataset, published in and retrievable from a single public location. Such a dataset could have these properties:
identifier
name
title
title
url
homepage
doi
url which could change
description
was_generated_by
data-controller
part
file or another dataset (some discussion about the concept of a file here: https://github.com/psychoinformatics-de/datalad-concepts/issues/14)
That is all I can muster now. Although, some additional notes came to mind while summarising this:
provenance_activity (or whatever an improved term for that may be) could be a nice catch-all for several of the other properties. I'm thinking specifically of author, funding, and any properties associated with a date (date published or modified, for example). All of these constitute some activity that was done at some time by some agent on some entity (of the dataset) and that led to some new state. See the provenance concepts in datalad-concepts for reference: https://github.com/psychoinformatics-de/datalad-concepts/blob/main/src/linkml/ontology/provenance.yaml
name and a dataset identifier is still undetermined for me.
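The provenance_activity note above could be sketched roughly as follows. This is only an illustration of the "activity by an agent on an entity at some time" idea; the `Activity` class, its `kind` values, and `derive_properties` are hypothetical and do not reflect what provenance.yaml defines.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Activity:
    """Hypothetical activity: an agent did something to an entity at some time."""
    kind: str    # assumed vocabulary: "authoring", "funding", "publication"
    agent: str   # who performed the activity
    entity: str  # which dataset (or part of it) was affected
    ended_at: Optional[str] = None  # ISO 8601 date string, if known


def derive_properties(activities: List[Activity]) -> dict:
    """Derive flat properties (author, funding, date_published) from activities."""
    props = {"author": [], "funding": [], "date_published": None}
    for act in activities:
        if act.kind == "authoring":
            props["author"].append(act.agent)
        elif act.kind == "funding":
            props["funding"].append(act.agent)
        elif act.kind == "publication":
            props["date_published"] = act.ended_at
    return props
```

In this framing, author, funding, and date properties are not stored separately; they are views computed from the list of activities, which is what makes the activity a catch-all.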