Map the HCLS community profile description onto a datalad dataset

jsheunis commented 11 months ago

Initial comments by @mih:

The HCLS dataset description specification appears to provide a range of required concepts that are missing in RO-crates. Importantly, this includes:

Distinction of summary-level, version-level, and distribution-level entities that could be mapped onto evolving datasets with multiple versions that are deposited in multiple locations.

The HCLS specification appears to be more geared towards a-file-as-a-dataset use cases. However, the aforementioned distinctions could be combined with the single-version-single-location use case of RO-crates to create a more appropriate, more comprehensive description of DataLad datasets.

jsheunis commented 11 months ago

Noting down what I see as possibly important elements from the HCLS specification, mainly taking properties that the specification defines as MUST and SHOULD for the three levels of description:

Dataset summary

Element	Property	Value	Requirement
Type declaration	rdf:type	dctypes:Dataset	MUST
Title	dct:title	rdf:langString	MUST
Description	dct:description	rdf:langString	MUST
Publisher	dct:publisher	IRI	MUST
HTML page	foaf:page	IRI	SHOULD
Logo	schemaorg:logo	IRI	SHOULD
Update frequency	dct:accrualPeriodicity	IRI of type dctypes:Frequency	SHOULD
SPARQL endpoint	void:sparqlEndpoint	IRI	SHOULD

Dataset version

Element	Property	Value	Requirement
Type declaration	rdf:type	dctypes:Dataset	MUST
Title	dct:title	rdf:langString	MUST
Description	dct:description	rdf:langString	MUST
Date created	dct:created	rdfs:Literal encoded using the relevant ISO 8601 Date and Time compliant string and typed using the appropriate XML Schema datatype	SHOULD
Creators	dct:creator	IRI	MUST
Publisher	dct:publisher	IRI	MUST
Date of issue	dct:issued	rdfs:Literal encoded using the relevant ISO 8601 Date and Time compliant string and typed using the appropriate XML Schema datatype	SHOULD
HTML page	foaf:page	IRI	SHOULD
Logo	schemaorg:logo	IRI	SHOULD
License	dct:license	IRI	SHOULD
Language	dct:language	http://lexvo.org/id/iso639-3/{tag}	SHOULD
Version identifier	pav:version	xsd:string	MUST
Version linking	dct:isVersionOf	IRI	MUST
Version linking	pav:previousVersion	IRI	SHOULD
Data source provenance	dct:source or pav:retrievedFrom or prov:wasDerivedFrom	IRI	SHOULD
Creation tool	pav:createdWith	IRI	SHOULD
Distribution description	dcat:distribution	IRI of Distribution Level description	SHOULD

Dataset distribution

Element	Property	Value	Requirement
Type declaration	rdf:type	dctypes:Dataset	SHOULD
Type declaration	rdf:type	void:Dataset or dcat:Distribution	MUST
Title	dct:title	rdf:langString	MUST
Description	dct:description	rdf:langString	MUST
Date created	dct:created	rdfs:Literal encoded using the relevant ISO 8601 Date and Time compliant string and typed using the appropriate XML Schema datatype	SHOULD
Creators	dct:creator	IRI	MUST
Publisher	dct:publisher	IRI	MUST
Date of issue	dct:issued	rdfs:Literal encoded using the relevant ISO 8601 Date and Time compliant string and typed using the appropriate XML Schema datatype	SHOULD
HTML page	foaf:page	IRI	SHOULD
Logo	schemaorg:logo	IRI	SHOULD
License	dct:license	IRI	MUST
Language	dct:language	http://lexvo.org/id/iso639-3/{tag}	SHOULD
Vocabulary used	void:vocabulary	IRI	SHOULD
Standards used	dct:conformsTo	IRI	SHOULD
Example identifier	idot:exampleIdentifier	xsd:string	SHOULD
Example resource	void:exampleResource	IRI	SHOULD
Version identifier	pav:version	xsd:string	SHOULD
Version linking	pav:previousVersion	IRI	SHOULD
Data source provenance	dct:source or pav:retrievedFrom or prov:wasDerivedFrom	IRI	SHOULD
Creation tool	pav:createdWith	IRI	SHOULD
File format	dct:format	IRI or xsd:string	MUST
File URL	dcat:downloadURL	IRI	SHOULD
Byte size	dcat:byteSize	xsd:decimal	SHOULD
RDF File URL	void:dataDump	IRI	SHOULD
Linkset	void:subset	IRI	SHOULD
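
As a rough illustration of how the MUST rows in the three tables above could be used, here is a minimal Python sketch that reports which MUST properties a metadata record is missing at a given level. The property names are copied from the tables; the function name, the level keys, and the example record are hypothetical and not part of the HCLS specification or of DataLad.

```python
# MUST properties per description level, copied from the tables above.
# The level keys ("summary", "version", "distribution") are naming assumptions.
MUST_PROPERTIES = {
    "summary": {"rdf:type", "dct:title", "dct:description", "dct:publisher"},
    "version": {
        "rdf:type", "dct:title", "dct:description", "dct:creator",
        "dct:publisher", "pav:version", "dct:isVersionOf",
    },
    "distribution": {
        "rdf:type", "dct:title", "dct:description", "dct:creator",
        "dct:publisher", "dct:license", "dct:format",
    },
}


def missing_must_properties(level, record):
    """Return the sorted MUST properties of `level` that `record` lacks or leaves empty."""
    return sorted(p for p in MUST_PROPERTIES[level] if not record.get(p))


# Hypothetical version-level record, for illustration only.
example_version = {
    "rdf:type": "dctypes:Dataset",
    "dct:title": "Example dataset, version 1.0",
    "dct:description": "Hypothetical record for illustration only",
    "dct:publisher": "https://example.org/publisher",
    "pav:version": "1.0",
}

print(missing_must_properties("version", example_version))
# ['dct:creator', 'dct:isVersionOf']
```

A check like this only covers presence; it does not validate value types such as IRI or rdf:langString.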

jsheunis commented 11 months ago

I used some of the above properties (most of the MUST ones, some of the SHOULD ones), and the following schema is what I have come up with so far. I'd like to put this up for discussion (would it be better to do so in a dedicated PR?).

Main classes: Dataset, DatasetVersion, DatasetDistribution
Dataset identifies the datalad dataset and contains mostly unchanging metadata
DatasetVersion --> versionOf --> Dataset
DatasetVersion --> previousVersion --> DatasetVersion
DatasetVersion --> hasPart --> File (many)
DatasetVersion --> hasDistribution --> DatasetDistribution (many)
DatasetDistribution would contain information about a remote where the dataset's content would be accessible from

Note: I focused only on finding a sensible way to structure the HCLS concepts into a schema that could be the starting point for us, and haven't yet spent time on mapping our git/datalad concepts and identifiers onto this schema.

[UPDATED]

---
id: https://w3id.org/psychoinformatics-de/datalad-schema
name: datalad-schema
title: datalad-schema
description: |-
  DataLad dataset schema
license: MIT
see_also:
  - https://psychoinformatics-de.github.io/datalad-schema
prefixes:
  afo: http://purl.allotrope.org/ontologies/result#
  datalad_schema: https://w3id.org/psychoinformatics-de/datalad-schema/
  dcat: http://www.w3.org/ns/dcat#
  dct: http://purl.org/dc/terms/
  dctypes: http://purl.org/dc/dcmitype/
  linkml: https://w3id.org/linkml/
  obo: https://purl.obolibrary.org/obo/
  ORCID: https://orcid.org/
  pav: http://purl.org/pav/
  rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
  schema: http://schema.org/
  spdx: https://spdx.org/licenses/
  xsd: http://www.w3.org/2001/XMLSchema#
default_prefix: datalad_schema
default_range: string

# imports
imports:
  - linkml:types

# classes are the main organization unit for data;
# all data records instantiate a class
classes:

  Dataset:
    class_uri: dctypes:Dataset
    slots:
      - description
      - identifier
      - publisher
      - title
      - type

  DatasetVersion:
    class_uri: dctypes:Dataset
    slots:
      - creator
      - dateCreated
      - description
      - hasDistribution
      - hasPart
      - previousVersion
      - publisher
      - title
      - type
      - version
      - versionOf

  DatasetDistribution:
    class_uri: dctypes:Dataset
    slots:
      - creator
      - dateCreated
      - description
      - format
      - license
      - publisher
      - title
      - type
    slot_usage:
      type:
        range: dcat:Distribution

  File:
    class_uri: schema:DigitalDocument
    slots:
      - checksum_md5
      - path_posix
      - size_in_bytes
      - url

  Person:
    class_uri: schema:Person
    slots:
      - email
      - name
    slot_usage:
      email:
        pattern: "^\\S+@[\\S+\\.]+\\S+"

# slots are first-class entities in the metamodel
# declaring them here allows them to be reused elsewhere
slots:
  checksum_md5:
    slot_uri: obo:NCIT_C171276
  creator:
    slot_uri: dct:creator
    multivalued: true
    inlined_as_list: true
    range: Person
  dateCreated:
    slot_uri: dct:created
  description:
    slot_uri: dct:description
    range: rdf:langString
  email:
    slot_uri: schema:email
  format:
    slot_uri: dct:format
  hasPart:
    slot_uri: schema:hasPart
    multivalued: true
    inlined_as_list: true
    range: File
  hasDistribution:
    slot_uri: dcat:distribution
    multivalued: true
    inlined_as_list: true
    range: DatasetDistribution
  identifier:
    identifier: true
    slot_uri: schema:identifier
  license:
    slot_uri: dct:license
  name:
    slot_uri: schema:name
  path_posix:
    slot_uri: afo:AFR_0001928
  previousVersion:
    slot_uri: pav:previousVersion
    range: DatasetVersion
  publisher:
    slot_uri: dct:publisher
  size_in_bytes:
    slot_uri: schema:contentSize
    range: integer
    unit:
      ucum_code: byte
  title:
    slot_uri: dct:title
    range: rdf:langString
  type:
    slot_uri: rdf:type
    range: dctypes:Dataset
  url:
    slot_uri: schema:contentUrl
  version:
    slot_uri: pav:version
    range: xsd:string
    identifier: true
  versionOf:
    slot_uri: dct:isVersionOf
    range: Dataset
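
To make the intended relationships more tangible, here is a minimal sketch of instance data for the three main classes, assuming plain Python dataclasses as a stand-in for whatever classes would eventually be generated from the schema. Only a subset of the slots is modelled, and all identifiers, titles, and values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Dataset:
    identifier: str
    title: str


@dataclass
class DatasetDistribution:
    title: str
    format: str
    license: str


@dataclass
class DatasetVersion:
    version: str
    versionOf: Dataset
    previousVersion: Optional["DatasetVersion"] = None
    hasDistribution: List[DatasetDistribution] = field(default_factory=list)


def version_history(latest):
    """Follow previousVersion links from the newest version back to the oldest."""
    history = []
    current = latest
    while current is not None:
        history.append(current.version)
        current = current.previousVersion
    return history


# Hypothetical records: one dataset, two versions, one distribution.
ds = Dataset(identifier="example-dataset-id", title="Example dataset")
v1 = DatasetVersion(version="1.0", versionOf=ds)
v2 = DatasetVersion(
    version="1.1",
    versionOf=ds,
    previousVersion=v1,
    hasDistribution=[
        DatasetDistribution(
            title="Example remote",
            format="text/csv",
            license="https://spdx.org/licenses/MIT",
        )
    ],
)

print(version_history(v2))  # ['1.1', '1.0']
```

Every DatasetVersion points at the same Dataset via versionOf, while the version chain and the distributions hang off the individual versions.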

jsheunis commented 11 months ago

Adding some older but applicable notes from @mih:

Dataset (concept)

DataLad identifier: dataset id

Dataset (version)

DataLad identifier: Git commit SHA
Properties
- author
- dateCreated
- previousVersion (range: Dataset version)
- hasPart (range: Dataset version content)

Dataset content (concept)

DataLad identifier: :<relative (POSIX) path>

Dataset version content (concept)

DataLad identifier: :<relative (POSIX) path>

File content (version)

DataLad identifier: Git blob SHA or git-annex key
Properties
- Availability: remote location
- Size: in bytes

(Remote) location (distribution)

A place where a dataset or individual dataset components are available, can be deposited or retrieved.
The record must communicate
- what kinds of objects are supported
- which access/deposition method is supported
- what type of identifier is needed for retrieval requests
- DataLad identifier: Git remote URL or git-annex repository/special-remote UUID
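
One possible shape for such a location record, as a minimal Python sketch: the class name, field names, and the object-kind and identifier-type labels are all hypothetical and only mirror the four points listed above; the URL and UUID are placeholders.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class RemoteLocation:
    # Git remote URL or git-annex repository/special-remote UUID
    datalad_identifier: str
    # what kinds of objects are supported (labels are assumptions)
    supported_objects: FrozenSet[str]
    # which access/deposition method is supported
    access_method: str
    # what type of identifier is needed for retrieval requests
    request_identifier_type: str

    def supports(self, object_kind: str) -> bool:
        return object_kind in self.supported_objects


def locations_for(object_kind: str, candidates: List[RemoteLocation]) -> List[str]:
    """Return the DataLad identifiers of all locations that support `object_kind`."""
    return [loc.datalad_identifier for loc in candidates if loc.supports(object_kind)]


# Placeholder records for illustration only.
locations = [
    RemoteLocation(
        datalad_identifier="https://example.org/repo.git",
        supported_objects=frozenset({"dataset-version"}),
        access_method="git",
        request_identifier_type="git-commit-sha",
    ),
    RemoteLocation(
        datalad_identifier="00000000-0000-0000-0000-000000000001",
        supported_objects=frozenset({"file-content"}),
        access_method="git-annex",
        request_identifier_type="git-annex-key",
    ),
]

print(locations_for("file-content", locations))
# ['00000000-0000-0000-0000-000000000001']
```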

jsheunis commented 11 months ago

To specify the data model more concretely, I need to deepen my understanding of files and content in relation to git-annex / git: how they are defined, how they evolve, and how they relate to other entities.

If the above distinction between Dataset content (concept), Dataset version content (concept), and File content (version) is still applicable, I need to get clarity on this.

Looking at relationships, I can understand that: Dataset (version) -> hasPart -> Dataset version content (concept), i.e. a dataset version can have many content items, each at a specific version. But I do not yet understand the relationships with Dataset content (concept) and File content (version).

mih commented 7 months ago

I believe that we have now settled on a schema that can express all this, with a relatively minimalistic setup.

psychoinformatics-de / datalad-concepts