psychoinformatics-de / datalad-concepts

Map the HCLS community profile description onto a datalad dataset #60

Closed jsheunis closed 7 months ago

jsheunis commented 11 months ago

Initial comments by @mih:

The HCLS dataset description specification appears to provide a range of required concepts that are missing in RO-crates. Importantly, this includes:

  • Distinction of summary-level, version-level, and distribution-level entities that could be mapped onto evolving datasets with multiple versions that are deposited in multiple locations.

The HCLS specification appears to be geared more towards a-file-as-a-dataset use cases. However, the aforementioned distinctions could be combined with the single-version-single-location use case of RO-crates to create a more appropriate, more comprehensive description of DataLad datasets.
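
As a rough illustration of the three levels named above, the following Python sketch models one summary-level dataset with several versions, each deposited in one or more locations. All class names, field names, and URLs are hypothetical and chosen only for this example; they are not taken from HCLS, RO-Crate, or DataLad.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Distribution:
    """Distribution level: one concrete copy of a version at one location."""
    download_url: str


@dataclass
class Version:
    """Version level: one fixed state of the dataset."""
    version: str
    distributions: List[Distribution] = field(default_factory=list)


@dataclass
class Summary:
    """Summary level: the version-independent description of the dataset."""
    title: str
    versions: List[Version] = field(default_factory=list)


def locations(summary: Summary) -> List[tuple]:
    """List (version, URL) pairs for every deposited copy."""
    return [
        (v.version, d.download_url)
        for v in summary.versions
        for d in v.distributions
    ]


# Hypothetical example: two versions, the second deposited in two places.
ds = Summary(
    title="Example dataset",
    versions=[
        Version("1.0", [Distribution("https://example.org/ds/1.0")]),
        Version("2.0", [
            Distribution("https://example.org/ds/2.0"),
            Distribution("https://mirror.example.org/ds/2.0"),
        ]),
    ],
)

for version, url in locations(ds):
    print(version, url)
```

The single-version-single-location case then corresponds to a `Summary` holding exactly one `Version` with exactly one `Distribution`.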

jsheunis commented 11 months ago

Noting down what I see as possibly important elements from the HCLS specification, mainly taking properties that the specification defines as MUST and SHOULD for the three levels of description:

Dataset summary

| Element | Property | Value | Requirement |
|---|---|---|---|
| Type declaration | rdf:type | dctypes:Dataset | MUST |
| Title | dct:title | rdf:langString | MUST |
| Description | dct:description | rdf:langString | MUST |
| Publisher | dct:publisher | IRI | MUST |
| HTML page | foaf:page | IRI | SHOULD |
| Logo | schemaorg:logo | IRI | SHOULD |
| Update frequency | dct:accrualPeriodicity | IRI of type dctypes:Frequency | SHOULD |
| SPARQL endpoint | void:sparqlEndpoint | IRI | SHOULD |

Dataset version

| Element | Property | Value | Requirement |
|---|---|---|---|
| Type declaration | rdf:type | dctypes:Dataset | MUST |
| Title | dct:title | rdf:langString | MUST |
| Description | dct:description | rdf:langString | MUST |
| Date created | dct:created | rdfs:Literal encoded using the relevant ISO 8601 Date and Time compliant string and typed using the appropriate XML Schema datatype | SHOULD |
| Creators | dct:creator | IRI | MUST |
| Publisher | dct:publisher | IRI | MUST |
| Date of issue | dct:issued | rdfs:Literal encoded using the relevant ISO 8601 Date and Time compliant string and typed using the appropriate XML Schema datatype | SHOULD |
| HTML page | foaf:page | IRI | SHOULD |
| Logo | schemaorg:logo | IRI | SHOULD |
| License | dct:license | IRI | SHOULD |
| Language | dct:language | http://lexvo.org/id/iso639-3/{tag} | SHOULD |
| Version identifier | pav:version | xsd:string | MUST |
| Version linking | dct:isVersionOf | IRI | MUST |
| Version linking | pav:previousVersion | IRI | SHOULD |
| Data source provenance | dct:source or pav:retrievedFrom or prov:wasDerivedFrom | IRI | SHOULD |
| Creation tool | pav:createdWith | IRI | SHOULD |
| Distribution description | dcat:distribution | IRI of Distribution Level description | SHOULD |

Dataset distribution

| Element | Property | Value | Requirement |
|---|---|---|---|
| Type declaration | rdf:type | dctypes:Dataset | SHOULD |
| Type declaration | rdf:type | void:Dataset or dcat:Distribution | MUST |
| Title | dct:title | rdf:langString | MUST |
| Description | dct:description | rdf:langString | MUST |
| Date created | dct:created | rdfs:Literal encoded using the relevant ISO 8601 Date and Time compliant string and typed using the appropriate XML Schema datatype | SHOULD |
| Creators | dct:creator | IRI | MUST |
| Publisher | dct:publisher | IRI | MUST |
| Date of issue | dct:issued | rdfs:Literal encoded using the relevant ISO 8601 Date and Time compliant string and typed using the appropriate XML Schema datatype | SHOULD |
| HTML page | foaf:page | IRI | SHOULD |
| Logo | schemaorg:logo | IRI | SHOULD |
| License | dct:license | IRI | MUST |
| Language | dct:language | http://lexvo.org/id/iso639-3/{tag} | SHOULD |
| Vocabulary used | void:vocabulary | IRI | SHOULD |
| Standards used | dct:conformsTo | IRI | SHOULD |
| Example identifier | idot:exampleIdentifier | xsd:string | SHOULD |
| Example resource | void:exampleResource | IRI | SHOULD |
| Version identifier | pav:version | xsd:string | SHOULD |
| Version linking | pav:previousVersion | IRI | SHOULD |
| Data source provenance | dct:source or pav:retrievedFrom or prov:wasDerivedFrom | IRI | SHOULD |
| Creation tool | pav:createdWith | IRI | SHOULD |
| File format | dct:format | IRI or xsd:string | MUST |
| File URL | dcat:downloadURL | IRI | SHOULD |
| Byte size | dcat:byteSize | xsd:decimal | SHOULD |
| RDF File URL | void:dataDump | IRI | SHOULD |
| Linkset | void:subset | IRI | SHOULD |

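
To make the MUST rows of the three tables above easier to work with, here is a minimal Python sketch that reports which MUST properties a record is missing for a given level. The property sets are copied from the tables; the function name `missing_must`, the level keys, and the example record are hypothetical. The check looks only at whether a property is present, not at whether its value has the required type.

```python
# MUST properties per description level, copied from the tables above.
MUST = {
    "summary": {"rdf:type", "dct:title", "dct:description", "dct:publisher"},
    "version": {
        "rdf:type", "dct:title", "dct:description", "dct:creator",
        "dct:publisher", "pav:version", "dct:isVersionOf",
    },
    "distribution": {
        "rdf:type", "dct:title", "dct:description", "dct:creator",
        "dct:publisher", "dct:license", "dct:format",
    },
}


def missing_must(level: str, record: dict) -> list:
    """Return the MUST properties of `level` that `record` lacks, sorted."""
    return sorted(MUST[level] - set(record))


# Hypothetical version-level record that lacks a creator and a version link.
version_record = {
    "rdf:type": "dctypes:Dataset",
    "dct:title": "Example dataset, version 1.0",
    "dct:description": "Hypothetical record used only for illustration.",
    "dct:publisher": "https://example.org/publisher",
    "pav:version": "1.0",
}

print(missing_must("version", version_record))
# ['dct:creator', 'dct:isVersionOf']
```
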
jsheunis commented 11 months ago

I used some of the above properties (most of the MUST ones and some of the SHOULD ones), and the following schema is what I can currently come up with. I'd like to put this up for discussion (would it be better to do so in a dedicated PR?).

Note: I focused only on finding a sensible way to structure the HCLS concepts into a schema that could be the starting point for us, and haven't yet spent time on mapping our git/datalad concepts and identifiers onto this schema.

[UPDATED]

---
id: https://w3id.org/psychoinformatics-de/datalad-schema
name: datalad-schema
title: datalad-schema
description: |-
  DataLad dataset schema
license: MIT
see_also:
  - https://psychoinformatics-de.github.io/datalad-schema
prefixes:
  afo: http://purl.allotrope.org/ontologies/result#
  datalad_schema: https://w3id.org/psychoinformatics-de/datalad-schema/
  dcat: http://www.w3.org/ns/dcat#
  dct: http://purl.org/dc/terms/
  dctypes: http://purl.org/dc/dcmitype/
  linkml: https://w3id.org/linkml/
  obo: https://purl.obolibrary.org/obo/
  ORCID: https://orcid.org/
  pav: http://purl.org/pav/
  rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
  schema: http://schema.org/
  spdx: https://spdx.org/licenses/
  xsd: http://www.w3.org/2001/XMLSchema#
default_prefix: datalad_schema
default_range: string

# imports
imports:
  - linkml:types

# classes are the main organizational unit for data;
# all data records instantiate a class
classes:

  Dataset:
    class_uri: dctypes:Dataset
    slots:
      - description
      - identifier
      - publisher
      - title
      - type

  DatasetVersion:
    class_uri: dctypes:Dataset
    slots:
      - creator
      - dateCreated
      - description
      - hasDistribution
      - hasPart
      - previousVersion
      - publisher
      - title
      - type
      - version
      - versionOf

  DatasetDistribution:
    class_uri: dctypes:Dataset
    slots:
      - creator
      - dateCreated
      - description
      - format
      - license
      - publisher
      - title
      - type
    slot_usage:
      type:
        range: dcat:Distribution

  File:
    class_uri: schema:DigitalDocument
    slots:
      - checksum_md5
      - path_posix
      - size_in_bytes
      - url

  Person:
    class_uri: schema:Person
    slots:
      - email
      - name
    slot_usage:
      email:
        pattern: "^\\S+@[\\S+\\.]+\\S+"

# slots are first-class entities in the metamodel
# declaring them here allows them to be reused elsewhere
slots:
  checksum_md5:
    slot_uri: obo:NCIT_C171276
  creator:
    slot_uri: dct:creator
    multivalued: true
    inlined_as_list: true
    range: Person
  dateCreated:
    slot_uri: dct:created
  description:
    slot_uri: dct:description
    range: rdf:langString
  email:
    slot_uri: schema:email
  format:
    slot_uri: dct:format
  hasPart:
    slot_uri: schema:hasPart
    multivalued: true
    inlined_as_list: true
    range: File
  hasDistribution:
    slot_uri: dcat:distribution
    multivalued: true
    inlined_as_list: true
    range: DatasetDistribution
  identifier:
    identifier: true
    slot_uri: schema:identifier
  license:
    slot_uri: dct:license
  name:
    slot_uri: schema:name
  path_posix:
    slot_uri: afo:AFR_0001928
  previousVersion:
    slot_uri: pav:previousVersion
    range: DatasetVersion
  publisher:
    slot_uri: dct:publisher
  size_in_bytes:
    slot_uri: schema:contentSize
    range: integer
    unit:
      ucum_code: byte
  title:
    slot_uri: dct:title
    range: rdf:langString
  type:
    slot_uri: rdf:type
    range: dctypes:Dataset
  url:
    slot_uri: schema:contentUrl
  version:
    slot_uri: pav:version
    range: xsd:string
    identifier: true
  versionOf:
    slot_uri: dct:isVersionOf
    range: Dataset
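
One detail of the draft that can be checked in isolation is the `email` pattern in the `Person` slot usage. The sketch below applies the same regular expression (after YAML unescaping) with Python's `re` module. The helper name `accepts` and the sample strings are hypothetical, and whether a LinkML validator applies the pattern as a full or a partial match is not verified here.

```python
import re

# The pattern from the Person.email slot_usage, after YAML unescaping.
EMAIL_PATTERN = re.compile(r"^\S+@[\S+\.]+\S+")


def accepts(value: str) -> bool:
    """True if the pattern matches from the start of `value`."""
    return EMAIL_PATTERN.match(value) is not None


for candidate in ["jane@example.org", "a@bc", "a@b", "jane@", "not an email"]:
    print(f"{candidate!r}: {accepts(candidate)}")
```

Inside the character class, `\S` already covers `+` and `.`, so the pattern only requires at least two non-whitespace characters after the `@` and does not require a dot in the domain part.
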
jsheunis commented 11 months ago

Adding some older but applicable notes from @mih:


Dataset (concept)

Dataset (version)

Dataset content (concept)

Dataset version content (concept)

File content (version)

(Remote) location (distribution)

jsheunis commented 11 months ago

In order to specify the data model more concretely, I need to deepen my understanding of files and content in relation to git-annex / git: how they are defined, how they evolve, and how they relate to other entities.

If the above distinction between Dataset content (concept), Dataset version content (concept), and File content (version) is still applicable, I need to get clarity on this.

Looking at relationships, I can understand that: Dataset (version) -> hasPart -> Dataset version content (concept), i.e. a dataset version can have many content items, each at a specific version. But I do not yet understand the relationships with Dataset content (concept) and File content (version).
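
The one relationship described above as understood, Dataset (version) -> hasPart -> content items that are each at a specific version, could be sketched as follows. The class and function names are hypothetical, the content version identifiers are placeholder strings, and the sketch deliberately says nothing about the relationships with Dataset content (concept) and File content (version) that are still open.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class DatasetVersion:
    """A dataset version: maps each content item (path) to the version
    of that item contained in this dataset version."""
    version: str
    has_part: Dict[str, str]  # path -> content version identifier


def changed_paths(old: DatasetVersion, new: DatasetVersion) -> list:
    """Paths whose content version differs between two dataset versions,
    including paths that were added or removed."""
    paths = set(old.has_part) | set(new.has_part)
    return sorted(p for p in paths if old.has_part.get(p) != new.has_part.get(p))


# Placeholder identifiers; real ones would come from git / git-annex.
v1 = DatasetVersion("1", {"README.md": "content-a", "data.csv": "content-b"})
v2 = DatasetVersion("2", {"README.md": "content-a", "data.csv": "content-c"})

print(changed_paths(v1, v2))  # ['data.csv']
```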

mih commented 7 months ago

I believe that we have now settled on a schema that can express all this, with a relatively minimalistic setup.