Closed jsheunis closed 7 months ago
Noting down what I see as possibly important elements from the HCLS specification, mainly taking properties that the specification defines as MUST and SHOULD for the three levels of description:

Summary Level:

Element | Property | Value | Requirement |
---|---|---|---|
Type declaration | rdf:type | dctypes:Dataset | MUST |
Title | dct:title | rdf:langString | MUST |
Description | dct:description | rdf:langString | MUST |
Publisher | dct:publisher | IRI | MUST |
HTML page | foaf:page | IRI | SHOULD |
Logo | schemaorg:logo | IRI | SHOULD |
Update frequency | dct:accrualPeriodicity | IRI of type dctypes:Frequency | SHOULD |
SPARQL endpoint | void:sparqlEndpoint | IRI | SHOULD |

Version Level:

Element | Property | Value | Requirement |
---|---|---|---|
Type declaration | rdf:type | dctypes:Dataset | MUST |
Title | dct:title | rdf:langString | MUST |
Description | dct:description | rdf:langString | MUST |
Date created | dct:created | rdfs:Literal encoded using the relevant ISO 8601 Date and Time compliant string and typed using the appropriate XML Schema datatype | SHOULD |
Creators | dct:creator | IRI | MUST |
Publisher | dct:publisher | IRI | MUST |
Date of issue | dct:issued | rdfs:Literal encoded using the relevant ISO 8601 Date and Time compliant string and typed using the appropriate XML Schema datatype | SHOULD |
HTML page | foaf:page | IRI | SHOULD |
Logo | schemaorg:logo | IRI | SHOULD |
License | dct:license | IRI | SHOULD |
Language | dct:language | http://lexvo.org/id/iso639-3/{tag} | SHOULD |
Version identifier | pav:version | xsd:string | MUST |
Version linking | dct:isVersionOf | IRI | MUST |
Version linking | pav:previousVersion | IRI | SHOULD |
Data source provenance | dct:source or pav:retrievedFrom or prov:wasDerivedFrom | IRI | SHOULD |
Creation tool | pav:createdWith | IRI | SHOULD |
Distribution description | dcat:distribution | IRI of Distribution Level description | SHOULD |
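
The "Date created" and "Date of issue" rows ask for an ISO 8601 string typed with an XML Schema datatype. A minimal sketch of what that pairing could look like, using only the Python standard library. The helper name `typed_literal` and the example dates are hypothetical, and the choice between `xsd:date` and `xsd:dateTime` is an assumption about what "appropriate" means here:

```python
from datetime import date, datetime, timezone

XSD = "http://www.w3.org/2001/XMLSchema#"


def typed_literal(value):
    """Return (ISO 8601 lexical form, XSD datatype IRI) for a date or datetime.

    Hypothetical helper for illustration only.
    """
    # datetime is a subclass of date, so it has to be checked first.
    if isinstance(value, datetime):
        return value.isoformat(), XSD + "dateTime"
    if isinstance(value, date):
        return value.isoformat(), XSD + "date"
    raise TypeError("expected a date or datetime")


# Made-up example values, not taken from any real dataset.
created = typed_literal(date(2023, 5, 1))
issued = typed_literal(datetime(2023, 5, 2, 9, 30, tzinfo=timezone.utc))
print(created)  # ('2023-05-01', 'http://www.w3.org/2001/XMLSchema#date')
print(issued)   # ('2023-05-02T09:30:00+00:00', 'http://www.w3.org/2001/XMLSchema#dateTime')
```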

Distribution Level:

Element | Property | Value | Requirement |
---|---|---|---|
Type declaration | rdf:type | dctypes:Dataset | SHOULD |
Type declaration | rdf:type | void:Dataset or dcat:Distribution | MUST |
Title | dct:title | rdf:langString | MUST |
Description | dct:description | rdf:langString | MUST |
Date created | dct:created | rdfs:Literal encoded using the relevant ISO 8601 Date and Time compliant string and typed using the appropriate XML Schema datatype | SHOULD |
Creators | dct:creator | IRI | MUST |
Publisher | dct:publisher | IRI | MUST |
Date of issue | dct:issued | rdfs:Literal encoded using the relevant ISO 8601 Date and Time compliant string and typed using the appropriate XML Schema datatype | SHOULD |
HTML page | foaf:page | IRI | SHOULD |
Logo | schemaorg:logo | IRI | SHOULD |
License | dct:license | IRI | MUST |
Language | dct:language | http://lexvo.org/id/iso639-3/{tag} | SHOULD |
Vocabulary used | void:vocabulary | IRI | SHOULD |
Standards used | dct:conformsTo | IRI | SHOULD |
Example identifier | idot:exampleIdentifier | xsd:string | SHOULD |
Example resource | void:exampleResource | IRI | SHOULD |
Version identifier | pav:version | xsd:string | SHOULD |
Version linking | pav:previousVersion | IRI | SHOULD |
Data source provenance | dct:source or pav:retrievedFrom or prov:wasDerivedFrom | IRI | SHOULD |
Creation tool | pav:createdWith | IRI | SHOULD |
File format | dct:format | IRI or xsd:string | MUST |
File URL | dcat:downloadURL | IRI | SHOULD |
Byte size | dcat:byteSize | xsd:decimal | SHOULD |
RDF File URL | void:dataDump | IRI | SHOULD |
Linkset | void:subset | IRI | SHOULD |
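
The MUST rows of the three tables can be turned into a small completeness check. This is a minimal sketch and not part of the HCLS specification or of any DataLad tooling: the dictionary layout, the helper `missing_must`, and the example record are assumptions made for illustration. Only the property names are copied from the tables, and the keys follow the three HCLS levels (summary, version, distribution).

```python
# MUST properties per description level, copied from the tables above.
MUST_PROPERTIES = {
    "summary": {
        "rdf:type", "dct:title", "dct:description", "dct:publisher",
    },
    "version": {
        "rdf:type", "dct:title", "dct:description", "dct:creator",
        "dct:publisher", "pav:version", "dct:isVersionOf",
    },
    "distribution": {
        "rdf:type", "dct:title", "dct:description", "dct:creator",
        "dct:publisher", "dct:license", "dct:format",
    },
}


def missing_must(level, record):
    """Return the sorted MUST properties that `record` lacks or leaves empty."""
    present = {prop for prop, value in record.items() if value}
    return sorted(MUST_PROPERTIES[level] - present)


# Hypothetical version-level record; all values are placeholders.
example_version = {
    "rdf:type": "dctypes:Dataset",
    "dct:title": "Example dataset, version 1.0",
    "dct:description": "Illustrative record only",
    "dct:publisher": "https://example.org/publisher",
    "pav:version": "1.0",
}

print(missing_must("version", example_version))
# ['dct:creator', 'dct:isVersionOf']
```

A check like this only covers the presence of a property, not whether its value has the type the table asks for.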
I used some of the above properties (most of the MUST ones, some of the SHOULD ones), and the following schema is what I can currently come up with. I'd like to put this up for discussion (would it be better to do so in a dedicated PR?)

- Core classes: `Dataset`, `DatasetVersion`, `DatasetDistribution`
- `Dataset` identifies the datalad dataset and contains mostly unchanging metadata
- `DatasetVersion` --> `versionOf` --> `Dataset`
- `DatasetVersion` --> `previousVersion` --> `DatasetVersion`
- `DatasetVersion` --> `hasPart` --> `File` (many)
- `DatasetVersion` --> `hasDistribution` --> `DatasetDistribution` (many)
- `DatasetDistribution` would contain information about a remote from which the dataset's content would be accessible

Note: I focused only on finding a sensible way to structure the HCLS concepts into a schema that could be the starting point for us, and haven't yet spent time on mapping our git/datalad concepts and identifiers onto this schema.
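
The relationships listed above can be sketched as plain Python dataclasses, to make the direction and cardinality of each link concrete. This is an illustration under assumptions, not the proposed implementation: the snake_case attribute names, the reduced set of fields, the `history` helper, and all example values are hypothetical, and no mapping to git/datalad identifiers is implied.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Dataset:
    identifier: str  # placeholder; the mapping to a datalad ID is still open
    title: str


@dataclass
class File:
    path_posix: str


@dataclass
class DatasetDistribution:
    title: str
    format: str


@dataclass
class DatasetVersion:
    version: str
    version_of: Dataset                                   # versionOf (one)
    previous_version: Optional["DatasetVersion"] = None   # previousVersion (zero or one)
    has_part: List[File] = field(default_factory=list)    # hasPart (many)
    has_distribution: List[DatasetDistribution] = field(default_factory=list)  # hasDistribution (many)


def history(version):
    """Follow previousVersion links and return version labels, newest first."""
    chain = []
    current = version
    while current is not None:
        chain.append(current.version)
        current = current.previous_version
    return chain


# Made-up example data.
ds = Dataset(identifier="example-dataset-id", title="Example dataset")
v1 = DatasetVersion(version="1.0", version_of=ds, has_part=[File("data/a.csv")])
v2 = DatasetVersion(
    version="2.0",
    version_of=ds,
    previous_version=v1,
    has_part=[File("data/a.csv"), File("data/b.csv")],
    has_distribution=[DatasetDistribution(title="Example remote", format="text/csv")],
)

print(history(v2))  # ['2.0', '1.0']
```

Both versions point to the same `Dataset` object, which is the sense in which `Dataset` holds the mostly unchanging metadata.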
[UPDATED]
```yaml
---
id: https://w3id.org/psychoinformatics-de/datalad-schema
name: datalad-schema
title: datalad-schema
description: |-
  DataLad dataset schema
license: MIT
see_also:
  - https://psychoinformatics-de.github.io/datalad-schema

prefixes:
  afo: http://purl.allotrope.org/ontologies/result#
  datalad_schema: https://w3id.org/psychoinformatics-de/datalad-schema/
  dcat: http://www.w3.org/ns/dcat#
  dct: http://purl.org/dc/terms/
  dctypes: http://purl.org/dc/dcmitype/
  linkml: https://w3id.org/linkml/
  obo: https://purl.obolibrary.org/obo/
  ORCID: https://orcid.org/
  pav: http://purl.org/pav/
  rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
  schema: http://schema.org/
  spdx: https://spdx.org/licenses/
  xsd: http://www.w3.org/2001/XMLSchema#
default_prefix: datalad_schema
default_range: string

# imports
imports:
  - linkml:types

# classes are the main organization unit for data;
# all data records instantiate a class
classes:
  Dataset:
    class_uri: dctypes:Dataset
    slots:
      - description
      - identifier
      - publisher
      - title
      - type
  DatasetVersion:
    class_uri: dctypes:Dataset
    slots:
      - creator
      - dateCreated
      - description
      - hasDistribution
      - hasPart
      - previousVersion
      - publisher
      - title
      - type
      - version
      - versionOf
  DatasetDistribution:
    class_uri: dctypes:Dataset
    slots:
      - creator
      - dateCreated
      - description
      - format
      - license
      - publisher
      - title
      - type
    slot_usage:
      type:
        range: dcat:Distribution
  File:
    class_uri: schema:DigitalDocument
    slots:
      - checksum_md5
      - path_posix
      - size_in_bytes
      - url
  Person:
    class_uri: schema:Person
    slots:
      - email
      - name
    slot_usage:
      email:
        pattern: "^\\S+@[\\S+\\.]+\\S+"

# slots are first-class entities in the metamodel
# declaring them here allows them to be reused elsewhere
slots:
  checksum_md5:
    slot_uri: obo:NCIT_C171276
  creator:
    slot_uri: dct:creator
    multivalued: true
    inlined_as_list: true
    range: Person
  dateCreated:
    slot_uri: dct:created
  description:
    slot_uri: dct:description
    range: rdf:langString
  email:
    slot_uri: schema:email
  format:
    slot_uri: dct:format
  hasPart:
    slot_uri: schema:hasPart
    multivalued: true
    inlined_as_list: true
    range: File
  hasDistribution:
    slot_uri: dcat:distribution
    multivalued: true
    inlined_as_list: true
    range: DatasetDistribution
  identifier:
    identifier: true
    slot_uri: schema:identifier
  license:
    slot_uri: dct:license
  name:
    slot_uri: schema:name
  path_posix:
    slot_uri: afo:AFR_0001928
  previousVersion:
    slot_uri: pav:previousVersion
    range: DatasetVersion
  publisher:
    slot_uri: dct:publisher
  size_in_bytes:
    slot_uri: schema:contentSize
    range: integer
    unit:
      ucum_code: byte
  title:
    slot_uri: schema:title
    range: rdf:langString
  type:
    slot_uri: rdf:type
    range: dctypes:Dataset
  url:
    slot_uri: schema:contentUrl
  version:
    slot_uri: pav:version
    range: xsd:string
    identifier: true
  versionOf:
    slot_uri: dct:isVersionOf
    range: Dataset
```
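
To get a feel for what the `email` pattern in the `Person` class accepts, here is a quick check with Python's `re` module. It is a sketch under one stated assumption: that the pattern is applied with search-style matching, as `re.search` does. How a particular LinkML validator applies `pattern` is not asserted here. The helper name `matches` and the candidate strings are made up for illustration.

```python
import re

# The pattern from the Person slot_usage, as it reads once YAML has
# unescaped the doubled backslashes of the double-quoted string.
EMAIL_PATTERN = re.compile(r"^\S+@[\S+\.]+\S+")


def matches(value):
    """Search-style match (assumption: validators may apply the pattern differently)."""
    return EMAIL_PATTERN.search(value) is not None


# Made-up candidate strings.
for candidate in [
    "jane@example.org",
    "no-at-sign",
    "a@bc",
    "jane@example.org trailing text",
]:
    print(repr(candidate), matches(candidate))
# 'jane@example.org' True
# 'no-at-sign' False
# 'a@bc' True
# 'jane@example.org trailing text' True
```

Two properties of the pattern explain the last two results: the character class `[\S+\.]` already matches any non-whitespace character, so a dot after the `@` is not required, and the pattern is anchored with `^` but not with `$`, so trailing text is not rejected under search-style matching.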
Adding some older but applicable notes from @mih:

- Dataset (concept)
- Dataset (version)
- Dataset content (concept)
- Dataset version content (concept)
- File content (version)
- (Remote) location (distribution)

In order to be able to specify the data model more concretely, I need to deepen my understanding of files and content in relation to git-annex / git, how they are defined, how they evolve, which relationships they have to which entities.
If the above distinction between `Dataset content (concept)`, `Dataset version content (concept)`, and `File content (version)` is still applicable, I need to get clarity on this.

Looking at relationships, I can understand that `Dataset (version)` -> `hasPart` -> `Dataset version content (concept)`, i.e. a dataset version can have many content items, each at a specific version. But I do not yet understand the relationships with `Dataset content (concept)` and `File content (version)`.
I believe that we have now settled on a schema that can express all this, with a relatively minimalistic setup.
Initial comments by @mih: