w3c / dxwg

Data Catalog Vocabulary (DCAT)
https://w3c.github.io/dxwg/dcat/

Proposal to include a property for "type of data" contained in a dataset #1548

Open svituz opened 1 year ago

svituz commented 1 year ago

I think it would be useful for Dataset to have a property that indicates the type of data it contains (such as dcat:collectedData or dcat:typeOfData). For example, for a Dataset with clinical data, such a property could take its value from controlled vocabularies or ontologies (e.g. LOINC, SNOMED) to express concepts such as "Laboratory Exams", "Vital Signs Observation", etc. I think it would increase the findability of datasets, in other domains as well.

Any thoughts about this?

rob-metalinkage commented 1 year ago

This speaks to the ambiguity in the semantics of dcterms:conformsTo... does it relate to the subject or the resource the subject describes?

Any data service is going to have at least 5 different conformance aspects, such as access method, data model, data content type and service level.

Possibly an aspect-oriented qualified relationship is needed.

Dimensions of data may be separate aspects, or a compound aspect. The RDF Data Cube vocabulary provides quite a powerful starting point for this.

andrea-perego commented 1 year ago

@svituz, your requirement is not completely clear to me.

Is it about conformity and data structure definition, as per @rob-metalinkage 's comment? Or is it about the classification of a dataset - which in DCAT is done via dcat:theme?

It would be useful if you could provide a full example, ideally also with its RDF representation.

svituz commented 1 year ago

@andrea-perego let's say I have a Dataset that collects data about a clinical study regarding a specific disease (the theme). The data you can find in the Dataset contains laboratory exams. I think these are two different concepts. The RDF example would be:

:studyDataset1 a dcat:Dataset;
dcat:theme icd10:C50;
dcat:collectedData obo:NCIT_C25294.

This RDF would describe a Dataset containing laboratory exams of clinical cases with breast cancer, for example.

You can have another one with the same theme (the disease) but a different type of data, for example digital X-ray:

:studyDataset2 a dcat:Dataset;
dcat:theme icd10:C50;
dcat:collectedData obo:NCIT_C18001.
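
To illustrate the findability argument, here is a minimal Python sketch that filters the two example datasets first by theme alone and then by theme plus type of data. It assumes the triples are held as plain tuples of prefixed names; `find_datasets` is a hypothetical helper, and dcat:collectedData is the property proposed in this issue, not an existing DCAT term.

```python
# Triples from the two examples above, as (subject, predicate, object) tuples.
# dcat:collectedData is the *proposed* property, not part of DCAT today.
TRIPLES = [
    (":studyDataset1", "rdf:type", "dcat:Dataset"),
    (":studyDataset1", "dcat:theme", "icd10:C50"),
    (":studyDataset1", "dcat:collectedData", "obo:NCIT_C25294"),
    (":studyDataset2", "rdf:type", "dcat:Dataset"),
    (":studyDataset2", "dcat:theme", "icd10:C50"),
    (":studyDataset2", "dcat:collectedData", "obo:NCIT_C18001"),
]


def find_datasets(triples, criteria):
    """Return the sorted dataset subjects that carry every
    (predicate -> object) pair listed in `criteria`."""
    facts = set(triples)
    datasets = {s for s, p, o in triples
                if p == "rdf:type" and o == "dcat:Dataset"}
    return sorted(
        s for s in datasets
        if all((s, p, o) in facts for p, o in criteria.items())
    )


# The theme alone cannot tell the two datasets apart.
by_theme = find_datasets(TRIPLES, {"dcat:theme": "icd10:C50"})

# Adding the type of data narrows the result to a single dataset.
by_theme_and_data = find_datasets(
    TRIPLES,
    {"dcat:theme": "icd10:C50", "dcat:collectedData": "obo:NCIT_C25294"},
)
```

With only the theme, both datasets match; the second query returns only :studyDataset1.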

Even if dcterms:conformsTo relates to the resource and not the subject (I had assumed the second scenario), I see it as applicable for indicating a model that describes the structure of the laboratory exams, not for stating that the dataset contains laboratory exams.

Hope to have clarified my issue and that it makes sense. I looked a lot in dcat (and also outside of dcat) to solve this issue we have but couldn't find a satisfying solution.

If you're interested in the context, here you can read about it

init-dcat-ap-de commented 11 months ago

Could you use the PROV Ontology?

rob-metalinkage commented 11 months ago

I'll circle back to my comment: there are multiple aspects. Do you want a property for every possible one, or a flexible mechanism to support documentation?

A well-known ontology could provide predicates (such as prov:wasGeneratedBy).

In this case you need to describe data structure: typically container organisation or a custom application schema (and specialised profiles thereof), data dimensions (e.g. RDF Data Cube), the nature of data elements within the containers, etc.

Profiles of DCAT supporting available descriptive ontologies would be better than half-implementing via a limited set of properties.

bertvannuffelen commented 2 months ago

@svituz when reading your case I get the feeling it could be resolved, as @rob-metalinkage mentioned, by creating a proper DCAT profile.

For your specific profiling case, DCAT has 3 options for classifications:

In the DCAT-AP ecosystem we regularly encounter the need to classify datasets according to domain-specific requirements. There are two strategies here:

The second option is the safest when one deals with datasets that must be documented by multiple DCAT profiles. It means that in your domain you can express a specific set of constraints on that property, and that any other user of the metadata can treat it as if it were dct:subject.

:studyDataset1 a dcat:Dataset;
dcat:theme icd10:C50;
profile:collectionOrigin obo:NCIT_C25294.

profile:collectionOrigin rdfs:subPropertyOf dct:subject.
profile:collectionOrigin rdfs:comment "The origin of data collection"@en.
profile:collectionOrigin skos:note "This is indicated using the NCIT classification"@en.

You can use this pattern for as many classifications as you want without losing compatibility with DCAT (because of the subPropertyOf relationship).
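
As a rough illustration of why the subPropertyOf relationship preserves compatibility, the following Python sketch applies the RDFS sub-property entailment rule (if `p rdfs:subPropertyOf q` and `s p o`, then `s q o`) to the example above. It assumes triples are plain tuples of prefixed names; `apply_subproperty_rule` is a hypothetical helper, not part of any RDF library, and a real deployment would more likely rely on an RDFS-aware triple store or reasoner.

```python
# The example triples, plus the sub-property declaration from the profile.
# profile:collectionOrigin is the profile-specific property from the example.
TRIPLES = {
    (":studyDataset1", "rdf:type", "dcat:Dataset"),
    (":studyDataset1", "dcat:theme", "icd10:C50"),
    (":studyDataset1", "profile:collectionOrigin", "obo:NCIT_C25294"),
    ("profile:collectionOrigin", "rdfs:subPropertyOf", "dct:subject"),
}


def apply_subproperty_rule(triples):
    """Apply the RDFS sub-property rule until no new triples appear:
    (p rdfs:subPropertyOf q) and (s p o) entail (s q o)."""
    inferred = set(triples)
    while True:
        sub_props = {(p, q) for p, pred, q in inferred
                     if pred == "rdfs:subPropertyOf"}
        new = {
            (s, q, o)
            for s, p, o in inferred
            for sub, q in sub_props
            if p == sub
        } - inferred
        if not new:
            return inferred
        inferred |= new


closure = apply_subproperty_rule(TRIPLES)

# A consumer that only understands dct:subject still finds the classification.
subjects = sorted(o for s, p, o in closure
                  if s == ":studyDataset1" and p == "dct:subject")
```

After the rule is applied, the dataset also carries a dct:subject statement with the same NCIT concept, so generic DCAT tooling can use the classification without knowing the profile-specific property.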