w3c / dxwg

Data Catalog Vocabulary (DCAT)
https://w3c.github.io/dxwg/dcat/

Generalize domain of dcat:distribution #1576

Open dgarijo opened 1 year ago

dgarijo commented 1 year ago

I am submitting this issue on behalf of the FAIR-Impact community, and per suggestion of (@agbeltran).

We would like to reuse DCAT for describing catalogs of semantic resources, creating a profile for semantic artefacts (ontologies and vocabularies). Our community represents the Linked Open Vocabularies and the Agrovoc catalogs, among others.

Our request is whether it is possible to generalize the domain of dcat:distribution from dcat:Dataset to dcat:Resource, as not all things with distributions are necessarily datasets. For example, we may want to build a catalog of ontologies, websites, software tools, or papers. All of these are resources that have distributions, but they are not necessarily datasets. We feel that extending dcat:Dataset for all these resources is like shoehorning the standard (e.g., properties like temporalResolution do not apply).

Thanks in advance!

makxdekkers commented 1 year ago

I am wondering why you think that semantic resources like ontologies and vocabularies cannot be described as dcat:Dataset. I recently created an application profile for reference data assets (things like code lists, taxonomies and SKOS concept schemes) and had no problem using DCAT for that. Obviously, some of the properties defined for dcat:Dataset may not be relevant for all types of dataset, but you're not obliged to use them if they make no sense for your collection. My worry about widening the domain to dcat:Resource is that that class is basically semantically empty, or were you planning to create a new class for semantic resources? If the latter, maybe you could look at ADMS?

dgarijo commented 1 year ago

Hi @makxdekkers, I do not know what ADMS is, sorry. Can you please provide more details?

In our group, we are discussing the extension of MOD (https://github.com/FAIR-IMPACT/MOD) as a DCAT profile. The extension of dcat:Dataset would imply stating that all ontologies are datasets, and some of us feel like this is not the case. And I think that tools, websites and papers are other elements that are usually catalogued that do not qualify just as "data".

We will probably be creating a sister class for the extension, but it would be great if we can directly extend dcat:Resource and be able to use dcat:distribution.

dgarijo commented 1 year ago

Tagging @mariapoveda to the thread so she can provide more details

H-a-g-L commented 1 year ago

To add to @makxdekkers' comment, the intentionally broad definition of dcat:Dataset would allow you to include the other assets you were referring to (websites, software tools etc.).

In the European Data Catalogue, ontologies and controlled vocabularies are typed as dcat:Dataset. dct:type is then used to specialise them as a particular type of dataset. For example, this is how the Eurovoc thesaurus is encoded:

<dcat:Dataset rdf:about="http://data.europa.eu/88u/dataset/eurovoc">
    <dct:type>
        <skos:Concept rdf:about="http://publications.europa.eu/resource/authority/dataset-type/THESAURUS"/>
    </dct:type>
    <!-- remaining properties omitted -->
</dcat:Dataset>

A similar approach was used for assets that may not seem “obvious” members of the dcat:Dataset class in the mapping from DataCite to DCAT done by @andrea-perego.

ADMS is a vocabulary for describing semantic assets. The latest release is available at: https://semiceu.github.io/ADMS/releases/2.00/

dgarijo commented 1 year ago

Thanks for your answers and the link to ADMS, I was not aware of that W3C note (@agbeltran maybe we can use it also for inspiration).

I was reviewing DCAT, and a dataset is defined as a collection of data. I find it difficult to map a tool to a collection of data, and an ontology too. Stretching the definition, you could say that a tool is a collection of bytes, or that an ontology may be seen as a collection of axioms/triples. But then anything is a dataset, right? A person is a collection of cells, a cell is a collection of atoms, a building is a collection of bricks, a sentence is a collection of words, etc.

If that is the case, is there really a difference between dcat:Resource and dcat:Dataset except for having a distribution? If dcat:Resource represents a dataset, a data service or any other resource that may be described by a metadata record in a catalog, then I think my point on having distributions associated with dcat:Resource is not crazy, no? Are there any examples of dcat:Resources that can be kept in catalogues but do not have distributions? Basically, are there any Resources that are not Datasets? I was thinking maybe a rock, but rocks do have physical distributions (where to find the rock in real life).

Thanks in advance!

agbeltran commented 1 year ago

Thanks @dgarijo for starting the discussion and @makxdekkers and @ODP-hil for your comments.

Indeed, the discussion started when several in the FAIR-IMPACT group (and from previous discussions in FAIRsFAIR) proposed to derive mod:SemanticArtefact from dcat:Dataset, and I agree with that view given the broad definition of dataset in DCAT.

However, a few people were opposed to this and would be more comfortable deriving from dcat:Resource.

As the dataset/distribution dichotomy is important, and we want to re-use it for semantic artefacts, we thought that a compromise would be to derive from dcat:Resource but still use the dcat:distribution property (and the relationship with dcat:Dataset would be inferred). The discussion is here and I will aim to summarise these points there too: https://github.com/FAIR-IMPACT/MOD/discussions/34

While I am of the view that we could use dcat:Dataset as ADMS does, there is the question of other resources (e.g. software) where there could still be a benefit in generalising dcat:distribution.

makxdekkers commented 1 year ago

Is this reluctance to describe things like ontologies as dcat:Dataset philosophical or practical? By practical I mean, if you were to define a new subclass of dcat:Resource (which I think the specification recommends), would the set of properties be very different from the set of properties defined for dcat:Dataset? So different that it can't be handled by a type and use of only a few properties? You already want to use dcat:distribution and other properties may not be relevant, but there is no obligation to use them. I see two issues to do with interoperability:

  1. Some people are already describing the things you work on as dcat:Dataset, using the Controlled Vocabulary for Dataset type of the Publications Office of the EU.
  2. As far as I know, most implementations of DCAT store and exchange descriptions of datasets, and probably none of them process dcat:Resources.

It is my worry that by doing things differently, your work is going to be in a different silo from other implementations, making it harder to achieve interoperability. But of course, interoperability with others outside your group may not be a crucial requirement in your case.

dgarijo commented 1 year ago

@makxdekkers, I think the rationale for the reluctance is a little bit mixed. On the one hand some people in our group feel like calling/extending everything from a Dataset is a little unnatural (however, they are ok with using Resource). And looking into the concrete definitions, it is not clear whether there is a big difference between dcat:Resource and dcat:Dataset, as I elaborated above.

On the other hand, there are some Dataset properties that would not be used. Yes, we can add a new type and just not use those properties, but it is not very practical to have them. As an intermediate solution I have proposed extending Resource and using distribution, which would essentially make our profile an implicit extension of Dataset without explicitly asserting it.

I think that the DCAT standard should probably explain why it is important to have both Resources and Datasets, and why these concepts are different by definition. For example, it could include examples of why a dcat:Resource is not necessarily a dcat:Dataset.

I see no interoperability issues, because if at some point we want to interoperate with any of those services, we can issue a CONSTRUCT query adding the corresponding dctypes. But I think this point is another discussion.
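As a rough illustration of that idea, the sketch below emulates such a CONSTRUCT step in plain Python over (subject, predicate, object) tuples. The class name mod:SemanticArtefact comes from the MOD discussion; the ex: identifiers, including the dataset-type value, are hypothetical placeholders, and a real setup would run the SPARQL query against a triple store instead.

```python
RDF_TYPE = "rdf:type"

# Roughly equivalent SPARQL (prefix declarations omitted):
#   CONSTRUCT { ?s a dcat:Dataset ; dct:type ex:ontologyDatasetType . }
#   WHERE     { ?s a mod:SemanticArtefact . }


def construct_dataset_typing(triples, source_class, dataset_type):
    """Return extra triples that type every instance of source_class as a dcat:Dataset."""
    extra = set()
    for subject, predicate, obj in triples:
        if predicate == RDF_TYPE and obj == source_class:
            extra.add((subject, RDF_TYPE, "dcat:Dataset"))
            extra.add((subject, "dct:type", dataset_type))
    return extra


# Hypothetical catalog content: one semantic artefact and one unrelated data service.
catalog = {
    ("ex:myOntology", RDF_TYPE, "mod:SemanticArtefact"),
    ("ex:myOntology", "dcat:distribution", "ex:myOntology-ttl"),
    ("ex:someService", RDF_TYPE, "dcat:DataService"),
}

# The exported view keeps the original triples and adds the Dataset typing on top.
exported = catalog | construct_dataset_typing(
    catalog, "mod:SemanticArtefact", "ex:ontologyDatasetType"
)

for triple in sorted(exported):
    print(triple)
```

Only the semantic artefact receives the additional typing; the data service is left untouched, so the export can be handed to a Dataset-oriented consumer without changing the source catalog.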

dr-shorthair commented 1 year ago

"it is not clear whether there is a big difference between dcat:Resource and dcat:Dataset"

dcat:Resource also subsumes dcat:DataService. As noted in the scope statement, "dcat:Resource is actually an extension point for defining a catalog of any kind of resources." so indeed it is the natural extension point if you wish to catalog other kinds of thing.

makxdekkers commented 1 year ago

But also the scope statement says that dcat:Resource is a "parent class" that is not "intended to be used directly". So it would make sense, if you deem your resources to be sufficiently different from datasets, to define a new subclass of dcat:Resource.

What I do not fully understand is that @dgarijo says that you want to "make our profile an implicit extension of Dataset without explicitly asserting it". This seems to say that your resources are not enormously different from datasets but just a variation. In that case, I wonder if declaring it as a separate class makes things more confusing?

dgarijo commented 1 year ago

Thanks for your answers. Yes, we will create a separate class in our profile. The discussion stems from our class not being a dcat:Dataset but a dcat:Resource with a dcat:distribution. Given the current domain of dcat:distribution, using the property would make our new class a dcat:Dataset, hence the request to generalize dcat:distribution with domain dcat:Resource instead of dcat:Dataset.

What I meant by making our profile an implicit extension is a bit of a hack I proposed to our group to make everyone happy: we don't extend dcat:Dataset, extending dcat:Resource instead, but we still use dcat:distribution. Then, if you infer triples, this would make our extension a type of dcat:Dataset, but only if you apply inference. Does that make it clear?
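To make the inference step concrete, here is a minimal sketch of the RDFS domain rule (rdfs2) in plain Python, using only the standard library. The ex: identifiers are hypothetical, and the "proposed" domain declaration is the relaxation requested in this issue, not something DCAT currently states.

```python
RDF_TYPE = "rdf:type"
RDFS_DOMAIN = "rdfs:domain"


def infer_domain_types(triples):
    """Apply rdfs2: (p rdfs:domain C) and (s p o) entail (s rdf:type C)."""
    domains = {(s, o) for s, p, o in triples if p == RDFS_DOMAIN}
    inferred = set()
    for s, p, o in triples:
        for prop, cls in domains:
            if p == prop:
                inferred.add((s, RDF_TYPE, cls))
    # Report only the triples that were not already asserted.
    return inferred - set(triples)


# Hypothetical instance data: an ontology typed with a MOD class, with one distribution.
data = {
    ("ex:myOntology", RDF_TYPE, "mod:SemanticArtefact"),
    ("ex:myOntology", "dcat:distribution", "ex:myOntology-ttl"),
}

# Current declaration discussed in this thread: the domain is dcat:Dataset.
current = data | {("dcat:distribution", RDFS_DOMAIN, "dcat:Dataset")}
# Proposed relaxation (an assumption for illustration): the domain is dcat:Resource.
proposed = data | {("dcat:distribution", RDFS_DOMAIN, "dcat:Resource")}

inferred_current = infer_domain_types(current)
inferred_proposed = infer_domain_types(proposed)

print(inferred_current)
print(inferred_proposed)
```

Without a reasoner, the asserted data never mentions dcat:Dataset; with the current domain, a reasoner adds that type, whereas with the relaxed domain it would add only dcat:Resource.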

mariapoveda commented 1 year ago

Sorry for arriving quite late.

I also find it unnatural to classify ontologies as Datasets (compare this with classifying SKOS vocabularies as datasets, which looks fine): an ontology contains definitions rather than a set of data or facts.

The point is that OWL ontologies, for example, would need to have distributions, but at the Resource level or at a level sibling to Dataset. As it happens, the differences between Resource and Dataset are not so clear (e.g. overly general definitions that open the door to "other..." and the EU list of resources), and it seems that the term Dataset is being used to duplicate the Resource concept and keep it general, since dcat:Resource is not "intended to be used directly".

dr-shorthair commented 1 year ago

What does 'unnatural' mean?

If we use the genus/differentia approach to classification, then an ontology is 'a dataset that is composed of axioms' which could be compared with a SKOS vocabulary which is 'a dataset that is composed of concept definitions' or an image which is 'a dataset composed of pixels' or a catalog which is 'a dataset composed of metadata records'.

If all the other descriptors associated with a dataset still pertain, then how is it unnatural?

agbeltran commented 1 year ago

Following up on this discussion, people may disagree if a semantic artefact is, or should be represented as, a dataset or not and this will have an impact on interoperability in some cases.

However, the objective of this issue was to discuss if it makes sense for DCAT to generalise the domain of dcat:distribution, which is currently dcat:Dataset.

So, the point is if there are entities other than datasets that may use dcat:distribution, then we could generalise the domain to be dcat:Resource instead of dcat:Dataset.

This may not be important for those communities where "anything" may be represented as a dataset, but this becomes important for those communities in which a semantic artefact, or software, etc is not represented as a dcat:Dataset but they still have multiple distributions (representations/serialisations).

What do people think about generalising the domain of dcat:distribution to dcat:Resource?

mariapoveda commented 1 year ago

I would agree with the generalization of dcat:distribution to dcat:Resource so that it can be applied to entities/resources/assets that are not necessarily datasets.

dr-shorthair commented 1 year ago

While I stand by my comment above (that no differentia have been proposed that make an Ontology different to a Dataset), I understand that

  1. there may be a social objection to classifying an Ontology as a Dataset
  2. there may be other subclasses of dcat:Resource that have Distributions, but are not Datasets

So I'm OK with the proposal to relax the domain of dcat:distribution to dcat:Resource.

pwin commented 1 year ago

Would it be helpful to have a more relaxed super-property of dcat:distribution, rather than changing dcat:distribution itself?

riccardoAlbertoni commented 1 year ago

I have labeled this issue "future work" as the data exchange working group (DXWG) has voted for the DCAT 3 Candidate Rec, and the process requires that we crystalize the features and changes included in the third release of DCAT. The GitHub issue remains open to facilitate ongoing discussion, ensuring that its outcome can be taken into account in subsequent rounds of DCAT standardization.

bertvannuffelen commented 6 months ago

As a reflection, the WG should consider whether dcat:Resource has a semantic meaning of its own or whether it is just an alternative for rdfs:Resource.

I have always read dcat:Resource as a Catalogued Resource, i.e. a resource that is in the catalogue and which is actively managed by the catalogue. By making a Distribution a dcat:Resource subclass, it becomes a directly managed entity. Thus a catalogue of Distributions is fine.

Then dcat:distribution may also get different semantics: since dcat:dataset is a subproperty of dcat:resource, for coherence dcat:distribution must then also be a subproperty of dcat:resource.
But then all datasets become catalogues, as dcat:resource has dcat:Catalog as its domain.
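The chain of reasoning above can be sketched with two RDFS rules, rdfs7 (subproperty) and rdfs2 (domain), in plain Python. The domain of dcat:resource is taken from the comment above; the subproperty axiom for dcat:distribution is the hypothetical one under discussion, not part of DCAT, and the ex: identifiers are placeholders.

```python
RDF_TYPE = "rdf:type"
SUBPROPERTY_OF = "rdfs:subPropertyOf"
DOMAIN = "rdfs:domain"


def rdfs_closure(triples):
    """Apply rdfs7 and rdfs2 repeatedly until no new triple is entailed."""
    closed = set(triples)
    while True:
        new = set()
        for s, p, o in closed:
            for s2, p2, o2 in closed:
                if s2 != p:
                    continue
                if p2 == SUBPROPERTY_OF:
                    new.add((s, o2, o))  # rdfs7: lift the statement to the super-property
                elif p2 == DOMAIN:
                    new.add((s, RDF_TYPE, o2))  # rdfs2: type the subject with the domain
        if new <= closed:
            return closed
        closed |= new


schema = {
    # Stated in the comment above: dcat:resource has domain dcat:Catalog.
    ("dcat:resource", DOMAIN, "dcat:Catalog"),
    # Hypothetical axiom being discussed, not part of DCAT.
    ("dcat:distribution", SUBPROPERTY_OF, "dcat:resource"),
}
data = {("ex:dataset1", "dcat:distribution", "ex:dataset1-csv")}

closure = rdfs_closure(schema | data)
for triple in sorted(closure - schema - data):
    print(triple)
```

Under these assumptions, any subject of dcat:distribution is first lifted to dcat:resource and then typed as dcat:Catalog, which is the unwanted consequence described in the comment.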

jonquet commented 4 months ago

Hello all, I never realized that part of our discussion in FAIR-IMPACT had moved here. Sorry. Just to keep the pointer: https://github.com/FAIR-IMPACT/MOD/discussions/34 This is the discussion we are talking about.

IMO we can see a mod:SemanticArtefact explicitly as a dcat:Dataset (indeed as a dataset of terms, which does not mean that everything is a dataset too...). I think this is the case not so much because an ontology matches the definition of "collection of data", but mostly because MOD adheres to the general principle and philosophy of DCAT, which is to describe datasets (in a broad sense) that can be catalogued, served and distributed. To me this is what we want to do for semantic artefacts (SA), so adopting DCAT gives us the path, and we should not get stuck along the way on the restrictive definition of dcat:Dataset as it stands.

If DCAT enables dcat:distribution for dcat:Resource rather than dcat:Dataset, I think this will create more confusion than benefit.