w3c / dxwg

Data Catalog Vocabulary (DCAT)
https://w3c.github.io/dxwg/dcat/

Scope of DCAT - datasets, or digital descriptions #1235

Closed dr-shorthair closed 3 years ago

dr-shorthair commented 4 years ago

Discussions in #1221 and elsewhere indicate that, while DCATv2 was extended to allow for cataloguing data-services, and includes the class dcat:Resource to serve as an extension point for additional applications, there is still some unease about the use of the DCAT model for applications beyond the datasets use-case that drove the original development of DCAT.

The question probably comes down to the general scope of DCAT: is it primarily (a) a pattern and vocabulary for catalogs of descriptions of interesting things, or (b) a vocabulary for describing and cataloguing data?

The general concern is buried in a much longer thread which was triggered by the question of whether 'software' could be classified as a 'dataset' in order to fit into DCAT. So I'm creating this new issue so that we can have the general discussion more transparently.

dr-shorthair commented 4 years ago

I am of the opinion that the DCAT model is suitable for cataloguing many things, not just datasets. In https://github.com/w3c/dxwg/issues/1221#issuecomment-607040099 I sketched an example of a description of a physical specimen, in a way that could be a member of a DCAT catalog of specimens.

On the other hand, the name of the vocabulary, 'Data Catalog Vocabulary', appears to be more narrowly scoped. We discussed this briefly in the DCAT task-group telecon last week, and it was suggested that, if the scope is more general than 'data', then we should consider changing the name. For example, 'Digital Catalog Vocabulary' would allow us to retain the acronym while better reflecting an extended scope.

kcoyle commented 4 years ago

@dr-shorthair Simon, thanks for starting this.

In the spirit of "anyone can say anything about anything" I don't think that one would want to try to limit the uses of DCAT. That, however, is different from encouraging uses that have yet to be shown as fruitful. Before making any changes to the DCAT documentation, it would seem to me that a good approach would be to begin to gather and publicize all uses of DCAT and see what develops in the real world.

There's a related issue, though, which is whether the use of some portion of the DCAT vocabulary (any of the classes or properties) constitutes an instance of DCAT. This is a general problem in the mix'n'match world, not specific to DCAT, but it needs to inform the group's thinking. What is it that makes a DCAT instance? Anything from the standard? Everything? This opens the question of how a DCAT instance is defined. It may be that the only viable definition comes from the DCAT-AP and related specifications, because those have constraints (e.g. which classes are mandatory). In any case, I feel that there has to be some definition against which instances can be measured before one can say which are instances of DCAT. I have my own thoughts on what distinguishes DCAT from other catalogs, but that may be for a different thread.

akuckartz commented 4 years ago

In any case, I feel that there has to be some definition against which instances can be measured before one can say which are instances of DCAT.

Where does the requirement "instances of DCAT" come from?

makxdekkers commented 4 years ago

On the issue of what is an instance of DCAT, I don't know under which circumstances the question would be relevant. There are just cases where (some of) DCAT is used. What would be the purpose of answering 'yes' or 'no' to the question? I hope it's not to make a value judgment, declaring some applications of DCAT to be 'wrong'. As @kcoyle writes, it's an issue -- I wouldn't call it a problem -- with mixing and matching. I might have an application that mixes and matches Dublin Core, DCAT, ADMS, schema.org and more. Is a snippet of metadata that contains some or all of those an 'instance' of all of them? Should we care?
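
To make the mix-and-match point concrete, here is a minimal sketch in plain Python (no RDF library) that reports which vocabularies a snippet of metadata draws on. The namespace IRIs are the published ones for DCAT, DCMI Terms, ADMS and schema.org; the `https://example.org/` resource, the literal values and the helper name `vocabularies_used` are hypothetical and only serve the illustration.

```python
# Namespace IRIs of the vocabularies named in the comment above.
NAMESPACES = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dcterms": "http://purl.org/dc/terms/",
    "adms": "http://www.w3.org/ns/adms#",
    "schema": "https://schema.org/",
}
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def vocabularies_used(triples):
    """Return the prefixes whose terms appear as predicates or as rdf:type objects."""
    used = set()
    for _subject, predicate, obj in triples:
        # A class IRI only counts as vocabulary use when it is the object of rdf:type.
        terms = [predicate, obj] if predicate == RDF_TYPE else [predicate]
        for term in terms:
            for prefix, iri in NAMESPACES.items():
                if term.startswith(iri):
                    used.add(prefix)
    return used


# Hypothetical description that mixes four vocabularies on one resource.
snippet = [
    ("https://example.org/d1", RDF_TYPE, NAMESPACES["dcat"] + "Dataset"),
    ("https://example.org/d1", NAMESPACES["dcterms"] + "title", '"Example"'),
    ("https://example.org/d1", NAMESPACES["adms"] + "identifier", "https://example.org/id/1"),
    ("https://example.org/d1", NAMESPACES["schema"] + "name", '"Example"'),
]

print(sorted(vocabularies_used(snippet)))
```

The function can say which vocabularies are used, but not whether the snippet is an 'instance' of any of them; that judgment would need conformance rules such as those an application profile supplies.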

kcoyle commented 4 years ago

@akuckartz @makxdekkers Simon's statements and questions:

the use of the DCAT model for applications beyond the datasets use-case

The question probably comes down to the general scope of DCAT: is it primarily (a) a pattern and vocabulary for catalogs of descriptions of interesting things, or (b) a vocabulary for describing and cataloguing data?

make me wonder what is meant by DCAT in "general scope of DCAT" and "the DCAT model". Is the question here about DCAT as in Figure 1 of the spec? Or is this about any use of any of the DCAT namespace classes and properties? Given that anyone can say anything about anything, if there is no conformance expected, what do the questions mean?

And, once again, I think that one should allow metadata communities to define their own uses. One shouldn't declare a use for others. Thus, changing the D in DCAT from Dataset to Digital seems presumptuous to me, unless there are non-dataset folks coming forward to become part of the DCAT community. That could be encouraged, and if it comes to be then a name change would make sense to me.

makxdekkers commented 4 years ago

@kcoyle I am with you. Unless there is a strong pull from current or potential implementers to extend the intended scope of DCAT, I see no need to make such a name change. It creates the risk that people will get confused.

agreiner commented 4 years ago

I think there are really two questions underlying this issue. One is the one Simon has posed, whether we should widen the scope of DCAT to things that are not data but seem to be usefully described by DCAT. (I think designing the vocabulary for that would be a mistake, as it would dilute its usefulness for data and lead us into trying to address an unbounded list of possible use cases.) The other is how broadly we welcome datasets that are not traditional ones. I have been thinking lately that we should begin to think about how to address machine learning datasets, so that would include images for sure, text corpora, possibly software, but I would only want to include these things when their intended purpose is use as data. The U.S. Dept. of Energy is now in the midst of a call for proposals about making data for artificial intelligence FAIR. I think that will generate a whole set of use cases that we should consider.

agbeltran commented 4 years ago

With the addition of dcat:Resource in DCAT 2, we have already indicated that this class is "an extension point for defining a catalog of any kind of resource" (see https://www.w3.org/TR/vocab-dcat-2/#dcat-scope), and thus we have offered the possibility of a broader scope beyond datasets and data services.

Even if our main focus is around datasets and data services (as represented by the use cases and requirements we have addressed and those that are pending), I think it would be useful to show how the DCAT2 terminology could be used for cataloguing other kinds of resources. This doesn't mean that we need to describe those resources (and in fact, we probably shouldn't).
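
As a rough illustration of cataloguing a non-dataset resource, the sketch below builds a handful of triples in plain Python and prints them in an N-Triples-like form. It assumes a hypothetical class `ex:Specimen` declared as a subclass of dcat:Resource, and uses dcterms:hasPart as the generic catalog-membership predicate; all `https://example.org/` IRIs, titles and helper names are invented for the example and are not taken from the DCAT documentation.

```python
# Published namespace IRIs; everything under EX is hypothetical.
DCAT = "http://www.w3.org/ns/dcat#"
DCT = "http://purl.org/dc/terms/"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
EX = "https://example.org/"


def lit(text):
    """Wrap a plain string as a quoted literal, escaping backslashes and quotes."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


triples = [
    # Assumption: a community-defined class that extends dcat:Resource.
    (EX + "Specimen", RDFS + "subClassOf", DCAT + "Resource"),
    (EX + "catalog", RDF + "type", DCAT + "Catalog"),
    (EX + "catalog", DCT + "title", lit("Specimen catalog")),
    # Generic membership link between the catalog and the catalogued item.
    (EX + "catalog", DCT + "hasPart", EX + "specimen-1"),
    (EX + "specimen-1", RDF + "type", EX + "Specimen"),
    (EX + "specimen-1", DCT + "title", lit("Rock sample 1")),
]


def to_ntriples(rows):
    """Serialise (subject, predicate, object) rows, one statement per line."""
    lines = []
    for s, p, o in rows:
        obj = o if o.startswith('"') else "<" + o + ">"
        lines.append("<%s> <%s> %s ." % (s, p, obj))
    return "\n".join(lines)


def members(rows, catalog):
    """List the resources a catalog links to via dcterms:hasPart."""
    return [o for s, p, o in rows if s == catalog and p == DCT + "hasPart"]


print(to_ntriples(triples))
```

Only the catalog, membership and title statements come from DCAT and DCMI Terms here; anything describing the specimen itself would come from the community's own vocabulary, in line with the point that DCAT need not describe those resources.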

agbeltran commented 4 years ago

About this comment @kcoyle

Thus, changing the D in DCAT from Dataset to Digital seems presumptuous to me, unless there are non-dataset folks coming forward to become part of the DCAT community. That could be encouraged, and if it comes to be then a name change would make sense to me.

The advantage I would see in changing the name (but not the acronym) would be to emphasise the cataloguing aspects of the vocabulary rather than the data cataloguing aspects. This might help raise awareness among non-dataset folks of the cataloguing capabilities beyond data.

smrgeoinfo commented 4 years ago

I agree with @agbeltran: the base DCAT model provides a foundation for describing many kinds of resources, not just datasets. As it stands, applications or communities using DCAT as a foundation will need to provide validation resources and documentation for their resource descriptions. The application profile spec (DCAT-AP) provides guidance on doing that. I think DCAT already has 'a pattern and vocabulary for catalogs of descriptions of interesting things'; messaging that it's only for datasets is not necessary, and misses an opportunity for a more broadly useful documentation scheme.

Looking at the UML model in DCAT Scope, it seems to me that the model can be quite nicely modularized, with everything except dcat:Dataset and dcat:DataService in a generic 'RCAT' (resource catalog) module, and dcat:Dataset and dcat:DataService in a 'Dataset' module. Distribution should be a property on any resource; a Distribution should specify the representations available for the resource, as well as the requests necessary to obtain a particular representation.

My hope is that a vocab like DCAT (and RCAT?) can be integrated into schema.org (sdo) to take advantage of the (apparently) lower impedance to adoption for sdo. Better yet, it could supersede sdo with a semantically coherent vocabulary that enables inference and automation. sdo already has 'CreativeWork' as the generic class that appears to correspond most closely to 'Resource' in the sense of DCAT.

riccardoAlbertoni commented 4 years ago

I agree that DCAT is useful for describing catalogs that are not necessarily mere data catalogs.

+1 to emphasize the "DCAT catalog modeling pattern," explaining in the DCAT document that there are opportunities to employ DCAT also for catalogs that are not mere data catalogs. The notion of the dataset adopted in DCAT is very inclusive, and for cataloged things that don't fall within that inclusive definition, we have added dcat:Resource. +1 to add examples in the primer or the appendix that show these opportunities, as this makes it more recognizable that we provide a catalog modeling pattern.

However, providing a useful pattern does not imply that DCAT is enough to satisfy all the requirements that digital catalogs might have in general, nor does it suffice to promote interoperability among them.

So I am reluctant to rename DCAT as "DIgital CATalogs" at the moment.

I would put more emphasis on the "DCAT Catalog pattern" by means of additional explanations and examples, and then I would use the next published draft to collect feedback, adoption examples, and more use cases to ground the renaming or any more substantial change in the DCAT scope.

agbeltran commented 4 years ago

We discussed this issue today and agreed:

Consideration of the scope of DCAT - just data or also other kinds of resources - is important but cannot be resolved now. We should proceed to develop examples and use-cases, then we can later consider whether the vocabulary needs re-naming or not.

andrea-perego commented 3 years ago

@agbeltran said:

We discussed this issue today and agreed:

Consideration of the scope of DCAT - just data or also other kinds of resources - is important but cannot be resolved now. We should proceed to develop examples and use-cases, then we can later consider whether the vocabulary needs re-naming or not.

I therefore propose we close this issue.

andrea-perego commented 3 years ago

No objections raised. Closing this issue.