w3c / dxwg

Data Catalog Vocabulary (DCAT)
https://w3c.github.io/dxwg/dcat/

Relation among Dataset, Distribution and Data Service #1126

Open jakubklimek opened 4 years ago

jakubklimek commented 4 years ago

I am working on an implementation of DCAT2 (and DCAT-AP 2.0.0) in LinkedPipes DCAT-AP Forms, LinkedPipes DCAT-AP Viewer and the Czech National Open Data catalog. I am currently wondering about the relationship among Dataset, Distribution and Data Service, and am seeking additional insights from the WG.

Let me illustrate with an example of what seems clear.

Let us have a Dataset with one RDF TriG Distribution:

:dataset a dcat:Dataset ;
    dcat:distribution :distribution  .

:distribution a dcat:Distribution ;
    dcat:accessURL <https://data.cssz.cz/dataset/ciselnik-datovych-typu> ;
    dcat:downloadURL <https://data.cssz.cz/dump/ciselnik-datovych-typu.trig> ;
    dcterms:format <http://publications.europa.eu/resource/authority/file-type/RDF_TRIG> ;
    dcat:mediaType <http://www.iana.org/assignments/media-types/application/trig> .

Now, I want to use DCAT2 to express that there is a SPARQL endpoint at http://data.cssz.cz/sparql serving this dataset. It is clear that I need an instance of dcat:DataService:

:dataservice a dcat:DataService ;
    dcat:endpointURL <https://data.cssz.cz/sparql> ;
    dcat:endpointDescription <https://data.cssz.cz/sparql> .

The questions start to appear when I think about how to interconnect those. At first, I was thinking along the lines of the diagram:

:dataservice dcat:servesDataset :dataset .
:distribution dcat:accessService :dataservice .

However, I got confused here.

:distribution describes a downloadable TriG file. At the moment, there is no way of getting a TriG file out of a SPARQL endpoint. Therefore,

:distribution dcat:accessService :dataservice

suddenly does not make sense. But how to use dcat:accessService from a distribution then?

Should it be that the Dataset actually has 2 distributions like this:

:distribution1 a dcat:Distribution ;
    dcat:accessURL <https://data.cssz.cz/dataset/ciselnik-datovych-typu> ;
    dcat:downloadURL <https://data.cssz.cz/dump/ciselnik-datovych-typu.trig> ;
    dcterms:format <http://publications.europa.eu/resource/authority/file-type/RDF_TRIG> ;
    dcat:mediaType <http://www.iana.org/assignments/media-types/application/trig> .
:distribution2 a dcat:Distribution ;
   dcat:accessURL <https://data.cssz.cz/sparql> ;
   dcat:accessService :dataservice .

i.e. a distribution pointing not to a file, but to a service? If so, what about properties that are defined on both Distribution and DataService? We can specify them in both places, and the resulting meaning is not clear, e.g. conflicting licenses or accessRights on the Distribution and the DataService.

Then again, in my opinion, a DataService is really just a means of accessing representations of Datasets, therefore, I see it more on the same level with a Distribution rather than on the level of Datasets. However, the fact that both Datasets and DataServices inherit from dcat:Resource, but dcat:Distributions do not, would suggest otherwise. Does the WG envision here that e.g. open data portals start cataloguing Data Services similarly to Datasets, e.g. as first-class citizens of Catalogs, instead of using them at the level of Distributions, i.e. entities dependent on Datasets?

Finally, it seems a bit confusing that a Data Service serves Datasets, but it is the Distributions of Datasets that are accessed using a Data Service. This may, however, be connected to the point above.

The only guidance I found in the document is at the end of 6.7:

Links between a dcat:Distribution and services or Web addresses where it can be accessed are expressed using dcat:accessURL, dcat:accessService, dcat:downloadURL, as shown in Figure 1 and described in the definitions below.

which, btw, seems like a weird sentence.

Any thoughts on this? Am I missing something or overthinking this?

riccardoAlbertoni commented 4 years ago

Hi @jakubklimek, thanks for your comments, please see my replies below.

Should it be that the Dataset actually has 2 distributions like this:

:distribution1 a dcat:Distribution ;
    dcat:accessURL <https://data.cssz.cz/dataset/ciselnik-datovych-typu> ;
    dcat:downloadURL <https://data.cssz.cz/dump/ciselnik-datovych-typu.trig> ;
    dcterms:format <http://publications.europa.eu/resource/authority/file-type/RDF_TRIG> ;
    dcat:mediaType <http://www.iana.org/assignments/media-types/application/trig> .
:distribution2 a dcat:Distribution ;
   dcat:accessURL <https://data.cssz.cz/sparql> ;
   dcat:accessService :dataservice .

i.e. a distribution pointing not to a file, but to a service?

Yes, I would connect the SPARQL endpoint to a different distribution than the TriG file. Something similar is illustrated in EXAMPLE 45.
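To make the answer concrete, the snippets from the question can be assembled into one self-contained sketch. The prefix declarations and the `https://example.org/resource/` namespace are assumptions added here (the thread never declares them); everything else reuses the URLs and names already given above.

```turtle
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix :        <https://example.org/resource/> .  # hypothetical namespace, not from the thread

:dataset a dcat:Dataset ;
    dcat:distribution :distribution1 , :distribution2 .

# Distribution 1: the downloadable TriG dump
:distribution1 a dcat:Distribution ;
    dcat:accessURL <https://data.cssz.cz/dataset/ciselnik-datovych-typu> ;
    dcat:downloadURL <https://data.cssz.cz/dump/ciselnik-datovych-typu.trig> ;
    dcterms:format <http://publications.europa.eu/resource/authority/file-type/RDF_TRIG> ;
    dcat:mediaType <http://www.iana.org/assignments/media-types/application/trig> .

# Distribution 2: access to the same dataset through the SPARQL endpoint
:distribution2 a dcat:Distribution ;
    dcat:accessURL <https://data.cssz.cz/sparql> ;
    dcat:accessService :dataservice .

# The service itself, linked back to the dataset it serves
:dataservice a dcat:DataService ;
    dcat:endpointURL <https://data.cssz.cz/sparql> ;
    dcat:servesDataset :dataset .
```

In this reading, dcat:accessService is only ever used from the service-based distribution, so the TriG distribution makes no claim about the endpoint.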

If so, what about properties that are defined on both Distribution and DataService? We can specify them in both places, and the resulting meaning is not clear, e.g. conflicting licenses or accessRights on the Distribution and the DataService.

Distribution and DataService are two distinct things, and there might be cases where you need to specify a license for both. For example, the same SPARQL endpoint might serve distributions of different datasets. I think this is quite a classic case for SPARQL endpoints, and it is one reason why they have been modelled separately.
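As a rough illustration of that case, one endpoint serving two datasets could look like the sketch below. The names `:datasetA`, `:datasetB`, `:distributionA`, `:distributionB` and the `https://example.org/resource/` namespace are hypothetical and only there to show the shape of the graph.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix :     <https://example.org/resource/> .  # hypothetical namespace

# Two independent datasets (hypothetical), each with a service-based distribution
:datasetA a dcat:Dataset ;
    dcat:distribution :distributionA .

:datasetB a dcat:Dataset ;
    dcat:distribution :distributionB .

:distributionA a dcat:Distribution ;
    dcat:accessURL <https://data.cssz.cz/sparql> ;
    dcat:accessService :dataservice .

:distributionB a dcat:Distribution ;
    dcat:accessURL <https://data.cssz.cz/sparql> ;
    dcat:accessService :dataservice .

# A single service description shared by both distributions
:dataservice a dcat:DataService ;
    dcat:endpointURL <https://data.cssz.cz/sparql> ;
    dcat:servesDataset :datasetA , :datasetB .
```

Because the service is described once and referenced twice, anything stated on `:dataservice` applies to the endpoint as a whole, while anything stated on a distribution applies only to that dataset's access path.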

In DCAT2, we have been quite liberal about what must be used in which circumstances; in particular, I think that inherited properties need only be used where they are required. Different rules of thumb might be adopted, e.g., "when the service is connected to a distribution, put the license at the distribution level only". This seems an easy way to avoid inconsistencies, but different catalogues/communities might want to use other rules, e.g., "put a license on both distributions and related services, but ensure that they are compatible".
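A minimal sketch of the first rule of thumb, with the second shown as a commented-out alternative. The license URI (CC BY 4.0) and the `https://example.org/resource/` namespace are illustrative assumptions; the thread does not say which license applies to this data.

```turtle
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix :        <https://example.org/resource/> .  # hypothetical namespace

# Rule 1: the license is stated at the distribution level only
:distribution2 a dcat:Distribution ;
    dcat:accessURL <https://data.cssz.cz/sparql> ;
    dcat:accessService :dataservice ;
    dcterms:license <https://creativecommons.org/licenses/by/4.0/> .  # assumed license, for illustration

:dataservice a dcat:DataService ;
    dcat:endpointURL <https://data.cssz.cz/sparql> .
    # Rule 2 (alternative): also state a compatible license on the service,
    # e.g. by adding the following triple:
    # :dataservice dcterms:license <https://creativecommons.org/licenses/by/4.0/> .
```

Whichever rule a community picks, it would probably need to be written down in its profile, since DCAT itself does not say which statement wins when the two disagree.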

I think the modelling here was done with quite a broad set of cases in mind. We tried to be as general as possible, as DCAT 2 is expected to be specialized by a wide range of communities. This comes at a cost: communities might need to decide how to profile it and which guidance fits them best in order to maintain consistency.

Then again, in my opinion, a DataService is really just a means of accessing representations of Datasets, therefore, I see it more on the same level with a Distribution rather than on the level of Datasets. However, the fact that both Datasets and DataServices inherit from dcat:Resource, but dcat:Distributions do not, would suggest otherwise. Does the WG envision here that e.g. open data portals start cataloguing Data Services similarly to Datasets, e.g. as first-class citizens of Catalogs, instead of using them at the level of Distributions, i.e. entities dependent on Datasets?

Yes, services are promoted to first-class citizens. Though the focus is primarily on services that provide access, DataServices also include data processing functions. There are also some examples in the DCAT document of DataServices that are not connected to Distributions (EXAMPLE 48 shows a discovery service for a catalogue). In those cases, we might want to specify licenses for the services independently of whether they are directly connected to Distributions or Datasets.
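One way to picture "first-class citizen" is a catalog that lists the service directly via dcat:service, alongside its datasets. This is a sketch rather than a reproduction of the examples in the DCAT document; `:catalog`, the title literal and the `https://example.org/resource/` namespace are hypothetical.

```turtle
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix :        <https://example.org/resource/> .  # hypothetical namespace

# The catalog lists datasets and services side by side
:catalog a dcat:Catalog ;
    dcat:dataset :dataset ;
    dcat:service :dataservice .

:dataset a dcat:Dataset .

# The service carries its own metadata and points at the dataset it serves,
# without any dcat:Distribution referring to it
:dataservice a dcat:DataService ;
    dcterms:title "SPARQL endpoint"@en ;  # hypothetical title
    dcat:endpointURL <https://data.cssz.cz/sparql> ;
    dcat:servesDataset :dataset .
```

Here the link from service to dataset exists through dcat:servesDataset alone, which is the situation where a license on the service itself becomes the only place to state one.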

Finally, it seems a bit confusing that a Data Service serves Datasets, but it is the Distributions of Datasets that are accessed using a Data Service. This may, however, be connected to the point above.

I am not sure I understand the point here. Anyway, it might help to look at EXAMPLE 49, which shows some DataServices that are not connected to Distributions of the Dataset they serve.

The only guidance I found in the document is at the end of 6.7:

Links between a dcat:Distribution and services or Web addresses where it can be accessed are expressed using dcat:accessURL, dcat:accessService, dcat:downloadURL, as shown in Figure 1 and described in the definitions below.

which, btw, seems like a weird sentence.

It just signals that guidance is provided in the subsections describing dcat:accessURL, dcat:accessService and dcat:downloadURL. Further explanations are provided in the examples I mentioned above.

Any thoughts on this? Am I missing something or overthinking this?

Do my comments and the examples I have pointed to help in putting the pieces together?

jakubklimek commented 4 years ago

@riccardoAlbertoni Thank you, this definitely clarifies things.

However, to avoid future confusion by other readers, I would suggest adding some clarification of these issues to the document itself. Maybe an example more concise than 48 and 49 (which are quite extensive and therefore lack clarity).

Also this:

Yes, I would connect the SPARQL endpoint to a different distribution than the TriG file. Something similar is illustrated in EXAMPLE 45.

should be added more prominently, not just given in an example, as I think this will be a major use case in open data portals.

andrea-perego commented 4 years ago

To be moved to future work, as it cannot be addressed at this stage.

riccardoAlbertoni commented 4 years ago

+1 to moving it to future work