w3c / dxwg

Data Catalog Vocabulary (DCAT)
https://w3c.github.io/dxwg/dcat/

Dedicated support for HTTP compliant datasets #1086

Open Aklakan opened 5 years ago

Aklakan commented 5 years ago

I understand that DCAT 2 content is frozen, so this is a feature request to be considered for a future version.

While working with DCAT data catalogs I came across this challenge: the link between datasets and distributions seems to be used pretty much arbitrarily in practice. For example, picking an arbitrary entry from data.gov, I can see a zip file, web resources, and a REST endpoint. In the typical CKAN-DCAT mapping, all these resources become distributions, and my impression is that the DCAT 2 standard (intentionally?) does not impose many restrictions here. Of course, a little semantics goes a long way, but after nearly two decades of Semantic Web, I think many people in the RDF community want to go a bit further.

With this lax modeling, it is impossible to point an application at a (DCAT) dataset and have it do something smart with it.

So what is a dataset in the first place? Section 5.1 (DCAT scope) states:

A dataset in DCAT is defined as a "collection of data, published or curated by a single agent, and available for access or download in one or more serializations or formats".

I would like to make the following proposal:

Dataset descriptions that adhere to these rules can be unambiguously served according to HTTP principles, notably content negotiation, by a DCAT-based HTTP proxy.

As I see it, there is a strong link between HTTP and DCAT: datasets, under the strict definition, correspond to HTTP resources and can therefore be served in a standard way based on catalog metadata. In my impression, the DCAT spec does not yet adequately consider this aspect.

andrea-perego commented 5 years ago

Thanks for your proposal, @Aklakan .

Indeed, DCAT 2 is frozen, so we are assigning this to future work.

Aklakan commented 5 years ago

A quick example to clarify what I mean by the HTTP content negotiation aspect:

Let's say there is a DCAT catalog on the Web with an N-Triples and a Turtle distribution:

my:dataset
        a cat:Dataset ;
        cat:distribution my:dist-as-ttl, my:dist-as-nt .

my:dist-as-ttl        a                cat:Distribution ;
        dc:format "text/turtle" ;
        cat:downloadURL  <https://gitlab.com/.../demo.ttl> .

my:dist-as-nt        a                cat:Distribution ;
        dc:format "application/n-triples" ;
        cat:downloadURL  <https://gitlab.com/.../demo.nt> .

I would then assume that if someone wrote an HTTP server that serves datasets based on DCAT (I call that a data node), a client could do:

curl -X POST \
  -H 'Accept: application/n-triples' \
  'http://localhost/my-datanode?id=my:dataset'

And the data node would choose the appropriate distribution from it:

HTTP/1.1 200 OK
Date: Fri, 11 Oct 2019 19:49:09 GMT
Content-Type: application/n-triples; charset=utf-8
Content-Location: https://gitlab.com/.../demo.nt <--- ntriples served
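
To make the selection step more concrete, here is a minimal Python sketch of how a data node might map an Accept header onto one of the distributions. It is only an illustration under assumptions: the `DISTRIBUTIONS` dictionary and the functions `parse_accept` and `select_distribution` are hypothetical names, the Turtle distribution is keyed by the registered media type `text/turtle`, and the elided download URLs are left elided. The matching is deliberately simplified: it orders media ranges by q-value only and ignores the specificity rules of full HTTP content negotiation.

```python
# Hypothetical in-memory view of the catalog entry: media type -> downloadURL.
# The URL paths are elided in the example above and stay elided here.
DISTRIBUTIONS = {
    "text/turtle": "https://gitlab.com/.../demo.ttl",
    "application/n-triples": "https://gitlab.com/.../demo.nt",
}


def parse_accept(header):
    """Return (media_range, q) pairs from an Accept header, highest q first."""
    ranges = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_range = fields[0].lower()
        if not media_range:
            continue
        q = 1.0  # default quality when no q parameter is given
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q as "not acceptable"
        ranges.append((media_range, q))
    # sorted() is stable, so ranges with equal q keep the client's order.
    return sorted(ranges, key=lambda r: r[1], reverse=True)


def select_distribution(accept, distributions):
    """Pick the (media_type, url) pair matching the Accept header, or None."""
    for media_range, q in parse_accept(accept):
        if q <= 0:
            continue
        for media_type, url in distributions.items():
            if media_range in ("*/*", media_type):
                return media_type, url
            # Type wildcard such as "text/*".
            if media_range.endswith("/*") and media_type.startswith(media_range[:-1]):
                return media_type, url
    return None  # a data node would likely answer 406 Not Acceptable here


if __name__ == "__main__":
    print(select_distribution("application/n-triples", DISTRIBUTIONS))
```

The selected URL is what the data node would report in Content-Location, while the selected media type becomes the Content-Type of the response.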

So this establishes quite a strong link between DCAT and HTTP conneg. I think this is very reasonable behaviour that should be specified in the DCAT spec (or a related one, like DCAT-HTTP). But maybe I am overlooking something, so I'd gladly get opinions on that :)

Of course, there are foreseeable subtleties that a data node has to handle, such as avoiding sending out content locations that cause an HTTP 506 Variant Also Negotiates.