Aklakan opened 5 years ago
Thanks for your proposal, @Aklakan .
Indeed, DCAT 2 is frozen, so we are assigning this to future work.
A quick example to clarify what I mean by the HTTP content negotiation aspect:
Let's say there is a DCAT catalog on the Web with an N-Triples and a Turtle distribution:
my:dataset
    a dcat:Dataset ;
    dcat:distribution my:dist-as-ttl, my:dist-as-nt .

my:dist-as-ttl a dcat:Distribution ;
    dct:format "text/turtle" ;
    dcat:downloadURL <https://gitlab.com/.../demo.ttl> .

my:dist-as-nt a dcat:Distribution ;
    dct:format "application/n-triples" ;
    dcat:downloadURL <https://gitlab.com/.../demo.nt> .
Then I would assume that if someone wrote an HTTP server that serves datasets based on their DCAT descriptions (I call that a data node), a client could do:
curl -H 'Accept: application/n-triples' \
     'http://localhost/my-datanode?id=my:dataset'
And the data node would choose the appropriate distribution from it:
HTTP/1.1 200 OK
Date: Fri, 11 Oct 2019 19:49:09 GMT
Content-Type: application/n-triples; charset=utf-8
Content-Location: https://gitlab.com/.../demo.nt <--- N-Triples served
So this establishes quite a strong link between DCAT and HTTP conneg. I think this is very reasonable behaviour that should be specified in the DCAT spec (or a related one, like DCAT-HTTP). But maybe I am overlooking something, so I'd gladly hear opinions on that :)
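The selection step a data node would perform can be sketched roughly as follows. This is a minimal illustration, not anything the DCAT spec defines: the `pick_distribution` function, the in-memory `CATALOG` structure, and the simplified Accept parsing (q-values ignored) are all assumptions of mine.

```python
# Hypothetical in-memory view of the catalog above, keyed by dataset IRI.
CATALOG = {
    "my:dataset": [
        {"format": "text/turtle",
         "downloadURL": "https://gitlab.com/.../demo.ttl"},
        {"format": "application/n-triples",
         "downloadURL": "https://gitlab.com/.../demo.nt"},
    ]
}

def pick_distribution(dataset_id, accept_header):
    """Return the downloadURL of the first distribution whose media type
    matches the Accept header (q-values ignored for brevity)."""
    # "text/turtle;q=0.9, */*" -> ["text/turtle", "*/*"]
    wanted = [part.split(";")[0].strip() for part in accept_header.split(",")]
    distributions = CATALOG.get(dataset_id, [])
    for media_type in wanted:
        for dist in distributions:
            if media_type in (dist["format"], "*/*"):
                return dist["downloadURL"]
    return None  # the data node would answer HTTP 406 Not Acceptable

print(pick_distribution("my:dataset", "application/n-triples"))
# → https://gitlab.com/.../demo.nt
```

A real data node would additionally honour q-values and emit the chosen URL as Content-Location, as in the exchange above.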
Of course there are foreseeable subtleties that a data node has to handle, such as avoiding Content-Location values that would themselves trigger an HTTP 506 Variant Also Negotiates.
I understand that DCAT 2 content is frozen, so this is a feature request to be considered for a future version.
While working with DCAT data catalogs I came across this challenge: the link between datasets and distributions seems to be used fairly arbitrarily in practice. For example, picking an arbitrary entry from data.gov, I can see a ZIP file, web resources, and a REST endpoint. In the typical CKAN-to-DCAT mapping, all of these resources become distributions, and my impression is that the DCAT 2 standard (intentionally?) does not impose many restrictions here. Of course, a little semantics goes a long way, but after nearly two decades of the Semantic Web, I think many people in the RDF community want to go a bit further.
And with this lax modeling, it is impossible for an application to refer to a (DCAT) dataset and do something smart with it.
So what is a dataset in the first place? There is section 5.1, DCAT scope, which states
I would like to make the following proposal:
Dataset descriptions that adhere to these rules can be unambiguously served according to HTTP principles, notably content negotiation, by a DCAT-based HTTP proxy.
As I see it, there is a strong link between how HTTP works and how datasets, under this strict definition, correspond to HTTP resources that can therefore be served in a standard way based on catalog metadata. In my impression, this aspect is not yet adequately considered in the DCAT spec.