Dedicated support for HTTP compliant datasets

Aklakan commented 5 years ago

I understand that DCAT 2 content is frozen, so this is a feature request to be considered for a future version.

While working with DCAT data catalogs I came across this challenge: the link between datasets and distributions seems to be used pretty much arbitrarily in practice. For example, picking an arbitrary entry from data.gov, I can see a zip file, web resources, and a REST endpoint. In the typical CKAN-DCAT mapping, all these resources become distributions, and my impression is that the DCAT 2 standard (intentionally?) does not impose many restrictions here. Of course, a little semantics goes a long way, but after nearly two decades of Semantic Web, I think many people in the RDF community want to go a bit further.

And with this lax modeling, it is impossible for an application to refer to a (DCAT) dataset and do something smart with it.

So what is a dataset in the first place? Section 5.1, "DCAT scope", states:

A dataset in DCAT is defined as a "collection of data, published or curated by a single agent, and available for access or download in one or more serializations or formats".

I would like to make the following proposal:

Definition: A dataset is an instance of a data model. Note that data model and abstract syntax are synonyms.
A distribution denotes a means of access to a specific instance of the data model.
All distributions of a dataset should provide access to the same dataset. Hence, once a copy of the dataset has been obtained from one distribution, there is no need to fetch further distributions. Alternatively, if one distribution of an RDF dataset (a dataset that is an instance of the RDF model) is a SPARQL endpoint, an application may prefer this distribution over the file download.
A download URL points to a resource that can supply representations whose content types are among the syntactic representations of the abstract syntax: if you have tabular data, the concrete syntaxes are denoted by MIME types such as text/csv or text/tab-separated-values; if you have RDF data, they may be text/turtle, application/n-triples or application/rdf+xml.
If resolution of the download URL does not provide specific HTTP headers (e.g. only application/octet-stream, as for DBpedia downloads), then the response's content type, encoding, charset and language (all standard HTTP headers) may be assumed according to the distribution's DCAT description.
A zip archive by itself is typically NOT a dataset: it is simply an archive, and thus a collection of files. Without further references to standards or metadata, no application can reason about what or where the dataset in a zip archive is. A zip archive could contain a DCAT description of its own content, e.g. in a dcat.ttl file in the root folder. This file could then describe all the CSV, RDF, XML or other files in the archive.
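
The header fallback rule in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `effective_content_type` and the set `GENERIC_TYPES` are hypothetical, and treating a missing header or application/octet-stream as "not specific" is my reading of the proposal, not something DCAT defines.

```python
from typing import Optional

# Assumption: these served values carry no usable information about the syntax.
GENERIC_TYPES = {"", "application/octet-stream"}


def effective_content_type(header_value: Optional[str],
                           dcat_format: Optional[str]) -> Optional[str]:
    """Return the media type a client may assume for a download.

    header_value: the Content-Type header returned by the download URL, if any.
    dcat_format:  the media type declared on the DCAT distribution, if any.
    """
    # Drop parameters such as "; charset=utf-8" and normalise case.
    served = (header_value or "").split(";", 1)[0].strip().lower()
    if served not in GENERIC_TYPES:
        # The server sent something specific, so trust it.
        return served
    # Otherwise fall back to the catalog description; if there is none,
    # return whatever generic value was served (or None if nothing was).
    return dcat_format or (served or None)
```

For example, a download served as application/octet-stream whose distribution declares application/n-triples would be treated as N-Triples, while a response that already says text/csv is left alone.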

Dataset descriptions that adhere to these rules can be served unambiguously according to HTTP principles, notably content negotiation, by a DCAT-based HTTP proxy.

The HTTP proxy internally resolves the URL requested by a client to an entry among a set of DCAT catalogs.
Based on the catalog, the server can automatically provide the appropriate HTTP headers. A smart server can even choose the appropriate download, perform HTTP caching and convert the available syntaxes and encodings to those requested (Turtle to RDF/XML, CSV to TSV or Excel, etc.).
Note that HTTP already describes a mechanism for handling encodings (gzip, bzip2, brotli, etc.).

As I see it, there is a strong link between DCAT and how HTTP functions: datasets, under the strict definition above, correspond to HTTP resources and can therefore be served in a standard way based on catalog metadata. My impression is that this aspect is not yet adequately considered in the DCAT spec.

andrea-perego commented 5 years ago

Thanks for your proposal, @Aklakan .

Indeed, DCAT 2 is frozen, so we are assigning this to future work.

Aklakan commented 5 years ago

A quick example to clarify what I mean by the HTTP content negotiation aspect:

Let's say there is a DCAT catalog on the Web with an N-Triples and a Turtle distribution:

my:dataset
        a cat:Dataset ;
        cat:distribution my:dist-as-ttl, my:dist-as-nt .

my:dist-as-ttl        a                cat:Distribution ;
        dc:format "text/turtle" ;
        cat:downloadURL  <https://gitlab.com/.../demo.ttl> .

my:dist-as-nt        a                cat:Distribution ;
        dc:format "application/n-triples" ;
        cat:downloadURL  <https://gitlab.com/.../demo.nt> .

Then I would assume that if someone wrote an HTTP server that can serve datasets based on DCAT (I call that a data node), a client could do:

curl -X POST \
  -H 'Accept: application/n-triples' \
  'http://localhost/my-datanode?id=my:dataset'

And the data node would choose the appropriate distribution and respond with:

HTTP/1.1 200 OK
Date: Fri, 11 Oct 2019 19:49:09 GMT
Content-Type: application/n-triples; charset=utf-8
Content-Location: https://gitlab.com/.../demo.nt <--- ntriples served
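
To make the selection step concrete, here is a rough Python sketch of how a data node might match an Accept header against the distributions of a dataset. The names `parse_accept`, `choose_distribution` and `DISTRIBUTIONS` are hypothetical, the Turtle entry uses the registered media type text/turtle, and the matching is deliberately simplified: it honours q-values and `*/*` but ignores type wildcards such as `text/*` and the specificity rules of full HTTP content negotiation.

```python
from typing import Dict, List, Optional, Tuple

# Assumed lookup table, as it might be derived from the catalog above:
# media type of each distribution -> its download URL.
DISTRIBUTIONS: Dict[str, str] = {
    "text/turtle": "https://gitlab.com/.../demo.ttl",
    "application/n-triples": "https://gitlab.com/.../demo.nt",
}


def parse_accept(header: str) -> List[Tuple[str, float]]:
    """Parse an Accept header into (media type, q) pairs, highest q first."""
    prefs: List[Tuple[str, float]] = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # default weight when no q parameter is given
        for field in fields[1:]:
            name, _, value = field.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed weight as "not acceptable"
        prefs.append((media_type, q))
    # sorted() is stable, so entries with equal q keep their header order.
    return sorted(prefs, key=lambda pref: pref[1], reverse=True)


def choose_distribution(accept: str,
                        distributions: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Return (media type, download URL) of the best match, or None."""
    for media_type, q in parse_accept(accept):
        if q <= 0:
            continue  # explicitly refused by the client
        if media_type == "*/*" and distributions:
            # Anything goes: take the first distribution listed.
            return next(iter(distributions.items()))
        if media_type in distributions:
            return media_type, distributions[media_type]
    return None  # a real server would answer 406 Not Acceptable here
```

With the table above, `choose_distribution("application/n-triples", DISTRIBUTIONS)` picks the N-Triples download, whose URL the data node could then report in Content-Location.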

So this establishes quite a strong link between DCAT and HTTP conneg. I think this is very reasonable behaviour that should be specified in the DCAT spec (or a related one, like DCAT-HTTP). But maybe I am overlooking something, so I'd gladly get opinions on that :)

Of course there are foreseeable subtleties that a data node has to handle, such as avoiding sending out content locations that cause an HTTP 506 Variant Also Negotiates.

w3c / dxwg

Dedicated support for HTTP compliant datasets #1086