Improve discovery of datacatalogs by registering well-known suffix 'datacatalog'

coret commented 3 years ago

RFC 5785 defines a mechanism for reserving 'well-known' URIs on any Web server. By registering the 'datacatalog' suffix and promoting its use, the discovery of datacatalogs can be improved.

Although this proposal is not DCAT specific (e.g. schema.org/DataCatalog would also benefit), we do seek the support of the DCAT community for this proposal (as well as the schema.org community; therefore a similar issue has been posted at https://github.com/schemaorg/schemaorg/issues/2827).

We have drafted a text which could be included in a specification document (this is highly inspired by https://www.w3.org/TR/void/#well-known):

Discovery with well-known URI

RFC 5785 defines a mechanism for reserving 'well-known' URIs on any Web server.

The URI /.well-known/datacatalog on any Web server is registered by this specification for a datacatalog with dataset descriptions of datasets hosted on that server. For example, on the host www.example.com, this URI would be http://www.example.com/.well-known/datacatalog.

This URI may be an HTTP redirect to the location of the actual datacatalog file. The most appropriate HTTP redirect code is 302. Clients accessing this well-known URI MUST handle HTTP redirects.

The datacatalog file accessible via the well-known URI should contain descriptions of all datasets hosted on the server. This includes any datasets that have resolvable URIs, a SPARQL endpoint, a data dump, or any other access mechanism whose URI is on the server's hostname. Datacatalogs can be described using http://www.w3.org/ns/dcat#Catalog or https://schema.org/DataCatalog.
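The lookup described in the draft text above could be sketched on the client side roughly as follows. This is only an illustration under stated assumptions: the helper names `well_known_catalog_url` and `locate_catalog` are hypothetical, the 'datacatalog' suffix is the one proposed here and is not registered, and the `Accept` header values are an assumption about which serialisations a client might ask for. Python's `urllib.request.urlopen` follows HTTP redirects such as 302 by default, which covers the requirement that clients handle redirects.

```python
from urllib.parse import urlunsplit
from urllib.request import Request, urlopen

# Proposed well-known path from this issue; not a registered suffix.
WELL_KNOWN_PATH = "/.well-known/datacatalog"


def well_known_catalog_url(host: str, scheme: str = "http") -> str:
    """Build the well-known datacatalog URI for a host (hypothetical helper)."""
    return urlunsplit((scheme, host, WELL_KNOWN_PATH, "", ""))


def locate_catalog(host: str, timeout: float = 10.0) -> tuple[str, str]:
    """Fetch the well-known URI and return (final URL, content type).

    urlopen follows redirects automatically, so the returned URL is the
    location of the actual datacatalog file when the server redirects.
    The Accept header is an assumption, not part of the draft text.
    """
    request = Request(
        well_known_catalog_url(host),
        headers={"Accept": "text/turtle, application/ld+json;q=0.9"},
    )
    with urlopen(request, timeout=timeout) as response:
        return response.geturl(), response.headers.get_content_type()
```

For the example host in the draft text, `well_known_catalog_url("www.example.com")` yields the URI given there; `locate_catalog` performs a network request and raises `urllib.error.HTTPError` when a host does not offer the well-known URI.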

Broad support for this proposal will help in getting the 'datacatalog' suffix registered. The registration procedure and template from Section 5.1 of RFC 5785 require a change controller and a specification document. Can this community assist in this process?

andrea-perego commented 3 years ago

Thanks for contributing this proposal, @coret .

We have discussed it during the WG call (https://www.w3.org/2021/02/03-dxwgdcat-minutes#t03), and we would like to ask you if you can elaborate on your use case, to better understand whether this requirement falls within the scope of DCAT.

We checked the issue you point to (https://github.com/netwerk-digitaal-erfgoed/registry-api/issues/36) and your spec (https://netwerk-digitaal-erfgoed.github.io/requirements-datasets/), but we were not able to find enough information.

coret commented 3 years ago

The Dutch Digital Heritage Network (Netwerk Digitaal Erfgoed) is a partnership in the Netherlands that focuses on developing a system of national facilities and services for improving the visibility, usability, and sustainability of digital heritage. The network is open to all institutions and organisations in the digital heritage field. Together we can make the most of our digital heritage and preserve it for future generations.

One of the goals is to get a better view of the available datasets in the digital heritage field. With a better understanding, datasets can be re-used and links between data(sets) can be made; Linked Open Data is an important part of the strategy. The "Register" project stimulates institutions and organisations in the digital heritage field to publish their dataset descriptions (and datacatalogs) online. We formulate requirements (this is where schema.org/Dataset and DCAT Application Profiles play an important role) and educate the organisations and their IT suppliers.

To get the dataset descriptions (and in the long term build a knowledge graph) we have an API which organisations can use to register their dataset descriptions. The system contains a validator (SHACL) and a crawler to get (and frequently update) the dataset descriptions (which are stored in a public triple store). This is the reactive side of our crawler. To make our crawler more proactive in finding dataset descriptions, we can have it check the sites of Dutch heritage organisations. But instead of spidering a whole website (like Google does), it would be more efficient if the location of the datacatalog on a website had a fixed URI. This is where the .well-known/datacatalog scheme can help.

I can imagine a paragraph in the DCAT specification that encourages the use of .well-known/datacatalog as a means to make datacatalogs more discoverable. This would benefit the publishers of datacatalogs and the automated usage of datacatalogs.

andrea-perego commented 3 years ago

Many thanks, @coret .

If I correctly understand, this well-known URI is meant to advertise any data catalogue, irrespective of their thematic content and of the used/supported metadata schema(s). Should this be the case, do you plan to put in place mechanisms (besides harvesting only selected Web sites) to verify (a) if they fit into your domain and (b) if they use a metadata schema you support?

/cc @nicholascar , @rob-metalinkage , @aisaac : Could you please give your perspective on this use case in relation to PROF & CONNEG?

makxdekkers commented 3 years ago

Does this presuppose that a domain can host a maximum of one data catalog?

rob-metalinkage commented 3 years ago

@andrea-perego I think this is largely orthogonal to connegp, which allows resources to self-describe alternative views rather than list different collections. A data catalogue view of the website itself would be an option to avoid having to specify a 'well known' sub-resource.

coret commented 3 years ago

@makxdekkers yes, a well-known URI points (redirects) to one resource (the same as other well-knowns on the IANA Well-Known URIs list). But if I'm not mistaken, a dcat:Catalog can contain multiple catalogs.

coret commented 3 years ago

@rob-metalinkage where on a website could one find a data catalogue view? Is this the root of a website or can this be any URI? In the latter case, well-known is a mechanism to specify a URI which redirects to the resource. .well-known/datacatalog helps machines discover datacatalogs.

coret commented 3 years ago

@andrea-perego

If I correctly understand, this well-known URI is meant to advertise any data catalogue, irrespective of their thematic content and of the used/supported metadata schema(s).

That's correct.

Should this be the case, do you plan to put in place mechanisms (besides harvesting only selected Web sites) to verify (a) if they fit into your domain and (b) if they use a metadata schema you support?

Our crawler will be "confined" to heritage institutions and will be able to process dataset descriptions in DCAT 2 and schema.org/Dataset; the latter will be converted to DCAT so we can more easily query a uniform set of dataset descriptions to get insights. For the .well-known/datacatalog registration I think it's wise not to be too limiting with respect to datacatalog vocabularies.
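As a rough illustration of check (b), a crawler could decide whether a discovered catalog uses a vocabulary it supports by looking at the rdf:type IRIs found in the catalog document. The sketch below assumes those IRIs have already been extracted by an RDF or JSON-LD parser; the function name `detect_vocabulary` and its return labels are hypothetical and not part of any existing crawler.

```python
from typing import Iterable, Optional

# Catalog class IRIs named in the draft text; the http:// variant of the
# schema.org IRI is included as an assumption, since both forms occur in practice.
DCAT_CATALOG = "http://www.w3.org/ns/dcat#Catalog"
SCHEMA_DATACATALOG = {
    "https://schema.org/DataCatalog",
    "http://schema.org/DataCatalog",
}


def detect_vocabulary(type_iris: Iterable[str]) -> Optional[str]:
    """Return 'dcat', 'schema.org', or None for a set of rdf:type IRIs.

    DCAT is checked first because schema.org descriptions would be
    converted to DCAT anyway in the workflow described above.
    """
    types = set(type_iris)
    if DCAT_CATALOG in types:
        return "dcat"
    if types & SCHEMA_DATACATALOG:
        return "schema.org"
    return None
```

A catalog for which `detect_vocabulary` returns `None` would simply be skipped by such a crawler, without that implying anything about the well-known URI itself.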

I would imagine that products like Google Dataset Search would also benefit from the easy discovery of datacatalogs. Google Dataset Search is of course not limited to a domain and handles schema.org/Dataset (preferred) and DCAT (limited).

rob-metalinkage commented 3 years ago

@coret - yes you could have any resource support connegp - you are correct the "well knownedness" is the issue - connegp would certainly be relevant to allow any well known location (either the site root or a known location - or both) to offer multiple different forms of data catalogue - as opposed to having many alternative well known locations for different forms and needing to poll a range of them to find one a client can use.

agreiner commented 3 years ago

@makxdekkers yes, a well-known URI points (redirects) to one resource (the same as other well-knowns on the IANA Well-Known URIs list). But if I'm not mistaken, a dcat:Catalog can contain multiple catalogs.

How would a system know that it is encountering a data catalog that includes other data catalogs and then find those catalogs efficiently?

davebrowning commented 1 year ago

Project/Milestone modified.

Explanation: As DCAT v3 moves through review and hopefully ratification, we want to make sure that open issues and feedback that have yet to be completely addressed are properly recorded and tagged/assigned in GitHub, both to clarify their status and to help review and prioritise them as a source of improvements and new requirements in future DCAT versions.

w3c / dxwg

Improve discovery of datacatalogs by registering well-known suffix 'datacatalog' #1290