netwerk-digitaal-erfgoed / requirements-datasets

Requirements for datasets
https://netwerk-digitaal-erfgoed.github.io/requirements-datasets/

NDE dataset characteristic #28

Open coret opened 3 years ago

coret commented 3 years ago

The NDE Dataset Register is intended for descriptions of heritage datasets (data dumps, APIs, etc.). Currently the only mechanism the register uses to "filter dataset descriptions" is an "access list" of trusted organisations (currently a separate, public graph). When you trust a domain that publishes both heritage and non-heritage datasets, this could lead to 'contamination' of the NDE Dataset Register.
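To make the contamination risk concrete, here is a minimal sketch of a domain-based access list check. The function name, the in-memory set and the example URLs are assumptions for illustration only; the actual register keeps its access list in a separate graph and may match on something other than the hostname.

```python
from urllib.parse import urlparse

# Hypothetical access list of trusted domains (illustration only).
TRUSTED_DOMAINS = {"example.com"}


def is_trusted(dataset_url, trusted_domains=TRUSTED_DOMAINS):
    """Return True when the URL's hostname is on the access list."""
    host = urlparse(dataset_url).hostname or ""
    return host in trusted_domains


# Both descriptions pass, although only the first is heritage-related:
# the check looks at the publisher's domain, not at the dataset itself.
print(is_trusted("http://example.com/dataset/heritage-photos"))
print(is_trusted("http://example.com/dataset/parking-permits"))
print(is_trusted("http://untrusted.example.org/dataset/1"))
```

Because the check only sees the domain, every dataset published under a trusted domain is accepted, which is exactly the gap described above.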

A solution could be to require an "NDE dataset characteristic" in the dataset description. Candidate properties for this characteristic could be theme and keyword. Example triple: <http://example.com/dataset/1> schema:keywords "nde-dataset" .
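A rough sketch of what such a check could look like, assuming the schema.org `keywords` property is the one chosen and that the dataset description has already been parsed into plain (subject, predicate, object) tuples. The constant names, the function name and the namespace IRI form are assumptions, not part of the current requirements.

```python
# Assumed property IRI and marker string (illustration only).
SCHEMA_KEYWORDS = "https://schema.org/keywords"
NDE_MARKER = "nde-dataset"


def has_nde_marker(triples, dataset_iri):
    """Return True if the dataset carries the marker keyword."""
    return any(
        s == dataset_iri and p == SCHEMA_KEYWORDS and o == NDE_MARKER
        for s, p, o in triples
    )


description = [
    ("http://example.com/dataset/1", SCHEMA_KEYWORDS, "maps"),
    ("http://example.com/dataset/1", SCHEMA_KEYWORDS, "nde-dataset"),
    ("http://example.com/dataset/2", SCHEMA_KEYWORDS, "parking"),
]

print(has_nde_marker(description, "http://example.com/dataset/1"))
print(has_nde_marker(description, "http://example.com/dataset/2"))
```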

This would mean a breaking change to the dataset description requirements. Not only would the chosen property become required, it would also have to contain a specific string. This will "raise the bar" for valid dataset descriptions.

This breaking change should be communicated to dataset supplying organisations and their IT suppliers prior to implementation.

This NDE dataset characteristic requirement might make it harder to get some (international) dataset descriptions into the NDE Dataset Register, like http://vocab.getty.edu/aat and Wikidata. In particular this might be a showstopper for the use of the NDE Dataset Register by the Network of Terms (@sdevalk @EnnoMeijers ?)

EnnoMeijers commented 3 years ago

Mind you that dataset descriptions do not necessarily have to be published by the publisher of the dataset itself. For datasets coming from outside the NDE network, additional dataset descriptions can be published by an NDE organization when they are regarded as relevant for use within the network. In this case a third party functions as a proxy for the dataset description. We might consider defining a limited group of organizations that we allow to act as proxy. The maintainer of the registry would be a logical candidate for this role.

For the breaking part: we might consider a grace period and assume for the time being that all proposed dataset descriptions are of type 'nde-dataset' until sufficient uptake is in place to require the presence of this keyword explicitly. In the meantime we can generate a warning that states that this will be a required property in the near future.
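The grace period could be sketched roughly as follows. The function name, the `enforce` flag and the message texts are hypothetical; the idea is only that a missing keyword yields a warning during the grace period and an error once the requirement is enforced.

```python
# Assumed marker string (illustration only).
NDE_MARKER = "nde-dataset"


def check_marker(keywords, enforce=False):
    """Return a (level, message) pair for a dataset description.

    keywords: the keyword strings found in the description.
    enforce:  False during the grace period, True once the keyword is required.
    """
    if NDE_MARKER in keywords:
        return ("ok", "")
    if enforce:
        return ("error", "Required keyword 'nde-dataset' is missing.")
    return ("warning", "Keyword 'nde-dataset' will be required in the near future.")


print(check_marker(["maps"]))                # grace period: warning only
print(check_marker(["maps"], enforce=True))  # after the grace period: error
print(check_marker(["maps", "nde-dataset"], enforce=True))
```

Switching from warning to error then becomes a single configuration change once uptake is sufficient.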

bencomp commented 2 years ago

What makes a dataset a heritage dataset? Should you maybe look at the contents of the dataset instead of a specific tag to determine if it's suitable for the registry? An unknowing actor (or even a bad actor) could add a triple to any dataset and bypass the suggested filter.

coret commented 2 years ago

What makes a dataset a heritage dataset?

Currently the thinking leans towards datasets of heritage organisations. But this is too limited: consider Wikidata and the Kadaster. I wouldn't classify these as heritage organisations, but they have interesting datasets that are (clearly, or at least somewhat) related to heritage.

For example, take the BRK (Basisregistratie Kadaster). Not really a heritage dataset, but a local Timemachine project might use this data to find plots which have not changed since 1832 (this is a real use case).

Maybe the risk of "contamination" is negligible. The purpose of the Dataset Register is to increase the findability of heritage datasets. Maybe the "non-heritage" datasets could be perceived as noise by most, but as "gold" by some.

EnnoMeijers commented 2 years ago

In addition to Bob's answer: We are also working on a pipeline that analyses the contents of (linked) datasets and writes summaries with characteristics to our Knowledge Graph. The KG can be queried to discover relevant datasets for further processing and for building end-user services. At the level of the dataset registry we would like to keep the metadata straightforward so institutes can easily provide it.