The NDE Dataset Knowledge Graph helps researchers, software developers, and others find relevant datasets for their projects. It consists of Dataset Summaries that provide statistical information about datasets.
This repository is the data pipeline that generates the Knowledge Graph.
To query the Knowledge Graph, use the SPARQL endpoint at https://triplestore.netwerkdigitaalerfgoed.nl/repositories/dataset-knowledge-graph.
Some example queries are available (make sure to select the dataset-knowledge-graph repository at the top right).
This datastory shows more queries against the Knowledge Graph.
The Knowledge Graph contains Dataset Summaries that answer questions about a dataset's size, structure, vocabularies, and outgoing links. The Summaries can be consulted by users such as data platform builders to help them find relevant datasets.
It is built on top of the Dataset Register, which contains dataset descriptions as supplied by their owners. These descriptions include distributions, i.e. URLs where the data can be retrieved.
To build the Summaries, the Knowledge Graph Pipeline applies SPARQL queries against RDF distributions, either directly in case of SPARQL endpoints or by loading the data first in case of RDF data dumps. Where needed, the SPARQL results are post-processed in code.
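The choice between querying a SPARQL endpoint directly and loading a data dump first can be sketched as follows. This is a minimal illustration with hypothetical names and media types, not the pipeline's actual code:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Distribution:
    access_url: str
    media_type: str  # hypothetical: taken from the dataset description

def choose_strategy(distributions: List[Distribution]) -> Optional[Tuple[str, Distribution]]:
    """Prefer querying a SPARQL endpoint directly; fall back to loading an RDF dump."""
    for d in distributions:
        if d.media_type == "application/sparql-query":  # assumed marker for endpoints
            return ("query-endpoint", d)
    for d in distributions:
        if d.media_type in ("application/n-triples", "text/turtle"):
            return ("load-dump", d)
    return None  # no RDF distribution we can analyze
```

The actual pipeline inspects the dataset description's distributions in a similar spirit, but the media-type values above are assumptions for illustration.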
The pipeline produces a set of Dataset Summaries, with VoID as their data model.
The overall size of the dataset: the number of triples and the number of unique subjects, predicates, literal objects, and URI objects.
<http://data.bibliotheken.nl/id/dataset/rise-alba> a void:Dataset;
void:triples 6119677;
void:distinctSubjects 53434;
void:properties 943;
nde:distinctObjectsLiteral 2125;
nde:distinctObjectsURI 32323.
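To illustrate what these counts mean, here is a toy sketch that computes the same statistics from a list of triples in plain Python (the pipeline itself computes them with SPARQL queries; the URI-vs-literal test below is a simplification):

```python
def void_statistics(triples):
    """Compute VoID-style size statistics from (subject, predicate, object) tuples.
    Simplification: an object counts as a URI if it starts with 'http'."""
    subjects = {s for s, _, _ in triples}
    predicates = {p for _, p, _ in triples}
    uri_objects = {o for _, _, o in triples if o.startswith("http")}
    literal_objects = {o for _, _, o in triples if not o.startswith("http")}
    return {
        "void:triples": len(triples),
        "void:distinctSubjects": len(subjects),
        "void:properties": len(predicates),
        "nde:distinctObjectsLiteral": len(literal_objects),
        "nde:distinctObjectsURI": len(uri_objects),
    }
```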
The RDF subject classes that occur in the dataset, and for each class, the number of instances.
<http://data.bibliotheken.nl/id/dataset/rise-alba> a void:Dataset;
void:classPartition [
void:class schema:VisualArtWork;
void:entities 312000;
],
[
void:class schema:Person;
void:entities 980;
].
The predicates that occur in the dataset, and for each predicate, the number of entities that have that predicate as well as the number of distinct objects.
<http://data.bibliotheken.nl/id/dataset/rise-alba> a void:Dataset;
void:propertyPartition [
void:property schema:name;
void:entities 203000; # 203,000 resources have a schema:name.
void:distinctObjects 20000; # These resources have 20,000 unique names in total.
],
[
void:property schema:birthDate;
void:entities 19312;
void:distinctObjects 19312;
].
The predicates per subject class, and for each predicate, the number of entities that have that predicate as well as the number of distinct objects.
Nest a void:propertyPartition in a void:classPartition:
<http://data.bibliotheken.nl/id/dataset/rise-alba> a void:Dataset;
void:classPartition [
void:class schema:Person;
void:propertyPartition [
void:property schema:name; # This partition is about schema:Persons with a schema:name.
void:entities 155; # 155 persons have a name.
void:distinctObjects 205; # These 155 persons have a total of 205 unique names, because some persons have multiple names.
],
[
void:property schema:birthDate;
void:entities 76;
void:distinctObjects 76;
]
],
[
void:class schema:VisualArtWork;
void:propertyPartition [
void:property schema:name;
void:entities 1200;
void:distinctObjects 1200;
],
[
void:property schema:image;
void:entities 52;
void:distinctObjects 20;
]
].
Outgoing links to terminology sources in the Network of Terms, modelled as void:Linksets:
[] a void:Linkset;
void:subjectsTarget <http://data.bibliotheken.nl/id/dataset/rise-alba>;
void:objectsTarget <http://data.bibliotheken.nl/id/dataset/persons>;
void:triples 434.
[] a void:Linkset;
void:subjectsTarget <http://data.bibliotheken.nl/id/dataset/rise-alba>;
void:objectsTarget <https://data.cultureelerfgoed.nl/term/id/cht>;
void:triples 9402.
Matching uses a fixed list of URI prefixes from the Network of Terms, supplemented by a custom list in the pipeline itself.
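The prefix matching can be sketched like this. The prefix list below is a hypothetical excerpt for illustration; the pipeline combines prefixes from the Network of Terms with its own custom list:

```python
# Hypothetical excerpt of terminology-source URI prefixes (not the pipeline's real list).
TERMINOLOGY_PREFIXES = {
    "https://data.cultureelerfgoed.nl/term/id/cht/": "Cultuurhistorische Thesaurus",
    "http://vocab.getty.edu/aat/": "Art & Architecture Thesaurus",
}

def match_terminology_source(uri: str):
    """Return the terminology source a URI belongs to, or None if it matches no prefix."""
    for prefix, name in TERMINOLOGY_PREFIXES.items():
        if uri.startswith(prefix):
            return name
    return None
```

Counting the matched object URIs per source yields the void:triples value of each Linkset.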
The vocabularies that the dataset’s predicates refer to:
<http://data.bibliotheken.nl/id/dataset/rise-alba> a void:Dataset;
void:vocabulary <http://schema.org>, <http://xmlns.com/foaf/0.1/>.
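A common heuristic for deriving the vocabulary from a predicate is to cut off the local name at the last '#' or '/'. This sketch illustrates that heuristic; the pipeline's exact rules may differ (note, for instance, that the example above lists http://schema.org without a trailing slash):

```python
def vocabulary_of(predicate: str) -> str:
    """Return the namespace of a predicate URI by cutting at the last '#' or '/'."""
    if "#" in predicate:
        return predicate.rpartition("#")[0] + "#"
    return predicate.rpartition("/")[0] + "/"
```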
Licenses that apply to resources in the dataset.
<https://example.com/dataset> a void:Dataset;
void:subset [
dcterms:license <http://creativecommons.org/publicdomain/mark/1.0/>;
void:triples 120
].
All declared RDF distributions are validated by executing a lightweight SELECT * { ?s ?p ?o } LIMIT 1 query against them. If the distributions are valid, they are stored in void:sparqlEndpoint and/or void:dataDump triples:
<https://lod.uba.uva.nl/UB-UVA/Books>
void:sparqlEndpoint <https://lod.uba.uva.nl/UB-UVA/Catalogue/sparql/> ;
void:dataDump <https://lod.uba.uva.nl/_api/datasets/UB-UVA/Books/download.nt.gz?> .
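For a SPARQL endpoint, the validation probe amounts to sending the lightweight query over the SPARQL protocol. A sketch of building such a probe request with the standard library (the function name is hypothetical):

```python
from urllib.parse import urlencode

PROBE_QUERY = "SELECT * { ?s ?p ?o } LIMIT 1"

def probe_request_url(endpoint: str) -> str:
    """Build a GET URL for the lightweight validation query against a SPARQL endpoint."""
    return endpoint + "?" + urlencode({"query": PROBE_QUERY})
```

A 200 response with a non-empty result would mark the distribution as valid; any error status marks it invalid.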
The Schema.org ontology supplements VoID with additional details about the distributions, retrieved from the HTTP HEAD response where available:
<https://lod.uba.uva.nl/_api/datasets/UB-UVA/Books/download.nt.gz?>
<https://schema.org/dateModified> "2023-11-03T23:55:38.000Z"^^<http://www.w3.org/2001/XMLSchema#dateTime>;
<https://schema.org/contentSize> 819617127.

Each validation attempt is recorded as a schema:Action, with the distribution or SPARQL endpoint as its target and, if successful, also as its result:
[] a <https://schema.org/Action>;
<https://schema.org/target> <https://lod.uba.uva.nl/UB-UVA/Catalogue/sparql/>;
<https://schema.org/result> <https://lod.uba.uva.nl/UB-UVA/Catalogue/sparql/>.
[] a <https://schema.org/Action>;
<https://schema.org/target> <https://lod.uba.uva.nl/_api/datasets/UB-UVA/Books/download.nt.gz?>;
<https://schema.org/result> <https://lod.uba.uva.nl/_api/datasets/UB-UVA/Books/download.nt.gz?>.
If a distribution is invalid, a schema:error triple indicates the HTTP status code:
[] a <https://schema.org/Action>;
<https://schema.org/target> <https://www.openarchieven.nl/foundlinks/linkset/33ff3fa4744db564807b99dbc4a3d012.nt.gz>;
<https://schema.org/error> <https://www.w3.org/2011/http-statusCodes#NotFound>.
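The mapping from an HTTP status code to a W3C http-statusCodes URI can be sketched as follows, assuming the fragment name is the status reason phrase without spaces (as in the NotFound example above):

```python
from http import HTTPStatus

def status_code_uri(code: int) -> str:
    """Map an HTTP status code to its URI in the W3C http-statusCodes vocabulary.
    Assumes the fragment is the reason phrase with spaces removed."""
    phrase = HTTPStatus(code).phrase  # e.g. "Not Found"
    return "https://www.w3.org/2011/http-statusCodes#" + phrase.replace(" ", "")
```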
A number of example resources from the dataset are included as void:exampleResource:

<http://data.bibliotheken.nl/id/dataset/rise-alba> a void:Dataset;
void:exampleResource <http://data.bibliotheken.nl/doc/alba/p418213178>,
<http://data.bibliotheken.nl/doc/alba/p416673600>.
To run the pipeline yourself, start by cloning this repository. Then execute:
npm install
npm run dev
The Dataset Summaries output will be written to the output/ directory.
The pipeline consists of the following steps:
1. Select dataset descriptions with RDF distributions from the Dataset Register.
2. If the dataset has no SPARQL endpoint distribution, load the data from an RDF dump distribution, if available.
3. Apply Analyzers, either to the dataset provider's SPARQL endpoint or to our own endpoint where we loaded the data. Analyzers are SPARQL CONSTRUCT queries, wrapped in code where needed to extract more detailed information. Analyzers output their results as triples in the VoID vocabulary.
4. Write the analysis results to local files and a triple store.
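The steps above can be sketched as a simple orchestration loop. This is illustrative only; the function and key names are hypothetical, and the real pipeline is the TypeScript code in this repository:

```python
def run_pipeline(dataset_descriptions, analyzers):
    """Illustrative loop: pick a queryable source per dataset, run analyzers, collect results."""
    results = []
    for dataset in dataset_descriptions:
        endpoint = dataset.get("sparqlEndpoint")
        if endpoint is None and dataset.get("dataDump"):
            # Stand-in for loading the dump into our own triple store first.
            endpoint = "local:" + dataset["dataDump"]
        if endpoint is None:
            continue  # no RDF distribution we can analyze
        for analyze in analyzers:
            results.extend(analyze(endpoint, dataset))
    return results
```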