The NDE Dataset Knowledge Graph helps researchers, software developers, and others find relevant datasets for their projects. It consists of Dataset Summaries that provide statistical information about datasets.
This repository is the data pipeline that generates the Knowledge Graph.
To query the Knowledge Graph, use the SPARQL endpoint at https://triplestore.netwerkdigitaalerfgoed.nl/repositories/dataset-knowledge-graph.
Some example queries are available (make sure to select the dataset-knowledge-graph repository at the top right).
This datastory shows more queries against the Knowledge Graph.
The Knowledge Graph contains Dataset Summaries that answer questions about a dataset's size, structure, vocabularies, and outgoing links. The Summaries can be consulted by users such as data platform builders to help them find relevant datasets.
It is built on top of the Dataset Register, which contains dataset descriptions as supplied by their owners. These descriptions include distributions, i.e. URLs where the data can be retrieved.
To build the Summaries, the Knowledge Graph Pipeline applies SPARQL queries against RDF distributions, either directly in case of SPARQL endpoints or by loading the data first in case of RDF data dumps. Where needed, the SPARQL results are post-processed in code.
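The choice between querying a SPARQL endpoint directly and loading a data dump first can be sketched as follows. This is a minimal illustration with hypothetical names and media types, not the pipeline's actual code:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Distribution:
    access_url: str
    media_type: str  # hypothetical: taken from the dataset description

def choose_strategy(distributions: List[Distribution]) -> Optional[Tuple[str, Distribution]]:
    """Prefer querying a SPARQL endpoint directly; fall back to loading an RDF dump."""
    for d in distributions:
        if d.media_type == "application/sparql-query":  # assumed marker for endpoints
            return ("query-endpoint", d)
    for d in distributions:
        if d.media_type in ("application/n-triples", "text/turtle"):
            return ("load-dump", d)
    return None  # no RDF distribution we can analyze
```

The actual pipeline inspects the dataset description's distributions in a similar spirit, but the media-type values above are assumptions for illustration.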
The pipeline produces a set of Dataset Summaries, with VoID as their data model.
The overall size of the dataset: the number of triples and the number of unique subjects, predicates, literal objects, and URI objects.
<http://data.bibliotheken.nl/id/dataset/rise-alba> a void:Dataset;
void:triples 6119677;
void:distinctSubjects 53434;
void:properties 943;
nde:distinctObjectsLiteral 2125;
nde:distinctObjectsURI 32323.
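To illustrate what these counts mean, here is a toy sketch that computes the same statistics from a list of triples in plain Python (the pipeline itself computes them with SPARQL queries; the URI-vs-literal test below is a simplification):

```python
def void_statistics(triples):
    """Compute VoID-style size statistics from (subject, predicate, object) tuples.
    Simplification: an object counts as a URI if it starts with 'http'."""
    subjects = {s for s, _, _ in triples}
    predicates = {p for _, p, _ in triples}
    uri_objects = {o for _, _, o in triples if o.startswith("http")}
    literal_objects = {o for _, _, o in triples if not o.startswith("http")}
    return {
        "void:triples": len(triples),
        "void:distinctSubjects": len(subjects),
        "void:properties": len(predicates),
        "nde:distinctObjectsLiteral": len(literal_objects),
        "nde:distinctObjectsURI": len(uri_objects),
    }
```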
The RDF subject classes that occur in the dataset, and for each class, the number of instances.
<http://data.bibliotheken.nl/id/dataset/rise-alba> a void:Dataset;
void:classPartition [
void:class schema:VisualArtWork;
void:entities 312000;
],
[
void:class schema:Person;
void:entities 980;
].
The predicates that occur in the dataset, and for each predicate, the number of entities that have that predicate as well as the number of distinct objects.
<http://data.bibliotheken.nl/id/dataset/rise-alba> a void:Dataset;
void:propertyPartition [
void:property schema:name;
void:entities 203000; # 203,000 resources have a schema:name.
void:distinctObjects 20000; # These resources have 20,000 unique names in total.
],
[
void:property schema:birthDate;
void:entities 19312;
void:distinctObjects 19312;
].
The predicates per subject class, and for each predicate, the number of entities that have that predicate as well as the number of distinct objects.
Nest a void:propertyPartition in a void:classPartition:
<http://data.bibliotheken.nl/id/dataset/rise-alba> a void:Dataset;
void:classPartition [
void:class schema:Person;
void:propertyPartition [
void:property schema:name; # This partition is about schema:Persons with a schema:name.
void:entities 155; # 155 persons have a name.
void:distinctObjects 205; # These 155 persons have a total of 205 unique names, because some persons have multiple names.
],
[
void:property schema:birthDate;
void:entities 76;
void:distinctObjects 76;
]
],
[
void:class schema:VisualArtWork;
void:propertyPartition [
void:property schema:name;
void:entities 1200;
void:distinctObjects 1200;
],
[
void:property schema:image;
void:entities 52;
void:distinctObjects 20;
]
].
Outgoing links to terminology sources in the Network of Terms, modelled as void:Linksets:
[] a void:Linkset;
void:subjectsTarget <http://data.bibliotheken.nl/id/dataset/rise-alba>;
void:objectsTarget <http://data.bibliotheken.nl/id/dataset/persons>;
void:triples 434.
[] a void:Linkset;
void:subjectsTarget <http://data.bibliotheken.nl/id/dataset/rise-alba>;
void:objectsTarget <https://data.cultureelerfgoed.nl/term/id/cht>;
void:triples 9402.
Matching uses a fixed list of URI prefixes from the Network of Terms, supplemented by a custom list in the pipeline itself.
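The prefix matching can be sketched like this. The prefix list below is a hypothetical excerpt for illustration; the pipeline combines prefixes from the Network of Terms with its own custom list:

```python
# Hypothetical excerpt of terminology-source URI prefixes (not the pipeline's real list).
TERMINOLOGY_PREFIXES = {
    "https://data.cultureelerfgoed.nl/term/id/cht/": "Cultuurhistorische Thesaurus",
    "http://vocab.getty.edu/aat/": "Art & Architecture Thesaurus",
}

def match_terminology_source(uri: str):
    """Return the terminology source a URI belongs to, or None if it matches no prefix."""
    for prefix, name in TERMINOLOGY_PREFIXES.items():
        if uri.startswith(prefix):
            return name
    return None
```

Counting the matched object URIs per source yields the void:triples value of each Linkset.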
The vocabularies that the dataset’s predicates refer to:
<http://data.bibliotheken.nl/id/dataset/rise-alba> a void:Dataset;
void:vocabulary <http://schema.org>, <http://xmlns.com/foaf/0.1/>.
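A common heuristic for deriving the vocabulary from a predicate is to cut off the local name at the last '#' or '/'. This sketch illustrates that heuristic; the pipeline's exact rules may differ (note, for instance, that the example above lists http://schema.org without a trailing slash):

```python
def vocabulary_of(predicate: str) -> str:
    """Return the namespace of a predicate URI by cutting at the last '#' or '/'."""
    if "#" in predicate:
        return predicate.rpartition("#")[0] + "#"
    return predicate.rpartition("/")[0] + "/"
```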
Licenses that apply to resources in the dataset.
<https://example.com/dataset> a void:Dataset;
void:subset [
dcterms:license <http://creativecommons.org/publicdomain/mark/1.0/>;
void:triples 120
].
All declared RDF distributions are validated by executing a lightweight SELECT * { ?s ?p ?o } LIMIT 1 query against them. If the distributions are valid, they are stored in void:sparqlEndpoint and/or void:dataDump triples:
<https://lod.uba.uva.nl/UB-UVA/Books>
void:sparqlEndpoint <https://lod.uba.uva.nl/UB-UVA/Catalogue/sparql/> ;
void:dataDump <https://lod.uba.uva.nl/_api/datasets/UB-UVA/Books/download.nt.gz?> .
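For a SPARQL endpoint, the validation probe amounts to sending the lightweight query over the SPARQL protocol. A sketch of building such a probe request with the standard library (the function name is hypothetical):

```python
from urllib.parse import urlencode

PROBE_QUERY = "SELECT * { ?s ?p ?o } LIMIT 1"

def probe_request_url(endpoint: str) -> str:
    """Build a GET URL for the lightweight validation query against a SPARQL endpoint."""
    return endpoint + "?" + urlencode({"query": PROBE_QUERY})
```

A 200 response with a non-empty result would mark the distribution as valid; any error status marks it invalid.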
The Schema.org ontology supplements VoID with additional details about the distributions, retrieved from the HTTP HEAD response where available:
<https://lod.uba.uva.nl/_api/datasets/UB-UVA/Books/download.nt.gz?>
<https://schema.org/dateModified> "2023-11-03T23:55:38.000Z"^^<http://www.w3.org/2001/XMLSchema#dateTime>;
<https://schema.org/contentSize> 819617127.

Each validation attempt is recorded as a schema:Action, with the distribution or SPARQL endpoint as its target and, if successful, also as its result:
[] a <https://schema.org/Action>;
<https://schema.org/target> <https://lod.uba.uva.nl/UB-UVA/Catalogue/sparql/>;
<https://schema.org/result> <https://lod.uba.uva.nl/UB-UVA/Catalogue/sparql/>.
[] a <https://schema.org/Action>;
<https://schema.org/target> <https://lod.uba.uva.nl/_api/datasets/UB-UVA/Books/download.nt.gz?>;
<https://schema.org/result> <https://lod.uba.uva.nl/_api/datasets/UB-UVA/Books/download.nt.gz?>.
If a distribution is invalid, a schema:error triple indicates the HTTP status code:
[] a <https://schema.org/Action>;
<https://schema.org/target> <https://www.openarchieven.nl/foundlinks/linkset/33ff3fa4744db564807b99dbc4a3d012.nt.gz>;
<https://schema.org/error> <https://www.w3.org/2011/http-statusCodes#NotFound>.
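The mapping from an HTTP status code to a W3C http-statusCodes URI can be sketched as follows, assuming the fragment name is the status reason phrase without spaces (as in the NotFound example above):

```python
from http import HTTPStatus

def status_code_uri(code: int) -> str:
    """Map an HTTP status code to its URI in the W3C http-statusCodes vocabulary.
    Assumes the fragment is the reason phrase with spaces removed."""
    phrase = HTTPStatus(code).phrase  # e.g. "Not Found"
    return "https://www.w3.org/2011/http-statusCodes#" + phrase.replace(" ", "")
```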
A number of example resources from the dataset are included as void:exampleResource:

<http://data.bibliotheken.nl/id/dataset/rise-alba> a void:Dataset;
void:exampleResource <http://data.bibliotheken.nl/doc/alba/p418213178>,
<http://data.bibliotheken.nl/doc/alba/p416673600>.
To run the pipeline yourself, start by cloning this repository. Then execute:
npm install
npm run dev
The Dataset Summaries output will be written to the output/ directory.
The pipeline consists of the following steps:
1. Select dataset descriptions with RDF distributions from the Dataset Register.
2. If the dataset has no SPARQL endpoint distribution, load the data from an RDF dump distribution, if available.
3. Apply Analyzers, either to the dataset provider's SPARQL endpoint or to our own endpoint where we loaded the data. Analyzers are SPARQL CONSTRUCT queries, wrapped in code where needed to extract more detailed information. Analyzers output their results as triples in the VoID vocabulary.
4. Write the analysis results to local files and a triple store.
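The steps above can be sketched as a simple orchestration loop. This is illustrative only; the function and key names are hypothetical, and the real pipeline is the TypeScript code in this repository:

```python
def run_pipeline(dataset_descriptions, analyzers):
    """Illustrative loop: pick a queryable source per dataset, run analyzers, collect results."""
    results = []
    for dataset in dataset_descriptions:
        endpoint = dataset.get("sparqlEndpoint")
        if endpoint is None and dataset.get("dataDump"):
            # Stand-in for loading the dump into our own triple store first.
            endpoint = "local:" + dataset["dataDump"]
        if endpoint is None:
            continue  # no RDF distribution we can analyze
        for analyze in analyzers:
            results.extend(analyze(endpoint, dataset))
    return results
```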