netwerk-digitaal-erfgoed / dataset-knowledge-graph

Pipeline that generates the NDE Dataset Knowledge Graph
European Union Public License 1.2

Data model for results #11

Closed by ddeboer 1 year ago

ddeboer commented 2 years ago

This is a proposal for what the dataset summaries could look like. It is based on the statistics terms of the VoID vocabulary: https://www.w3.org/TR/void/#statistics.

Dataset summaries

Size

<http://data.bibliotheken.nl/id/dataset/rise-alba> a void:Dataset;
    void:distinctSubjects 53434; 
    void:distinctObjects 32323;
    void:properties 943;
    void:entities 8493. # To be an entity in a dataset, a resource must have a URI, and the URI must match the dataset's void:uriRegexPattern, if any. 
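
As a rough illustration of how these four numbers relate to the triples, here is a minimal Python sketch. It is not the pipeline's implementation: the function names, the plain-string triple representation, the `_:` blank-node convention and the choice to count only subjects as entities are all assumptions made for the example.

```python
import re
from typing import Dict, Iterable, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object) as plain strings


def is_entity(resource: str, uri_regex_pattern: Optional[str] = None) -> bool:
    """Apply the VoID entity rule to a single resource."""
    # Assumption: blank nodes are written with a "_:" prefix and have no URI.
    if resource.startswith("_:"):
        return False
    # Without a void:uriRegexPattern, every URI counts as an entity.
    if uri_regex_pattern is None:
        return True
    return re.search(uri_regex_pattern, resource) is not None


def size_summary(
    triples: Iterable[Triple], uri_regex_pattern: Optional[str] = None
) -> Dict[str, int]:
    """Compute the size statistics for one dataset (hypothetical helper)."""
    triples = list(triples)
    subjects = {s for s, _, _ in triples}
    return {
        "void:distinctSubjects": len(subjects),
        "void:distinctObjects": len({o for _, _, o in triples}),
        "void:properties": len({p for _, p, _ in triples}),
        # Assumption: only resources in subject position are counted as entities.
        "void:entities": sum(1 for s in subjects if is_entity(s, uri_regex_pattern)),
    }
```

Passing a pattern such as `r"^http://example\.org/"` restricts `void:entities` to subjects in that namespace; leaving it out counts every subject that has a URI.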

Classes

<http://data.bibliotheken.nl/id/dataset/rise-alba> a void:Dataset;
    void:classPartition [
        void:class schema:VisualArtwork;
        void:entities 312000;
    ],
    [
        void:class schema:Person;
        void:entities 980;
    ].
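
The per-class entity counts could be derived along these lines. This is a sketch under assumptions, not the pipeline's code: `class_partitions` is a hypothetical helper, triples are plain string tuples, and an entity is counted once per class no matter how often its type is asserted.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def class_partitions(triples: Iterable[Tuple[str, str, str]]) -> Dict[str, int]:
    """Map each class URI to the number of distinct subjects typed with it."""
    members: Dict[str, Set[str]] = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == RDF_TYPE:
            # A set per class keeps repeated type assertions from inflating the count.
            members[obj].add(subject)
    return {cls: len(subjects) for cls, subjects in members.items()}
```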

Properties

<http://data.bibliotheken.nl/id/dataset/rise-alba> a void:Dataset;
    void:propertyPartition [
        void:property schema:name;
        void:triples 203000;
    ],
    [
        void:property schema:birthDate;
        void:triples 19312;
    ].

Property density per subject type

Nest a void:propertyPartition in void:classPartition:


<http://data.bibliotheken.nl/id/dataset/rise-alba> a void:Dataset;
    void:classPartition [
        void:class schema:Person;
        void:propertyPartition [
            void:property schema:name;
            void:triples 155;
        ],
        [
            void:property schema:birthDate;
            void:triples 76;
        ]
    ],
    [
        void:class schema:VisualArtwork;
        void:propertyPartition [
            void:property schema:name;
            void:triples 1200;
        ],
        [
            void:property schema:image;
            void:triples 52;
        ]
    ].
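
To make the nesting concrete, the following Python sketch counts triples per property within each class. It is an illustration only: `property_partitions_per_class` is a hypothetical name, triples are plain string tuples, and the sketch assumes that `rdf:type` triples are counted like any other property and that a subject with several types contributes to each of them.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Set, Tuple

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def property_partitions_per_class(
    triples: Iterable[Tuple[str, str, str]]
) -> Dict[str, Dict[str, int]]:
    """Return {class URI: {property URI: number of triples}}."""
    triples = list(triples)

    # First pass: collect the classes of every subject.
    classes_of: Dict[str, Set[str]] = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == RDF_TYPE:
            classes_of[subject].add(obj)

    # Second pass: attribute each triple to all classes of its subject.
    counts: Dict[str, Counter] = defaultdict(Counter)
    for subject, predicate, _ in triples:
        for cls in classes_of.get(subject, ()):
            counts[cls][predicate] += 1

    return {cls: dict(per_property) for cls, per_property in counts.items()}
```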

Outgoing links

We could model these as void:Linksets:

[] a void:Linkset;
    void:subjectsTarget <http://data.bibliotheken.nl/id/dataset/rise-alba>;
    void:objectsTarget <http://data.bibliotheken.nl/id/dataset/persons>;
    void:subset <http://data.bibliotheken.nl/id/dataset/rise-alba>; # The dataset that contains the links.
    void:triples 434 .
[] a void:Linkset;
    void:subjectsTarget <http://data.bibliotheken.nl/id/dataset/rise-alba>;
    void:objectsTarget <https://data.cultureelerfgoed.nl/term/id/cht>;
    void:triples 9402.

To detect these links, match URIs against a fixed list of URI prefixes: the prefixes from the Network of Terms, plus a custom list maintained in the pipeline itself.
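
The prefix matching could look roughly like the sketch below. Everything in it is an assumption for illustration: the function name, the prefix-to-target mapping, the example prefixes, and the rule that the longest matching prefix wins.

```python
from collections import Counter
from typing import Dict, Iterable


def count_outgoing_links(objects: Iterable[str], prefixes: Dict[str, str]) -> Counter:
    """Count object URIs per link target.

    `prefixes` maps a URI prefix to the URI of the dataset or terminology
    source it belongs to (hypothetical structure).
    """
    # Try longer prefixes first so the most specific match wins.
    ordered = sorted(prefixes, key=len, reverse=True)
    counts: Counter = Counter()
    for obj in objects:
        for prefix in ordered:
            if obj.startswith(prefix):
                counts[prefixes[prefix]] += 1
                break  # Each object is counted for at most one target.
    return counts


# Illustrative prefix list; the real lists would come from the Network of Terms
# and from the pipeline's own configuration.
EXAMPLE_PREFIXES = {
    "https://data.cultureelerfgoed.nl/term/id/cht/": "https://data.cultureelerfgoed.nl/term/id/cht",
    "http://example.org/terms/": "http://example.org/terms",
}
```

Each resulting count would correspond to the `void:triples` value of one linkset; objects that match no prefix, including literals, are ignored.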

Vocabularies

<http://data.bibliotheken.nl/id/dataset/rise-alba> a void:Dataset;
    void:vocabulary <https://schema.org/>, <http://www.w3.org/2000/01/rdf-schema#>, <http://xmlns.com/foaf/0.1/>.

Example resources

<http://data.bibliotheken.nl/id/dataset/rise-alba> a void:Dataset;
    void:exampleResource <http://data.bibliotheken.nl/doc/alba/p418213178>, 
        <http://data.bibliotheken.nl/doc/alba/p416673600>.

Provenance

Where should we place provenance information about the analysis results? PROV-O suggests using prov:Entity for analyses. We can track provenance either:

<http://data.bibliotheken.nl/id/dataset/rise-alba> a void:Dataset;
    void:classPartition [
        void:class schema:VisualArtwork;
        void:entities 312000;
        a void:Dataset, prov:Entity;
        prov:wasGeneratedBy [
            a prov:Activity ;
            prov:used "SELECT DISTINCT ?type (COUNT(?type) as ?number) (…)";
        ];
        prov:generatedAtTime "2022-05-03T13:35:23Z"^^xsd:dateTime;
    ].

Questions

EnnoMeijers commented 2 years ago

Thanks David, nice work! I suggest we discuss this setup and your questions more broadly at our team meeting next Tuesday. I will put it on the agenda.