Data model for results

This is a proposal for what the dataset summaries could look like, based on the statistics terms of the VoID vocabulary (https://www.w3.org/TR/void/#statistics).

Dataset summaries

Size

<http://data.bibliotheken.nl/id/dataset/rise-alba> a void:Dataset;
    void:distinctSubjects 53434; 
    void:distinctObjects 32323;
    void:properties 943;
    void:entities 8493. # To be an entity in a dataset, a resource must have a URI, and the URI must match the dataset's void:uriRegexPattern, if any.
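
As a rough illustration of how the size figures above might be derived, the sketch below counts them over an in-memory list of triples. The N-Triples-style string convention, the `size_summary` function, the `uri_pattern` parameter (standing in for void:uriRegexPattern) and the sample data are all assumptions made for this example; a real pipeline would more likely run aggregate queries against the dataset itself.

```python
import re
from typing import Iterable, Optional

# Assumed convention (N-Triples style): URIs are written "<...>",
# blank nodes "_:...", and literals as quoted strings.
Triple = tuple[str, str, str]


def size_summary(triples: Iterable[Triple], uri_pattern: Optional[str] = None) -> dict[str, int]:
    """Count distinct subjects, objects, properties and entities."""
    subjects: set[str] = set()
    objects: set[str] = set()
    properties: set[str] = set()
    entities: set[str] = set()
    regex = re.compile(uri_pattern) if uri_pattern else None
    for s, p, o in triples:
        subjects.add(s)
        objects.add(o)  # literals are counted as objects too
        properties.add(p)
        # Assumption: an entity is a URI subject that matches the pattern, if one is given.
        if s.startswith("<") and (regex is None or regex.search(s[1:-1])):
            entities.add(s)
    return {
        "void:distinctSubjects": len(subjects),
        "void:distinctObjects": len(objects),
        "void:properties": len(properties),
        "void:entities": len(entities),
    }


# Hypothetical sample data, not taken from any real dataset.
sample = [
    ("<http://example.org/p1>", "<https://schema.org/name>", '"Alba"'),
    ("<http://example.org/p1>", "<https://schema.org/birthDate>", '"1507"'),
    ("<http://example.org/p2>", "<https://schema.org/name>", '"Alba"'),
    ("_:b0", "<https://schema.org/name>", '"Unknown"'),
]
summary = size_summary(sample, uri_pattern=r"^http://example\.org/")
```

The blank-node subject counts towards void:distinctSubjects but not towards void:entities, which matches the rule quoted in the comment above.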

Classes

<http://data.bibliotheken.nl/id/dataset/rise-alba> a void:Dataset;
    void:classPartition [
        void:class schema:VisualArtwork;
        void:entities 312000;
    ],
    [
        void:class schema:Person;
        void:entities 980;
    ].

Properties

<http://data.bibliotheken.nl/id/dataset/rise-alba> a void:Dataset;
    void:propertyPartition [
        void:property schema:name;
        void:triples 203000;
    ],
    [
        void:property schema:birthDate;
        void:triples 19312;
    ].

Property density per subject type

Nest a void:propertyPartition in void:classPartition:


<http://data.bibliotheken.nl/id/dataset/rise-alba> a void:Dataset;
    void:classPartition [
        void:class schema:Person;
        void:propertyPartition [
            void:property schema:name;
            void:triples 155;
        ],
        [
            void:property schema:birthDate;
            void:triples 76;
        ]
    ],
    [
        void:class schema:VisualArtwork;
        void:propertyPartition [
            void:property schema:name;
            void:triples 1200;
        ],
        [
            void:property schema:image;
            void:triples 52;
        ]
    ].
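
The nested structure above amounts to a two-level count: first group subjects by their rdf:type, then count triples per property within each group. A minimal sketch of that idea follows, assuming triples are available as plain string tuples; `property_density` and the sample data are hypothetical names for illustration, and leaving rdf:type itself out of the property partitions is an assumption, not something the proposal specifies.

```python
from collections import defaultdict
from typing import Iterable

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

Triple = tuple[str, str, str]


def property_density(triples: Iterable[Triple]) -> dict[str, dict[str, int]]:
    """Return {class: {property: triple count}} for all typed subjects."""
    triples = list(triples)  # the data is traversed twice

    # First pass: record the classes of every subject.
    classes_of: dict[str, set[str]] = defaultdict(set)
    for s, p, o in triples:
        if p == RDF_TYPE:
            classes_of[s].add(o)

    # Second pass: count each triple once per class of its subject.
    counts: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for s, p, o in triples:
        if p == RDF_TYPE:
            continue  # assumption: rdf:type is not reported as a property partition
        for cls in classes_of.get(s, ()):
            counts[cls][p] += 1
    return {cls: dict(props) for cls, props in counts.items()}


# Hypothetical sample data using prefixed names for brevity.
sample = [
    ("ex:p1", RDF_TYPE, "schema:Person"),
    ("ex:p1", "schema:name", "Alba"),
    ("ex:p1", "schema:birthDate", "1507"),
    ("ex:p2", RDF_TYPE, "schema:Person"),
    ("ex:p2", "schema:name", "Rembrandt"),
    ("ex:w1", RDF_TYPE, "schema:VisualArtwork"),
    ("ex:w1", "schema:name", "Portrait"),
]
density = property_density(sample)
```

Each inner dictionary corresponds to the void:propertyPartition entries of one void:classPartition. A subject with several types contributes to every one of its classes.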

Outgoing links

We could model these as void:Linksets:

<http://data.bibliotheken.nl/id/dataset/rise-alba> void:subset [ # The dataset that contains the links has the linkset as a subset.
    a void:Linkset;
    void:subjectsTarget <http://data.bibliotheken.nl/id/dataset/rise-alba>;
    void:objectsTarget <http://data.bibliotheken.nl/id/dataset/persons>;
    void:triples 434
] .
[] a void:Linkset;
    void:subjectsTarget <http://data.bibliotheken.nl/id/dataset/rise-alba>;
    void:objectsTarget <https://data.cultureelerfgoed.nl/term/id/cht>;
    void:triples 9402.

To detect outgoing links, match object URIs against a list of fixed URI prefixes: the prefixes known from the Network of Terms, plus a custom list maintained in the pipeline itself.
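
The prefix matching could be sketched as follows. The prefix table, the example.org URIs and the function names are hypothetical; in the proposal the prefixes would come from the Network of Terms and the pipeline's own list. Preferring the longest matching prefix is a design choice made here so that overlapping prefixes resolve predictably, not something the proposal prescribes.

```python
from collections import Counter
from typing import Iterable, Optional

# Hypothetical table: URI prefix -> URI of the target dataset (void:objectsTarget).
KNOWN_PREFIXES = {
    "https://terms.example.org/cht/": "https://terms.example.org/cht",
    "https://persons.example.org/id/": "https://persons.example.org/dataset",
}


def match_target(object_uri: str, prefixes: dict[str, str]) -> Optional[str]:
    """Return the target for the longest prefix that matches, or None."""
    best: Optional[str] = None
    for prefix in prefixes:
        if object_uri.startswith(prefix) and (best is None or len(prefix) > len(best)):
            best = prefix
    return prefixes[best] if best is not None else None


def count_outgoing_links(object_uris: Iterable[str], prefixes: dict[str, str]) -> Counter:
    """Count link triples per target; the counts map onto void:triples of each linkset."""
    counts: Counter = Counter()
    for uri in object_uris:
        target = match_target(uri, prefixes)
        if target is not None:
            counts[target] += 1
    return counts


# Hypothetical object URIs found in the dataset being analysed.
objects = [
    "https://terms.example.org/cht/123",
    "https://terms.example.org/cht/456",
    "https://persons.example.org/id/1",
    "https://unknown.example.org/x",  # no known prefix, so not counted as a link
]
links = count_outgoing_links(objects, KNOWN_PREFIXES)
```

Each key in `links` would become one void:Linkset, with the count as its void:triples value.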

Vocabularies

<http://data.bibliotheken.nl/id/dataset/rise-alba> a void:Dataset;
    void:vocabulary <https://schema.org/>, <http://www.w3.org/2000/01/rdf-schema#>, <http://xmlns.com/foaf/0.1/>.

Example resources

<http://data.bibliotheken.nl/id/dataset/rise-alba> a void:Dataset;
    void:exampleResource <http://data.bibliotheken.nl/doc/alba/p418213178>, 
        <http://data.bibliotheken.nl/doc/alba/p416673600>.

Provenance

Where should we place provenance information about the analysis results? PROV-O suggests using prov:Entity for analyses. We can track provenance either:

at the level of void:Dataset by declaring each dataset to be a prov:Entity too, which is rather vague;
or, more precisely, at the level of each partition (itself a void:Dataset).

<http://data.bibliotheken.nl/id/dataset/rise-alba> a void:Dataset;
    void:classPartition [
        void:class schema:VisualArtwork;
        void:entities 312000;
        a void:Dataset, prov:Entity;
        prov:wasGeneratedBy [
            a prov:Activity ;
            prov:used "SELECT DISTINCT ?type (COUNT(?type) as ?number) (…)";
        ];
        prov:generatedAtTime "2022-05-03T13:35:23Z"^^xsd:dateTime;
    ].
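
If provenance is tracked per partition, the pipeline would need to record the query and the generation time at the moment each partition is computed. The sketch below shows one way this might look, rendering a single class partition as a Turtle blank node. The function `partition_with_provenance`, its parameters and the sample query are assumptions for illustration; the query string in particular is a placeholder, not the query used above.

```python
from datetime import datetime, timezone


def partition_with_provenance(class_name: str, entities: int, query: str, generated_at: datetime) -> str:
    """Render one class partition with PROV-O properties as a Turtle blank node.

    Assumes `generated_at` is timezone-aware and `query` is a single-line string.
    """
    # xsd:dateTime in UTC, in the same shape as the timestamp in the example above.
    timestamp = generated_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    # Escape backslashes and double quotes so the query fits in a Turtle string literal.
    escaped = query.replace("\\", "\\\\").replace('"', '\\"')
    lines = [
        "[",
        "    a void:Dataset, prov:Entity;",
        f"    void:class {class_name};",
        f"    void:entities {entities};",
        "    prov:wasGeneratedBy [",
        "        a prov:Activity;",
        f'        prov:used "{escaped}";',
        "    ];",
        f'    prov:generatedAtTime "{timestamp}"^^xsd:dateTime;',
        "]",
    ]
    return "\n".join(lines)


# Placeholder query and counts, for illustration only.
turtle = partition_with_provenance(
    "schema:Person",
    980,
    "SELECT (COUNT(DISTINCT ?s) AS ?n) WHERE { ?s a ?type }",
    datetime(2022, 5, 3, 13, 35, 23, tzinfo=timezone.utc),
)
```

The returned string could be placed as the object of void:classPartition on the dataset.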

Questions

[ ] Are there other relevant prov properties that we should add?
[ ] Are these data structures easy enough to query for clients?
[x] Does VoID make sense, or should we (also) use schema.org terms where possible, e.g. schema:workExample?
[x] Are void:Linksets a good idea or should we have a simple list of source/count pairs?

netwerk-digitaal-erfgoed / dataset-knowledge-graph

Data model for results #11