netwerk-digitaal-erfgoed / dataset-register

Components (API and crawler) for the NDE Dataset Register
https://datasetregister.netwerkdigitaalerfgoed.nl/api/
European Union Public License 1.2

Incomplete number of Open Archives datasets #831

Closed: coret closed this issue 3 weeks ago

coret commented 7 months ago

The data catalog https://www.openarchieven.nl/datasets/ (registration URL https://www.openarchieven.nl/.well-known/datacatalog, or via the direct link https://www.openarchieven.nl/datasets/datacatalog.ttl) contains 91 datasets. Yet only 6 of them are present in the Dataset Register?

Check via query:

PREFIX dct: <http://purl.org/dc/terms/>
SELECT * WHERE {
    ?dataset dct:isPartOf "https://www.openarchieven.nl/datasets/" .
}

ddeboer commented 7 months ago

In the logs it says:

{"level":50,"time":1701406951923,"pid":24,"hostname":"registry-crawler-54f948cd59-vzw2s","msg":"SPARQL query result for https://www.openarchieven.nl/.well-known/datacatalog reached the SPARQL limit of 50000"}

coret commented 7 months ago

@ddeboer Which component throws this error, Comunica?

The command curl -L -H "Accept: application/n-triples" https://www.openarchieven.nl/.well-known/datacatalog returns 7349 N-Triples. Where does the figure of more than 50000 (of what?) come from?

In the logs I see a start entry and an error more than 2 minutes apart. What is taking so long?

ddeboer commented 7 months ago

https://github.com/netwerk-digitaal-erfgoed/dataset-register/blob/8d8082beb5a1683e89009582e7675668ba73c568/src/query.ts#L105

coret commented 7 months ago

As the provider of this dataset I still do not understand why this (undocumented) limit is reached, given that curl -L -H "Accept: application/n-triples" https://www.openarchieven.nl/.well-known/datacatalog gives 7349 triples.

The dataset requirements only mention the number of datasets above which pagination should be used:

Therefore, publishers SHOULD split large data catalogs in parts of at most a 1000 datasets, using the Hydra Core Vocabulary.

But the Open Archives data catalog contains only 91 datasets. So do I need to alter the data catalog or the dataset descriptions in some way, or is the issue in https://github.com/netwerk-digitaal-erfgoed/dataset-register/blob/8d8082beb5a1683e89009582e7675668ba73c568/src/fetch.ts#L74 ?

ddeboer commented 7 months ago

50,000 is the limit on the number of result bindings, not on the number of triples. It’s good practice to put some limit on SPARQL queries, although of course we could raise this one to another (arbitrary) number.

Querying just a single dataset gives a ridiculous number of bindings, perhaps due to multi-lingual labels, distribution blank nodes, OPTIONALs and/or bugs in Comunica: ade.json (18 MB!).
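To make the difference between triples and result bindings concrete, here is a minimal Python sketch. It is not the register's actual code: the property names, the values, and the `bindings_with_limit` helper are all illustrative assumptions. It only shows that a query selecting several multi-valued properties at once returns one row per combination of values, so a limit on bindings can be reached while the triple count is still small.

```python
from itertools import islice, product

# Hypothetical description of one dataset: property -> list of values.
dataset = {
    "name": ["Naam", "Name"],                              # 2 language variants
    "keyword": ["Genealogie", "Geboorten", "Genealogy"],   # 3 keywords
}

# Each value is stored as exactly one triple.
triple_count = sum(len(values) for values in dataset.values())


def bindings(description):
    """Yield one result row per combination of values, like a SELECT
    that asks for all properties of the dataset at once."""
    names = list(description)
    for combo in product(*(description[n] for n in names)):
        yield dict(zip(names, combo))


def bindings_with_limit(description, limit):
    """Collect at most `limit` rows and report whether the limit was reached."""
    rows = list(islice(bindings(description), limit))
    return rows, len(rows) >= limit


rows, limit_reached = bindings_with_limit(dataset, limit=50_000)
print(triple_count, len(rows), limit_reached)
```

With these toy values there are 5 triples but 6 bindings; every additional multi-valued property multiplies the number of rows again.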

Should we perhaps consider splitting into two stages?

  1. Identify all dataset URIs.
  2. For each individual dataset URI, execute our query.

That would of course mean ~8000 separate queries in the case of NA.
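The two stages proposed above could look roughly like the following Python sketch. It works on a toy in-memory list of triples rather than on a real SPARQL endpoint, and every name in it (`find_dataset_uris`, `describe_dataset`, `crawl`, the `ex:` URIs) is hypothetical; the register itself is written in TypeScript and may end up structured differently.

```python
# Toy catalog as (subject, predicate, object) triples; all URIs are made up.
TYPE = "rdf:type"
DATASET = "schema:Dataset"
catalog = [
    ("ex:dataset1", TYPE, DATASET),
    ("ex:dataset1", "schema:name", "First dataset"),
    ("ex:dataset1", "schema:keywords", "Genealogy"),
    ("ex:dataset2", TYPE, DATASET),
    ("ex:dataset2", "schema:name", "Second dataset"),
    ("ex:catalog", TYPE, "schema:DataCatalog"),
]


def find_dataset_uris(triples):
    """Stage 1: identify all dataset URIs with one cheap query."""
    return sorted({s for s, p, o in triples if p == TYPE and o == DATASET})


def describe_dataset(triples, uri):
    """Stage 2: run the detailed query for a single dataset URI."""
    return [(s, p, o) for s, p, o in triples if s == uri]


def crawl(triples, limit_per_dataset):
    """Apply the result limit per dataset instead of to the whole catalog."""
    results = {}
    for uri in find_dataset_uris(triples):
        rows = describe_dataset(triples, uri)
        if len(rows) >= limit_per_dataset:
            raise RuntimeError(f"result limit reached for {uri}")
        results[uri] = rows
    return results


descriptions = crawl(catalog, limit_per_dataset=50_000)
print({uri: len(rows) for uri, rows in descriptions.items()})
```

The trade-off is the one already noted: each result set stays bounded by the size of a single dataset description, at the cost of one query per dataset.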

coret commented 7 months ago

"Querying just a single dataset gives a ridiculous number of bindings, perhaps due to multi-lingual labels, distribution blank nodes, OPTIONALs and/or bugs in Comunica: ade.json (18 MB!)."

I've analyzed the ade.json file and found 10,752 variants of the dataset description (see https://validator.schema.org/#url=https%3A%2F%2Fwww.openarchieven.nl%2Fdatasets%2Fade for an easy look at the source).

The datasetdescription has:

The ade.json seems to be some kind of Cartesian product of all these "multiple value" property values (multilanguage or "arrays" like distributions and keywords): 14 × 6 × 2 × 2 × 2 × 2 × 2 × 2 × 2 = 10,752
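The arithmetic can be checked with a few lines of Python. The list of factors is taken directly from the comment above; the comparison with the sum is added here only to show how far apart "one row per combination" and "one entry per value" are.

```python
from math import prod

# Number of values per multi-valued property, as listed in the comment above.
value_counts = [14, 6, 2, 2, 2, 2, 2, 2, 2]

rows_if_multiplied = prod(value_counts)  # one result row per combination
values_if_listed = sum(value_counts)     # one entry per individual value

print(rows_if_multiplied, values_if_listed)
```

The product is 10752, matching the number of variants found in ade.json, while the same values listed individually come to only 34.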

Hope this analysis makes sense and leads to identifying the bug!

coret commented 7 months ago

The following query via Comunica produces the expected output without problems:

$ comunica-sparql https://www.openarch.nl/.well-known/datacatalog "CONSTRUCT WHERE { <https://www.openarchieven.nl/id/dataset_ade> ?p ?o }"
<https://www.openarchieven.nl/id/dataset_ade> a <http://schema.org/Dataset>;
    <http://schema.org/name> "Dataset genealogische metadata Archief Delft via Open Archieven"@nl, "Dataset genealogical metadata Archive Delft via Open Archives"@en;
    <http://schema.org/publisher> <https://www.openarchieven.nl/>;
    <http://schema.org/creator> <https://www.openarchieven.nl/>;
    <http://schema.org/dateCreated> "2023-02-22"^^<http://schema.org/Date>;
    <http://schema.org/dateModified> "2023-02-23"^^<http://schema.org/Date>;
    <http://schema.org/description> "De open data bestaat uit de metadata van 853.880 akten van Archief Delft, met daarop 2.299.475 historische persoonsvermeldingen. De brontypes omvatten bevolkingsregisters, geboorten, huwelijken, overlijdens. Deze dataset kan doorzocht worden via https://www.openarchieven.nl/ade"@nl, "The open data consists of metadata from 853,880 records of Archive Delft, with 2.299.475 historical person observations. The source types included population registers, births, marriages, deaths. This dataset can be searched via https://www.openarchieven.nl/ade"@en;
    <http://schema.org/distribution> _:bc_0_b0_genid-19, _:bc_0_b0_genid-210, _:bc_0_b0_genid-311, _:bc_0_b0_genid-412, _:bc_0_b0_genid-513, _:bc_0_b0_genid-614;
    <http://schema.org/identifier> "https://www.openarchieven.nl/id/dataset_ade";
    <http://schema.org/inLanguage> "nl-NL";
    <http://schema.org/includedInDataCatalog> "https://www.openarchieven.nl/datasets/";
    <http://schema.org/isBasedOn> <https://www.stadsarchiefdelft.nl/collecties/open-data/>;
    <http://schema.org/keywords> "Open Archieven", "Historische persoonsvermeldingen", "Genealogie", "Bevolkingsregisters", "Geboorten", "Huwelijken", "Overlijdens", "Open Archives", "Historical personal data", "Genealogy", "Population registers", "Births", "Marriages", "Deaths";
    <http://schema.org/license> <http://creativecommons.org/publicdomain/zero/1.0/>;
    <http://schema.org/mainEntityOfPage> <https://www.openarchieven.nl/datasets/ade>;
    <http://schema.org/spatialCoverage> "Nederland"@nl, "Netherlands"@en;
    <http://schema.org/thumbnailUrl> <https://www.openarchieven.nl/img/search/ade-oa-nl.png>.

The following query seems to produce a Cartesian-product-like result (6,696 triples):

$ comunica-sparql https://www.openarch.nl/.well-known/datacatalog "CONSTRUCT WHERE { <https://www.openarchieven.nl/id/dataset_ade> ?p ?o ; <http://schema.org/distribution> ?d . ?d ?e ?f .}"

The same query on GraphDB (repository oa-datacatalog) gives 100 triples.
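One possible explanation for the gap between the two counts, offered as an assumption that is not confirmed anywhere in this thread, is that the triples are emitted once per query solution without being merged into a set. The Python sketch below uses made-up toy data to show the effect: the join produces one solution per combination of dataset triple and distribution triple, and instantiating the CONSTRUCT template for every solution repeats the same triples many times, whereas an RDF graph (a set of triples) contains each triple only once.

```python
from itertools import product

# Toy data (made-up identifiers): 3 triples on the dataset, 2 on its distribution.
dataset_triples = [
    ("ex:ds", "schema:name", "Name"),
    ("ex:ds", "schema:license", "ex:cc0"),
    ("ex:ds", "schema:distribution", "_:dist"),
]
distribution_triples = [
    ("_:dist", "schema:encodingFormat", "text/csv"),
    ("_:dist", "schema:contentUrl", "ex:file.csv"),
]

# Solutions of: ex:ds ?p ?o ; schema:distribution ?d . ?d ?e ?f .
# Every dataset triple is combined with every distribution triple.
solutions = list(product(dataset_triples, distribution_triples))

# Instantiate the three-pattern template once per solution, keeping duplicates.
emitted = []
for (s, p, o), (d, e, f) in solutions:
    emitted += [(s, p, o), (s, "schema:distribution", d), (d, e, f)]

# Treat the result as an RDF graph, i.e. a set of distinct triples.
graph = set(emitted)

print(len(solutions), len(emitted), len(graph))
```

In this toy case 6 solutions yield 18 emitted triples but only 5 distinct ones. Whether this is what happens inside Comunica for the query above would need to be verified separately.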