netwerk-digitaal-erfgoed / dataset-register

Components (API and crawler) for the NDE Dataset Register
https://datasetregister.netwerkdigitaalerfgoed.nl/api/
European Union Public License 1.2

Many dangling datasets #817

Open ddeboer opened 12 months ago

ddeboer commented 12 months ago

As seen in this query, not all datasets have a rating yet. We previously thought this might be due to GraphDB crashing (https://github.com/netwerk-digitaal-erfgoed/infrastructure/issues/50), but as it turns out there’s a much bigger reason: dangling datasets whose registration URL:

There are 7552 of these dangling datasets!

Examples:

  1. https://data.spinque.com/ld/data/vangoghworldwide/datacatalog.jsonld is no longer valid, because of the

    "license": "not specified",

    While we allow strings, that should be structured in JSON-LD as "@value": … for it to pass the SHACL validation. And do we really want to allow values like this anyway?

  2. https://raw.githubusercontent.com/netwerk-digitaal-erfgoed/dataset-register-entries/main/Picturae/catalog-picturae-schema-1.jsonld has invalid datetimes. Fixed in https://github.com/netwerk-digitaal-erfgoed/dataset-register-entries/commit/6cd91b1d578671859cf87b7521d16fc1176fc5fb.

  3. https://data.dc4eu.nl/catalog/natag no longer contains http://data.dc4eu.nl/dataset/03b88faf-0273-4a5f-b554-8e4edb6d562e.

  4. For none of the Collectienederland datasets, including http://data.collectienederland.nl/id/dataset/nederlands-openluchtmuseum, I can find a registration URL. I guess that would be https://data.collectienederland.nl/id/datacatalog, but it has either never been registered or later removed. Re-added the catalog.

  5. I’m quite sure https://archief.nl/id/datacatalog/toegang has been registered because I checked it myself (#795) but now that registration URL has been removed. Is this us or has user admin-na done this? Re-added the catalog.

Either these catalogs later added invalid data, or our SHACL got stricter in subtle ways so that it now rejects them.
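To make case 1 concrete, here is a minimal sketch of a pre-check that flags free-text license values such as "not specified". It is not the register's SHACL: the function name `license_problem` and the rule that a license should be an absolute http(s) IRI are assumptions for illustration only.

```python
from typing import Optional
from urllib.parse import urlparse


def license_problem(dataset: dict) -> Optional[str]:
    """Return a description of what is wrong with the license, or None if it looks fine.

    Assumption (not taken from the register's shapes): a usable license is an
    absolute http(s) IRI, given as a bare string or as a JSON-LD
    {"@id": ...} / {"@value": ...} object.
    """
    value = dataset.get("license")
    if value is None:
        return "license is missing"
    # Unwrap JSON-LD node or value objects.
    if isinstance(value, dict):
        value = value.get("@id") or value.get("@value")
    if not isinstance(value, str):
        return "license is not a string or a JSON-LD value object"
    parsed = urlparse(value)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return f"license {value!r} is not an http(s) IRI"
    return None


print(license_problem({"@type": "Dataset", "license": "not specified"}))
```

A check like this would answer the second question above with "no": it treats any free-text license as a problem, which is stricter than what we currently allow.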

@coret @faina007 @eddeheerna Please share what you know about these cases.

coret commented 12 months ago

I guess we can label 1 - 3 as users providing wrong data. The crawler should not overwrite the current dataset description, and it should write a non-200 status entry to the https://demo.netwerkdigitaalerfgoed.nl/registry/registrations graph. The dataset providers should be notified by mail (currently a manual process).
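Roughly what I mean, as a sketch only. The `Registry` class, its in-memory dicts and the `providers_to_notify` helper are made up for illustration; the real crawler writes to the triple store and may well work differently.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Registry:
    """Hypothetical in-memory stand-in for the descriptions and the registrations graph."""

    descriptions: dict = field(default_factory=dict)   # registration URL -> last valid description
    registrations: list = field(default_factory=list)  # one status entry per crawl attempt

    def record_crawl(self, url: str, status: int,
                     description: Optional[dict], valid: bool) -> None:
        # Always log the attempt, so failed reads stay visible.
        self.registrations.append({
            "url": url,
            "status": status,
            "valid": valid,
            "dateRead": datetime.now(timezone.utc).isoformat(),
        })
        # Only replace the stored description after a successful, valid read.
        if status == 200 and valid and description is not None:
            self.descriptions[url] = description

    def providers_to_notify(self) -> list:
        """Registration URLs whose most recent crawl attempt failed."""
        latest = {}
        for entry in self.registrations:
            latest[entry["url"]] = entry
        return [u for u, e in latest.items() if e["status"] != 200 or not e["valid"]]


registry = Registry()
url = "https://example.org/catalog"  # placeholder, not a real registration
registry.record_crawl(url, 200, {"name": "first version"}, valid=True)
registry.record_crawl(url, 404, None, valid=False)
print(registry.descriptions[url])      # the earlier description is still there
print(registry.providers_to_notify())  # the failing URL is listed for follow-up
```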

For 4 and 5 (7552 dangling datasets) the issue is really bad.

About 4, I can't find a schema:EntryPoint for this dataset either. But looking at the schema:dateReads, the crawler was able to read this dangling dataset up to a week ago?

<http://data.collectienederland.nl/id/dataset/nederlands-openluchtmuseum>
        <http://schema.org/dateRead>  "2022-01-18T12:03:05.129Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> ;
        <http://schema.org/dateRead>  "2022-01-19T13:02:40.944Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> ;
        ....
        <http://schema.org/dateRead>  "2023-11-03T14:01:48.449Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> ;
        <http://schema.org/dateRead>  "2023-11-04T15:01:48.017Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> ;
        <http://schema.org/dateRead>  "2023-11-05T16:02:04.652Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> ;
        a                             <http://schema.org/Dataset> .

I see similar patterns with other dangling datasets (with different last schema:dateRead dates), like:

<https://opendata.picturae.com/dataset/wba_a2a_na_a>
        <http://schema.org/dateRead>  "2021-11-09T11:24:19.878Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> ;
        ...
        <http://schema.org/dateRead>  "2022-01-25T19:00:46.584Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> ;
        a                             <http://schema.org/Dataset> .

<https://www.goudatijdmachine.nl/data/api/items/13000>
        <http://schema.org/dateRead>  "2021-10-25T14:32:06.527Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> ;
        ...
        <http://schema.org/dateRead>  "2023-11-09T22:01:12.372Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> ;
        a                             <http://schema.org/Dataset> .
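To compare these cases, the most recent schema:dateRead per dataset is the interesting value, since it hints at when the registration stopped being crawled. A minimal sketch for pulling it out of Turtle like the snippets above follows; it uses a regular expression rather than a real RDF parser, and the function name `last_date_read` is made up for illustration.

```python
import re
from datetime import datetime

# Matches the literal of each schema:dateRead triple in the Turtle shown above.
DATE_READ = re.compile(r'<http://schema\.org/dateRead>\s+"([^"]+)"')


def last_date_read(turtle: str):
    """Return the most recent dateRead in the snippet as a datetime, or None."""
    stamps = [
        # fromisoformat() does not accept a trailing "Z" before Python 3.11.
        datetime.fromisoformat(raw.replace("Z", "+00:00"))
        for raw in DATE_READ.findall(turtle)
    ]
    return max(stamps) if stamps else None


sample = """
<http://data.collectienederland.nl/id/dataset/nederlands-openluchtmuseum>
        <http://schema.org/dateRead>  "2023-11-03T14:01:48.449Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> ;
        <http://schema.org/dateRead>  "2023-11-05T16:02:04.652Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> ;
        <http://schema.org/dateRead>  "2023-11-04T15:01:48.017Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> ;
        a                             <http://schema.org/Dataset> .
"""

print(last_date_read(sample).isoformat())
```

For anything beyond a quick look, a SPARQL query with MAX over schema:dateRead on the store itself would be the more robust route.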

Another strange registration (let's call it 6):

<https://data.netwerkdigitaalerfgoed.nl/Peace-Palace-Library/Peace-Movement-collection/>
        <http://schema.org/about>       <https://data.netwerkdigitaalerfgoed.nl/Peace-Palace-Library/Peace-Movement-collection> ;
        <http://schema.org/datePosted>  "2023-06-19T12:39:59.752Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> .

No schema:Dataset or schema:EntryPoint, and (I assume because of this) no schema:dateReads.

coret commented 12 months ago

Some datasets have a lot of schema:dateReads. I guess "old" ones are only removed after a successful read?

ddeboer commented 12 months ago

> Some datasets have a lot of schema:dateReads. I guess "old" ones are only removed after a successful read?

They are never removed. We keep all of them for debugging purposes.

> The crawler should not overwrite the current dataset description and write a non-200 status entry to the https://demo.netwerkdigitaalerfgoed.nl/registry/registrations graph.

The crawler already does this? As you can see, old descriptions are preserved. Or do you mean something else?

> Another strange registration (let's call it 6):

This one looks fine, with a dateRead of last night. And it has a rating.

> About 4, I can't find a schema:EntryPoint for this dataset either. But looking at the schema:dateReads, the crawler was able to read this dangling dataset up to a week ago?

Good point. So apparently the registration was removed 5 or 6 Nov, but why? And who did this? It’s around the time #814 went live, but that doesn’t remove registrations.

eddeheerna commented 12 months ago

Hi, regarding point 5: we have not changed anything here. I asked Sjoerd, but he doesn't know anything about it either.

Ed
