richardlehane / siegfried

signature-based file format identification
http://www.itforarchivists.com/siegfried
Apache License 2.0

Wikidata TrID results do not have the same provenance metadata as other results resulting in large numbers of linting messages #160

Open ross-spencer opened 3 years ago

ross-spencer commented 3 years ago

While working out the fix for https://github.com/richardlehane/siegfried/issues/153, I am seeing a lot of Wikidata linting messages appear for the new TrID patterns. This is because the SPARQL for provenance expected more uniformity.

We need:

        optional { ?object prov:wasDerivedFrom ?provenance;
           optional { ?provenance pr:P248 ?reference. }
           optional { ?provenance pr:P813 ?date. }
        }

but were using:

        optional { ?object prov:wasDerivedFrom ?provenance;
           optional { ?provenance pr:P248 ?reference;
                                  pr:P813 ?date.
                    }
        }

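This has the unfortunate result of producing an incomplete graph if ?date isn't available in the provenance for a record: because ?reference and ?date sit in the same optional group, a missing ?date means ?reference is not returned either. E.g. the record for Gherkin files.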
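
To illustrate the difference between the two query shapes, here is a toy Python model of how an optional group binds its variables. It is only a sketch of the left-join behaviour, not siegfried code and not a SPARQL engine; the node names (`ref_complete`, `ref_no_date`) and the values are made up for the example.

```python
# Toy model of SPARQL OPTIONAL (left-join) semantics; not siegfried code.
# Each provenance node maps reference properties to values.
# The node names and values below are hypothetical, for illustration only.
provenance = {
    "ref_complete": {"P248": "Q_source_a", "P813": "2020-01-01"},
    "ref_no_date": {"P248": "Q_source_b"},  # no P813 (date) on this node
}


def combined_optional(node):
    """One optional group that needs both P248 and P813 to match together."""
    props = provenance[node]
    if "P248" in props and "P813" in props:
        return {"reference": props["P248"], "date": props["P813"]}
    return {}  # the whole group fails, so neither variable is bound


def separate_optionals(node):
    """Two independent optional groups, each binding whatever it can."""
    props = provenance[node]
    row = {}
    if "P248" in props:
        row["reference"] = props["P248"]
    if "P813" in props:
        row["date"] = props["P813"]
    return row


for node in provenance:
    print(node, combined_optional(node), separate_optionals(node))
```

In this model the node without a date loses its reference under the combined group but keeps it when the two optionals are separate, which matches the behaviour described for records such as the Gherkin one.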

We'll change the SPARQL to the first form above, which should be okay, but we need to verify.