richardlehane / siegfried

signature-based file format identification
http://www.itforarchivists.com/siegfried
Apache License 2.0

Trial and write documentation about whether custom Wikibase instructions can also be used to customize Wikidata queries #184

Closed ross-spencer closed 2 years ago

ross-spencer commented 2 years ago

This came up in the #AusPreserves meeting. If the SPARQL query can be customized, users gain additional flexibility. Related to #183, a narrower query can also reduce the load on Wikidata during harvest, where their back-end is already under a lot of stress to deliver results.

It dawned on me that, while the Wikidata query is compiled into Siegfried, the custom Wikibase support could potentially be used to connect to Wikidata proper with a slightly modified query. Those instructions are here:

We'd just need to change the query to match what the Wikidata Query Service (WDQS) expects, and make sure the URIs we connect to are correct, including port information.
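To check that an endpoint URI is well formed before pointing roy at it, something like the following sketch may help. It is not part of Siegfried; `build_sparql_url` is a hypothetical helper, and it assumes the endpoint accepts the standard `query` parameter plus a `format=json` parameter over GET, as the public WDQS endpoint does.

```python
from urllib.parse import urlencode, urlsplit


def build_sparql_url(endpoint: str, query: str) -> str:
    """Return a GET URL that sends `query` to a SPARQL endpoint.

    Hypothetical helper for sanity-checking an endpoint by hand;
    roy builds its own requests internally.
    """
    # Tolerate a trailing "?" such as ".../sparql?".
    base = endpoint.rstrip("?")
    parts = urlsplit(base)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"not an absolute http(s) URL: {endpoint!r}")
    # Accessing .port raises ValueError if the port is malformed.
    _ = parts.port
    params = urlencode({"query": query, "format": "json"})
    return f"{base}?{params}"
```

Opening the resulting URL in a browser or with curl should return JSON results if the endpoint, path, and port are all correct.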

Customization would be limited to reducing what the existing SPARQL query returns, i.e. filtering.
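One way to keep a custom query within "filtering only" is to confirm that it still selects the same variables as the query quoted in this thread. The sketch below does that with a regular expression. It assumes, without confirmation from the Siegfried source, that roy depends on this exact list and order of variables; `EXPECTED_VARS`, `select_vars`, and `keeps_projection` are hypothetical names.

```python
import re

# Variables selected by the queries quoted in this thread.
# Assumption: roy expects exactly these, in this order.
EXPECTED_VARS = [
    "uri", "uriLabel", "puid", "extension", "mimetype", "encoding",
    "referenceLabel", "date", "relativity", "offset", "sig",
]


def select_vars(query: str) -> list[str]:
    """Return the variable names in the first SELECT clause, in order."""
    # Drop full-line "#" comments so words in them cannot confuse the match.
    lines = [ln for ln in query.splitlines() if not ln.lstrip().startswith("#")]
    text = " ".join(lines)
    match = re.search(r"select\s+(?:distinct\s+)?(.*?)\s*where", text, re.IGNORECASE)
    if match is None:
        raise ValueError("no SELECT ... WHERE clause found")
    return re.findall(r"\?(\w+)", match.group(1))


def keeps_projection(query: str) -> bool:
    """True if the custom query selects the same variables as the original."""
    return select_vars(query) == EXPECTED_VARS
```

A query that only adds restricting triple patterns inside the WHERE block passes this check; one that drops or renames a selected variable does not.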

Examples may be:

Recording that idea here as a potential docs improvement.

ross-spencer commented 2 years ago

This looks like it will work and will make it into the documentation. Unfortunately I am running up against Wikidata rate limiting today.

Connect string: roy harvest -wikidata -wikidataendpoint https://query.wikidata.org/sparql? -wikibaseurl https://www.wikidata.org/w/api.php

wikibase.json:

{
 "PronomProp": "http://www.wikidata.org/entity/Q35432091",
 "BofProp": "http://www.wikidata.org/entity/Q35436009",
 "EofProp": "http://www.wikidata.org/entity/Q1148480"
}
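A quick check of wikibase.json before harvesting can catch a missing key or a malformed URI. This is a minimal sketch, assuming the three keys shown above are the ones required and that each value should be an absolute http(s) URI; `check_wikibase_config` is a hypothetical helper, not part of roy.

```python
import json

# Keys taken from the wikibase.json example in this thread.
REQUIRED_KEYS = ("PronomProp", "BofProp", "EofProp")


def check_wikibase_config(text: str) -> dict:
    """Parse wikibase.json content and verify the expected keys are URIs."""
    config = json.loads(text)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise ValueError(f"missing keys: {', '.join(missing)}")
    for key in REQUIRED_KEYS:
        value = config[key]
        if not isinstance(value, str) or not value.startswith(("http://", "https://")):
            raise ValueError(f"{key} is not an http(s) URI: {value!r}")
    return config
```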

TrID query:

# Return all file format records from Wikidata.
#
# Custom query example:
#
# All formats must have a signature.
# All signatures must come from the TrID Q41799265 reference.
#
# NB. Keep in mind all optional fields as they increase the
# number of fields where schemas aren't consistent across entries.
#
SELECT DISTINCT ?uri ?uriLabel ?puid ?extension ?mimetype ?encoding ?referenceLabel ?date ?relativity ?offset ?sig WHERE {
  ?uri (wdt:P31/(wdt:P279*)) wd:Q235557.
  OPTIONAL { ?uri wdt:P2748 ?puid. }
  OPTIONAL { ?uri wdt:P1195 ?extension. }
  OPTIONAL { ?uri wdt:P1163 ?mimetype. }
  ?uri p:P4152 ?object.
  ?object ps:P4152 ?sig;
    prov:wasDerivedFrom ?provenance.
  ?provenance pr:P248 wd:Q41799265, ?reference.  # <-- modified to return TrID only, and TrID's reference label.
  OPTIONAL { ?provenance pr:P813 ?date. }
  OPTIONAL { ?object pq:P3294 ?encoding. }
  OPTIONAL { ?object pq:P2210 ?relativity. }
  OPTIONAL { ?object pq:P4153 ?offset. }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE], en". }
}
ORDER BY (?uri)

Example output:

---
siegfried   : 1.9.2
scandate    : 2022-09-07T11:55:20+02:00
signature   : default.sig
created     : 2022-09-07T11:55:18+02:00
identifiers :
  - name    : 'wikidata'
    details : 'wikidata-definitions-3.0.0 (2022-09-07)'
---
filename : 'trid'
filesize : 6
modified : 2022-09-07T11:55:14+02:00
errors   :
matches  :
  - ns        : 'wikidata'
    id        : 'Q100137240'
    format    : 'VariCAD Drawing'
    URI       : 'http://www.wikidata.org/entity/Q100137240'
    permalink : 'https://www.wikidata.org/w/api.php/w/index.php?oldid=1423314911&title=Q100137240'
    mime      : 'application/octet-stream'
    basis     : 'byte match at 0, 3 (TrID)'
    warning   : 'extension mismatch'

Filter the signature file by format type, raster-graphics:

# Return all file format records from Wikidata.
#
# Custom query example:
#
# Formats must be an instance of, or subclass of raster-graphics file format.
#
#
select distinct ?uri ?uriLabel ?puid ?extension ?mimetype ?encoding ?referenceLabel ?date ?relativity ?offset ?sig
where
{
  ?uri wdt:P31/wdt:P279* wd:Q235557.
  ?uri wdt:P31/wdt:P279* wd:Q105599390.    # <-- line added to return instance/sub-class of raster-graphics-format
  optional { ?uri wdt:P2748 ?puid.      }
  optional { ?uri wdt:P1195 ?extension. }
  optional { ?uri wdt:P1163 ?mimetype.  }
  optional { ?uri p:P4152 ?object;
    optional { ?object pq:P3294 ?encoding.   }
    optional { ?object ps:P4152 ?sig.        }
    optional { ?object pq:P2210 ?relativity. }
    optional { ?object pq:P4153 ?offset.     }
    optional { ?object prov:wasDerivedFrom ?provenance;
       optional { ?provenance pr:P248 ?reference;
                              pr:P813 ?date.
                }
    }
  }
  service wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE], en". }
}
order by ?uri

Example output:

---
siegfried   : 1.9.2
scandate    : 2022-09-07T12:30:19+02:00
signature   : default.sig
created     : 2022-09-07T12:30:16+02:00
identifiers :
  - name    : 'wikidata'
    details : 'wikidata-definitions-3.0.0 (2022-09-07)'
---
filename : 'trid'
filesize : 10
modified : 2022-09-07T12:29:20+02:00
errors   :
matches  :
  - ns        : 'wikidata'
    id        : 'Q1143961'
    format    : 'JBIG2'
    URI       : 'http://www.wikidata.org/entity/Q1143961'
    permalink : 'https://www.wikidata.org/w/api.php/w/index.php?oldid=1526516378&title=Q1143961'
    mime      :
    basis     : 'byte match at 0, 8 (Gary Kessler''s File Signature Table (source date: 2017-08-08))'
    warning   : 'extension mismatch'

ross-spencer commented 2 years ago

There may still be some typos here and there, but the documentation is here (feature complete! 🤘):