richardlehane / siegfried

signature-based file format identification
http://www.itforarchivists.com/siegfried
Apache License 2.0

Use Wikidata IRIs for processing fields (#1) #161

Closed ross-spencer closed 3 years ago

ross-spencer commented 3 years ago

Corrects the issue described in #153 so that language choices do not impact the processing of Wikidata signature definitions into a signature file.

Notes:

Connected to https://github.com/richardlehane/siegfried/issues/153

ross-spencer commented 3 years ago

NB. Can delete this comment once merged, but this is the text for the Wiki:


The Wikidata identifier uses the file format signatures in Wikidata that can be made compatible with Siegfried. The data can be downloaded to build a new identifier, which can then be used to scan the objects in your collection.

For developers, the exposed parts of the integration API, along with the rest of the Siegfried interfaces, are documented and can be viewed on go.dev.

For users, a basic understanding of the Roy tool and how to build identifiers will be helpful, as will an understanding of Siegfried's identification capabilities.

Overview

Assuming prior knowledge of Siegfried and how to configure its defaults, the commands below show how to use the Wikidata integration.

Harvesting

Harvest a Wikidata signature file as follows: roy harvest -wikidata

The file which this creates can be found at $HOME/<user>/siegfried/wikidata/wikidata-definitions-<wikidata-version>.

The SPARQL query used to generate version 1.0.0 of the Wikidata identifier should be in the Wikidata SPARQL module here.

Version 2.0.0 of the identifier uses the following query:

# Return all file format records from Wikidata.
#
select distinct ?uri ?uriLabel ?puid ?extension ?mimetype ?encoding ?referenceLabel ?date ?relativity ?offset ?sig
where
{
  ?uri wdt:P31/wdt:P279* wd:Q235557.               # Return records of type File Format.
  optional { ?uri wdt:P2748 ?puid.      }          # PUID is used to map to PRONOM signatures proper.
  optional { ?uri wdt:P1195 ?extension. }
  optional { ?uri wdt:P1163 ?mimetype.  }
  optional { ?uri p:P4152 ?object;                 # Format identification pattern statement.
    optional { ?object pq:P3294 ?encoding.   }     # We don't always have an encoding.
    optional { ?object ps:P4152 ?sig.        }     # We always have a signature.
    optional { ?object pq:P2210 ?relativity. }     # Relativity to beginning or end of file.
    optional { ?object pq:P4153 ?offset.     }     # Offset relative to the relativity.
    optional { ?object prov:wasDerivedFrom ?provenance;
       optional { ?provenance pr:P248 ?reference;
                              pr:P813 ?date.
                }
    }
  }
  service wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE], <<lang>>, en". }
}
order by ?uri
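
The `<<lang>>` token in the label service line of the query reads as a placeholder that is filled with the chosen language code before the query is sent. The sketch below illustrates that substitution in Python. It is not Roy's implementation: the helper names `fill_language` and `build_query_url` are hypothetical, and the endpoint is assumed to be the public Wikidata query service.

```python
from urllib.parse import urlencode

# Only the label-service line of the query is reproduced here.
LABEL_SERVICE = (
    "service wikibase:label { bd:serviceParam wikibase:language "
    '"[AUTO_LANGUAGE], <<lang>>, en". }'
)

# Assumption: the public Wikidata SPARQL endpoint.
ENDPOINT = "https://query.wikidata.org/sparql"


def fill_language(template: str, lang: str) -> str:
    """Replace the <<lang>> placeholder with a language code such as 'de'."""
    return template.replace("<<lang>>", lang.lower())


def build_query_url(query: str, endpoint: str = ENDPOINT) -> str:
    """Encode a SPARQL query as a GET request URL asking for JSON results."""
    return endpoint + "?" + urlencode({"query": query, "format": "json"})


german = fill_language(LABEL_SERVICE, "DE")
url = build_query_url(german)
print(german)
```

Because `en` stays at the end of the language list, English remains available as the last choice whatever code is substituted.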

Changing the WikiBase URI

If you have access to another WikiBase implementation that can serve Wikidata-compatible file format information, you can change the URI from which the information is harvested as follows:

Changing the Wikidata results language

The Wikidata results can be returned in a different language. This can be useful for finding the information you need where a translation is available for a file format, or where a format's native language is something other than English. A different language can be returned using the following command:

E.g. German (DE)

An example format with a German translation is "Microsoft Shortcut" or "Dateiverknüpfung": http://www.wikidata.org/entity/Q1109779

E.g.

filename : 'shortcut.lnk'
filesize : 8
modified : 2021-04-18T17:07:20+02:00
errors   : 
matches  :
  - ns      : 'wikidata'
    id      : 'Q1109779'
    format  : 'Dateiverknüpfung'
    URI     : 'http://www.wikidata.org/entity/Q1109779'
    mime    : 'application/x-ms-shortcut'
    basis   : 'byte match at 0, 8'
    source  : 'Wikidata reference is empty'
    warning : 'extension mismatch'

Where a translation doesn't exist for a file format, the translation will fall back, for now, to English (EN).
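
The fallback described above can be pictured as a lookup that tries the requested language first and English second. The sketch below assumes labels arrive as a language-to-text mapping; `pick_label` is a hypothetical helper, and the sample mapping is built from the two names quoted earlier for Q1109779. In practice the Wikidata label service resolves this on the server from the language list in the query.

```python
from typing import Dict, Optional


def pick_label(labels: Dict[str, str], lang: str, fallback: str = "en") -> Optional[str]:
    """Return the label in `lang`, else in the fallback language, else None."""
    return labels.get(lang) or labels.get(fallback)


# Sample mapping built from the two names quoted above (illustrative only).
shortcut = {"en": "Microsoft Shortcut", "de": "Dateiverknüpfung"}

german = pick_label(shortcut, "de")
french = pick_label(shortcut, "fr")  # no "fr" entry in this sample mapping
print(german, "|", french)
```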

Building

Siegfried's binary representation of a signature file is called an Identifier. The identifier must be compiled, and it can be compiled in many different combinations, which are discussed in the Roy documentation. Here we focus on two: Wikidata with PRONOM and Wikidata without PRONOM.

PRONOM

Wikidata will be built with PRONOM by default. The build looks for PRONOM identifiers (PUIDs) in the Wikidata dataset. A Wikidata record with a PUID might not have a signature of its own where PRONOM does, so combining both information sources extends what the Wikidata identifier can match.

No PRONOM

To build a Wikidata identifier without PRONOM, for example to test your Wikidata-developed signatures more easily:

Logging

Roy will output information specific to the Wikidata identifier, which shows how much of the information in Wikidata you can expect to use. The example below shows that there are 192 records with signatures. This means the Wikidata identifier on its own can identify up to 192 file formats through binary pattern matching.

{
  "AllSparqlResults": 13187,
  "CondensedSparqlResults": 4582,
  "SparqlRowsWithSigs": 2927,
  "RecordsWithPotentialSignatures": 196,
  "FormatsWithBadHeuristics": 4,
  "RecordsWithSignatures": 192,
  "MultipleSequences": 11,
  "AllLintingMessages": [
    "Use the `-wikidataDebug` flag to build the identifier to see linting messages"
  ],
  "AllLintingMessageCount": 134,
  "RecordCountWithLintingMessages": 116
}
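
In this example the counts relate to one another simply: 196 records with potential signatures, less 4 formats with bad heuristics, leaves the 192 records with signatures. The short sketch below checks this from the JSON. That the relation holds for every build is an assumption drawn from this one example, and the variable names are illustrative.

```python
import json

# The summary exactly as shown in the example above.
summary = json.loads("""
{
  "AllSparqlResults": 13187,
  "CondensedSparqlResults": 4582,
  "SparqlRowsWithSigs": 2927,
  "RecordsWithPotentialSignatures": 196,
  "FormatsWithBadHeuristics": 4,
  "RecordsWithSignatures": 192,
  "MultipleSequences": 11,
  "AllLintingMessages": [
    "Use the `-wikidataDebug` flag to build the identifier to see linting messages"
  ],
  "AllLintingMessageCount": 134,
  "RecordCountWithLintingMessages": 116
}
""")

# Records that might carry a signature, minus those rejected for bad heuristics.
usable = summary["RecordsWithPotentialSignatures"] - summary["FormatsWithBadHeuristics"]

print("usable records:", usable)
print("matches RecordsWithSignatures:", usable == summary["RecordsWithSignatures"])
```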

Linting can help you in developing file format signatures in Wikidata and identifying errors in that process. It is described below.

Linting

It can be helpful to Wikidata signature developers to identify potential issues in Wikidata signatures. The technique used in version 1.9.x of Siegfried is not perfect. We anticipate greater schema checking against the Wikidata data source using ShEx in time.

For now, when you build with the -wikidataDebug flag named in the logging output, you will see additional "linting" information to help you identify records in Wikidata that can be improved with your attention:

  "AllLintingMessages": [
    "Linting: WARNING no encoding: URI: http://www.wikidata.org/entity/Q4839791 Critical: false",
    "Linting: WARNING no provenance: URI: http://www.wikidata.org/entity/Q4839791 Critical: false",
    "Linting: WARNING no provenance date: URI: http://www.wikidata.org/entity/Q98843338 Critical: false",
    "Linting: ERROR bad heuristic: URI: http://www.wikidata.org/entity/Q1109779 Critical: true",
    "Linting: ERROR blank node returned for offset: URI: http://www.wikidata.org/entity/Q26546575 Critical: false",
    "Linting: WARNING no relativity: URI: http://www.wikidata.org/entity/Q939636 Critical: false",
  ],
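
The messages above follow a regular shape: a level, a description, the entity URI and a critical flag. The sketch below splits them into fields so they can be counted or grouped by record. The regular expression is inferred from these six examples only, and `parse_lint` is a hypothetical helper rather than part of Roy.

```python
import re
from collections import Counter

# Pattern inferred from the sample messages; other message shapes may exist.
LINT_RE = re.compile(
    r"^Linting: (?P<level>WARNING|ERROR) (?P<message>.+?): "
    r"URI: (?P<uri>\S+) Critical: (?P<critical>true|false)$"
)

messages = [
    "Linting: WARNING no encoding: URI: http://www.wikidata.org/entity/Q4839791 Critical: false",
    "Linting: WARNING no provenance: URI: http://www.wikidata.org/entity/Q4839791 Critical: false",
    "Linting: WARNING no provenance date: URI: http://www.wikidata.org/entity/Q98843338 Critical: false",
    "Linting: ERROR bad heuristic: URI: http://www.wikidata.org/entity/Q1109779 Critical: true",
    "Linting: ERROR blank node returned for offset: URI: http://www.wikidata.org/entity/Q26546575 Critical: false",
    "Linting: WARNING no relativity: URI: http://www.wikidata.org/entity/Q939636 Critical: false",
]


def parse_lint(line):
    """Split one linting message into its fields, or return None if it does not fit."""
    found = LINT_RE.match(line)
    if found is None:
        return None
    record = found.groupdict()
    record["critical"] = record["critical"] == "true"
    record["qid"] = record["uri"].rsplit("/", 1)[-1]
    return record


parsed = [record for record in map(parse_lint, messages) if record]
by_level = Counter(record["level"] for record in parsed)
critical_qids = [record["qid"] for record in parsed if record["critical"]]

print(dict(by_level))
print("critical:", critical_qids)
```

Grouping by `qid` gives a worklist of the Wikidata records to revisit, with the critical ones first.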

The linting issues are described in more detail below:

ERROR bad heuristic

A bad heuristic means that information vital to understanding a signature is missing. Wikidata is the first place to look for this inconsistency. An example of a bad heuristic might be a file format listed with two BOF sequences but no offset to describe how one is related to the other. The weakness might instead lie in the code, where it is not yet sophisticated enough to work with the data available to it. If you believe a heuristic can be created for the information in Wikidata, please open a new Siegfried issue.

ERROR blank node returned

A blank node can be returned for any field that Roy/Siegfried anticipates using. A blank node error is returned for a field that has been deliberately listed in Wikidata but for which there has been no value supplied, e.g. the author recognizes there should be something but does not know what that something is. Roy cannot work with this field/value as it is incomplete. The best way to remedy this is to complete the record in Wikidata.

WARNING no encoding

A no-encoding warning exists in Roy at present because Wikidata can record signatures in multiple encodings, e.g. hexadecimal or ASCII. If an encoding isn't specified, Roy will try to parse or convert the data to hexadecimal. If that works, no error will be thrown and the signature can be used. The warning is a signal to the signature developer to rectify the issue in the Wikidata record.
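
One way to picture this fallback is to treat the value as hexadecimal if it parses as such, and otherwise to convert its characters to hexadecimal. The sketch below shows the idea only. The helper name `to_hex` and the exact rules are assumptions, not Roy's actual logic.

```python
def to_hex(signature: str) -> str:
    """Return an uppercase hex string for a signature with no declared encoding."""
    compact = signature.replace(" ", "")
    try:
        # bytes.fromhex raises ValueError on non-hex characters or odd length.
        return bytes.fromhex(compact).hex().upper()
    except ValueError:
        # Not valid hex: treat the text as literal characters instead.
        return signature.encode("utf-8").hex().upper()


print(to_hex("66 4C 61 43"))  # already hexadecimal
print(to_hex("fLaC"))         # literal characters
print(to_hex("CAFE"))         # ambiguous: valid hex, but could also be text
```

The last call shows why recording the encoding in Wikidata matters: a value such as `CAFE` is valid hexadecimal and valid text, and a guess cannot tell which was meant.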

WARNING no provenance

A no-provenance warning indicates that the signature information in Wikidata has no listed reference. A default value will be used in the output by Siegfried. The remedy is to attempt to find a suitable provenance for the information in Wikidata and edit the record directly.

WARNING no provenance date

A no-provenance-date warning indicates that the signature information in Wikidata has no listed date for its reference. No value will be used by Siegfried. The remedy is to attempt to find a suitable provenance date for the information in Wikidata and edit the record directly.

WARNING no relativity

A no-relativity warning indicates that the signature information in Wikidata has no listed relativity value, e.g. it is not listed as BOF (beginning of file) or EOF (end of file). A default value of BOF will be used in the output by Siegfried. The remedy is to determine the correct relativity for the signature and edit the record in Wikidata directly.

Inspect

Signatures can be inspected on a case-by-case basis. To look up a Wikidata identifier and see the signature compiled into the identifier, you can do the following:

E.g. FLAC (Free Lossless Audio Codec)

FORMAT INFO: NAME: 'FLAC'
MIMETYPE: 'AUDIO/X-OGG; AUDIO/X-FLAC; AUDIO/FLAC'
SOURCES: 'GARY KESSLER'S FILE SIGNATURE TABLE (SOURCE DATE: 2017-08-08) PRONOM (OFFICIAL (FMT/279))'
QID: (Q27881556)
globs: *.flac, *.oga
sigs: (B:0 seq "fLaC\x00\x00\x00\"")
      (B:0..4 seq "fLaC\x00\x00\x00\"")
superiors: none
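
Reading the two `sigs` lines above as "this byte sequence at offset 0 from the beginning of the file" and "this byte sequence at any offset from 0 to 4" (an interpretation of the notation, not a description of Siegfried's matcher), the check can be sketched as follows. `matches_bof` is a hypothetical helper.

```python
# The eight bytes shown in the sigs lines: "fLaC", three zero bytes, then 0x22.
FLAC_MAGIC = b'fLaC\x00\x00\x00"'


def matches_bof(data: bytes, pattern: bytes, min_offset: int = 0, max_offset: int = 0) -> bool:
    """True if `pattern` starts at some offset in [min_offset, max_offset] from the start."""
    return any(
        data.startswith(pattern, offset)
        for offset in range(min_offset, max_offset + 1)
    )


sample = FLAC_MAGIC + b"\x10\x00"  # made-up trailing bytes
shifted = b"\x00\x00" + sample     # same sequence, two bytes later

print(matches_bof(sample, FLAC_MAGIC))          # fixed offset 0
print(matches_bof(shifted, FLAC_MAGIC))         # fixed offset 0 on shifted data
print(matches_bof(shifted, FLAC_MAGIC, 0, 4))   # offset range 0..4
```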

Scanning

Once an identifier is built, Siegfried does not require any special invocation to use Wikidata. A standard command might be sf <your-file-name>. The result, e.g. for img.bmp, will look something like this:

---
siegfried   : 1.9.1
scandate    : 2020-11-15T22:24:56-05:00
signature   : default.sig
created     : 2020-11-15T22:24:35-05:00
identifiers : 
  - name    : 'wikidata'
    details : 'wikidata-definitions-1.0.0 (2020-11-15)'
---
filename : 'img.bmp'
filesize : 35
modified : 2020-11-15T22:26:09-05:00
errors   : 
matches  :
  - ns       : 'wikidata'
    id       : 'Q27596325'
    format   : 'Windows Bitmap, version 4'
    URI      : 'http://www.wikidata.org/entity/Q27596325'
    mime     : 
    basis    : 'extension match bmp; byte match at 0, 35'
    source   : 'PRONOM (Wikidata) (source date: 2017-08-08)'
    warning  : 
    software : 
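
For quick scripting, the fields of a single match in output like the above can be pulled out with a few lines of Python. This is a deliberately naive sketch: it assumes one match per file and the exact layout shown, and `parse_match_fields` is a hypothetical helper. A proper YAML parser is the more robust choice for real use.

```python
import re

# The per-file part of the scan output shown above.
SCAN_OUTPUT = """\
filename : 'img.bmp'
filesize : 35
modified : 2020-11-15T22:26:09-05:00
errors   :
matches  :
  - ns       : 'wikidata'
    id       : 'Q27596325'
    format   : 'Windows Bitmap, version 4'
    URI      : 'http://www.wikidata.org/entity/Q27596325'
    mime     :
    basis    : 'extension match bmp; byte match at 0, 35'
    source   : 'PRONOM (Wikidata) (source date: 2017-08-08)'
    warning  :
    software :
"""

FIELD_RE = re.compile(r"^\s*(?:- )?(?P<key>\w+)\s*:\s*(?P<value>.*)$")


def parse_match_fields(text: str) -> dict:
    """Collect key/value pairs that appear after the 'matches' line."""
    fields = {}
    in_matches = False
    for line in text.splitlines():
        if line.startswith("matches"):
            in_matches = True
            continue
        if not in_matches:
            continue
        found = FIELD_RE.match(line)
        if found:
            # Drop surrounding whitespace and the single quotes around values.
            fields[found.group("key")] = found.group("value").strip().strip("'")
    return fields


fields = parse_match_fields(SCAN_OUTPUT)
print(fields["id"], "-", fields["format"])
print(fields["basis"].split("; "))
```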

Terminology