richardlehane / siegfried

signature-based file format identification
http://www.itforarchivists.com/siegfried
Apache License 2.0

Wikidata harvest using non-default English results in non-English encoding values #153

Closed ross-spencer closed 2 years ago

ross-spencer commented 3 years ago

Other fields might be affected, but one clear example is that hexadecimal becomes Hexadezimalsystem. It might be possible to use a different encoding per field in Wikidata, but it's not yet clear what the right solution is.

Harvest using: ./roy harvest -wikidata -lang de and then inspect the Wikidata output in the Siegfried folder.

A useful example: https://w.wiki/rva All hex labels via getentities API endpoint: here

ross-spencer commented 3 years ago

This was an interesting one. We have a draft PR here: https://github.com/richardlehane/siegfried/pull/161

The initial difficulty in solving this was that we treated every string as equal, when in reality there are two kinds of value. Some are for display (the format label), and we can return those in another language via Wikidata. Others need to be processed by the computer, and their labels can be in any language or any form of string. In fact, Wikidata/linked data provides IRIs for this very purpose: a relativity of BOF can be labelled Dateianfang or beginning of file, but it will always be http://www.wikidata.org/entity/Q35436009.

We also had a nice problem where requesting a language string for a field that doesn't have a translation in Wikidata resulted in just a QID and no easily readable data at all. So:

      "referenceLabel" : {
        "xml:lang" : "en",
        "type" : "literal",
        "value" : "Gary Kessler's File Signature Table"
      },

Might appear as:

      "referenceLabel" : {
        "type" : "literal",
        "value" : "Q12345"
      },
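
As a rough illustration of the problem, the two bindings above can be told apart with a small check. This is only a sketch: the function name is hypothetical, and the heuristic (no "xml:lang" key plus a value that looks like a QID) is an assumption drawn from the two examples shown here, not something taken from the Siegfried code.

```python
import re

# A bare Wikidata item identifier: "Q" followed by a positive integer.
QID_PATTERN = re.compile(r"Q[1-9]\d*")


def is_untranslated_label(binding: dict) -> bool:
    """Return True when a label binding appears to hold only a bare QID.

    Assumption: an untranslated label comes back without an "xml:lang"
    key and with a value that is just the QID, as in the example above.
    """
    value = binding.get("value", "")
    has_language_tag = "xml:lang" in binding
    return not has_language_tag and QID_PATTERN.fullmatch(value) is not None


translated = {
    "xml:lang": "en",
    "type": "literal",
    "value": "Gary Kessler's File Signature Table",
}
untranslated = {"type": "literal", "value": "Q12345"}

print(is_untranslated_label(translated))
print(is_untranslated_label(untranslated))
```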

Solution

We can add a language fallback to our SPARQL query. It will do two things:

  1. Continue to let users select their own language code.
  2. If a label doesn't exist in the chosen language, fall back to en.
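
One way to get both behaviours is the Wikidata Query Service label service, which accepts a comma-separated list of language codes and tries them in order. The sketch below builds such a clause. It is not necessarily the exact text used in the PR, and the function name and its parameters are hypothetical.

```python
def label_service_clause(lang: str, fallback: str = "en") -> str:
    """Build a wikibase:label SERVICE clause with a fallback language.

    The label service tries each language code in order, so listing the
    user's choice first and the fallback second gives the behaviour
    described above. Hypothetical helper, for illustration only.
    """
    # Avoid listing the same code twice when the user already chose the fallback.
    languages = [lang] if lang == fallback else [lang, fallback]
    return (
        'SERVICE wikibase:label { bd:serviceParam wikibase:language "%s". }'
        % ",".join(languages)
    )


print(label_service_clause("de"))
```

With -lang de this would request German labels first and English labels where no German one exists.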

In combination with this, there are two fields whose labels we currently rely on: encoding and relativity, with example values hexadecimal and beginning of file. Instead of relying on the label values for these, we can use the IRIs provided by Wikidata, e.g. http://www.wikidata.org/entity/Q82828 and http://www.wikidata.org/entity/Q35436009 respectively. These will never change, though over time we may find other values to add for each field.
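
The idea of keying on IRIs rather than labels can be sketched as a pair of lookup tables. The table and function names are hypothetical and the tables contain only the two IRIs mentioned above; the actual change in #161 is written in Go and may be structured differently.

```python
from typing import Optional

# Language-independent identifiers mapped to the internal values we process.
# Only the two IRIs named in this thread are listed; more may be added later.
ENCODING_BY_IRI = {
    "http://www.wikidata.org/entity/Q82828": "hexadecimal",
}
RELATIVITY_BY_IRI = {
    "http://www.wikidata.org/entity/Q35436009": "beginning of file",
}


def resolve(binding: dict, table: dict) -> Optional[str]:
    """Map a SPARQL JSON binding of type "uri" to an internal value.

    Returns None for non-URI bindings and for IRIs we do not know yet,
    so the caller can decide how to handle them.
    """
    if binding.get("type") != "uri":
        return None
    return table.get(binding.get("value", ""))


encoding = {"type": "uri", "value": "http://www.wikidata.org/entity/Q82828"}
print(resolve(encoding, ENCODING_BY_IRI))
```

Because the lookup never touches the label, harvesting with -lang de or any other language gives the same result.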

This is the focus of the change at #161 but the write-up here may be useful/interesting to others.

ross-spencer commented 2 years ago

Merged into develop via https://github.com/richardlehane/siegfried/pull/161 Commit: https://github.com/richardlehane/siegfried/commit/0128814b992c99156e9465c8eca849587c924202