Closed by ross-spencer 2 years ago
This was an interesting one. We have a draft PR here: https://github.com/richardlehane/siegfried/pull/161
The initial difficulty in solving this was that we treated every string as equal, when in reality there are two kinds of value. Some are for display (the format label), and we can return these in another language via Wikidata. Others need to be processed by the computer, and these can be in any language or any form of string. In fact, Wikidata/linked data provides IRIs for this very purpose: a relativity of BOF can be "Dateianfang" or "beginning of file", but it will always be http://www.wikidata.org/entity/Q35436009.
We also had a nice problem where requesting a language string for a field that doesn't have a translation in Wikidata resulted in just a QID and no easily readable data at all. So:
"referenceLabel" : {
  "xml:lang" : "en",
  "type" : "literal",
  "value" : "Gary Kessler's File Signature Table"
},
Might appear as:
"referenceLabel" : {
  "type" : "literal",
  "value" : "Q12345"
},
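The difference between the two bindings above can be checked mechanically. The following is a minimal Python sketch, not the approach taken in the PR: the function name `is_untranslated_label` is hypothetical, and the heuristic (no `xml:lang` key plus a value shaped like a QID) is an assumption based only on the two examples shown here.

```python
import re

# Wikidata item identifiers are "Q" followed by a positive integer.
QID_PATTERN = re.compile(r"^Q[1-9]\d*$")


def is_untranslated_label(binding):
    """Return True when a label binding looks like a bare QID fallback.

    Assumption: a real label carries an "xml:lang" key, while the
    fallback carries none and its value is just the item's QID.
    """
    value = binding.get("value", "")
    has_language = "xml:lang" in binding
    return not has_language and QID_PATTERN.match(value) is not None


# The two bindings from the write-up above.
translated = {
    "xml:lang": "en",
    "type": "literal",
    "value": "Gary Kessler's File Signature Table",
}
fallback = {"type": "literal", "value": "Q12345"}

print(is_untranslated_label(translated))
print(is_untranslated_label(fallback))
```

A check like this is only useful for spotting the problem in harvested output; it does not recover a readable label.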
Solution
We can add the following to our SPARQL:
service wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE], <<lang>>, en". }
It will do two things: return labels in the requested language where a translation exists, and fall back to en where it does not.
In combination with this, there are two fields where we currently rely on the label values: encoding and relativity (examples: "hexadecimal", "beginning of file"). Instead of relying on the label values for these, we can use the IRIs provided by Wikidata, e.g. http://www.wikidata.org/entity/Q82828 and http://www.wikidata.org/entity/Q35436009 respectively. These will never change, though over time we may find other values to add for each of these.
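To illustrate why keying on the IRI is more robust than keying on the label, here is a minimal Python sketch. It is not the code from the PR: the lookup tables, the `resolve` helper, and the binding field names (`encoding`, `encodingLabel`) are hypothetical, and only the two IRIs and the label strings come from this write-up.

```python
# IRIs are taken from the write-up; the internal names on the right-hand
# side are hypothetical values chosen for this illustration.
ENCODING_BY_IRI = {
    "http://www.wikidata.org/entity/Q82828": "hexadecimal",
}
RELATIVITY_BY_IRI = {
    "http://www.wikidata.org/entity/Q35436009": "beginning of file",
}


def resolve(binding, field, table):
    """Map the IRI stored under `field` to an internal value.

    Returns None for an IRI that is not in the table yet, so that new
    values can be noticed and added over time.
    """
    iri = binding.get(field, {}).get("value")
    return table.get(iri)


# Hypothetical result rows for the same signature, harvested in two languages.
english = {
    "encoding": {"type": "uri", "value": "http://www.wikidata.org/entity/Q82828"},
    "encodingLabel": {"xml:lang": "en", "type": "literal", "value": "hexadecimal"},
}
german = {
    "encoding": {"type": "uri", "value": "http://www.wikidata.org/entity/Q82828"},
    "encodingLabel": {"xml:lang": "de", "type": "literal", "value": "Hexadezimalsystem"},
}

# The labels differ, but the IRI resolves to the same internal value.
print(resolve(english, "encoding", ENCODING_BY_IRI))
print(resolve(german, "encoding", ENCODING_BY_IRI))
```

The label remains available for display in whichever language was requested, while processing depends only on the IRI.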
This is the focus of the change at #161 but the write-up here may be useful/interesting to others.
Other fields might be affected, but one clear case is that "hexadecimal" becomes "Hexadezimalsystem". It might be possible to use different encoding per field in Wikidata, but it's not yet clear what the right solution is.
Harvest using:
./roy harvest -wikidata -lang de
and then inspect the Wikidata output in the Siegfried folder.
A useful example: https://w.wiki/rva
All hex labels via getentities API endpoint: here