richardlehane / siegfried

signature-based file format identification
http://www.itforarchivists.com/siegfried
Apache License 2.0

Demonstration of Software output via Siegfried YAML #152

Open ross-spencer opened 3 years ago

ross-spencer commented 3 years ago

This is a conversation starter about what can be added to the Siegfried output via Wikidata, and a demonstration of how to do it. The example needs some work because it modifies the Siegfried writer directly. If special cases have to be handled within the writer, more work will probably be needed to make this properly extensible. That is also part of the conversation.

Example:

---
siegfried   : 1.9.1
scandate    : 2020-11-15T11:37:07-05:00
signature   : default.sig
created     : 2020-11-15T11:25:27-05:00
identifiers : 
  - name    : 'wikidata'
    details : 'wikidata-definitions-2.x.x (2020-11-15)'
---
filename : 'skpro/test1'
filesize : 10
modified : 2020-07-08T23:41:53-04:00
errors   : 
matches  :
  - ns       : 'wikidata'
    id       : 'Q27596100'
    format   : 'Windows Bitmap, version 1'
    URI      : 'http://www.wikidata.org/entity/Q27596100'
    mime     : 
    basis    : 'byte match at 0, 10'
    source   : 'PRONOM (Wikidata) (source date: 2017-08-08)'
    warning  : 'extension mismatch'
    software : 
        Converseen: http://www.wikidata.org/entity/Q97012479
---
filename : 'skpro/test6'
filesize : 8
modified : 2020-07-08T23:53:57-04:00
errors   : 
matches  :
  - ns       : 'wikidata'
    id       : 'Q4045294'
    format   : 'New Executable'
    URI      : 'http://www.wikidata.org/entity/Q4045294'
    mime     : 
    basis    : 'byte match at [[0 2] [6 2]]'
    source   : 'Wikidata reference is empty'
    warning  : 'extension mismatch'
    software : 
        Windows 8: http://www.wikidata.org/entity/Q5046
        Windows 7: http://www.wikidata.org/entity/Q11215
        Windows 98: http://www.wikidata.org/entity/Q483132
        Windows 10: http://www.wikidata.org/entity/Q18168774
---
filename : 'skpro/test9'
filesize : 35
modified : 2020-07-08T23:53:34-04:00
errors   : 
matches  :
  - ns       : 'wikidata'
    id       : 'Q27596325'
    format   : 'Windows Bitmap, version 4'
    URI      : 'http://www.wikidata.org/entity/Q27596325'
    mime     : 
    basis    : 'byte match at 0, 35'
    source   : 'PRONOM (Wikidata) (source date: 2017-08-08)'
    warning  : 'extension mismatch'
    software : 

TODO

More to come...

richardlehane commented 3 years ago

Thanks Ross - this is an interesting POC!

If it's desirable to have structured data within results, I'd suggest starting higher in the stack and looking at the Values() method in the Identification interface, which currently just returns a slice of strings. Making the change there would lead to a cleaner implementation in the writers, without the need for special casing.

But what would you change it to? For the software use case, it would need to be a map, because each of the software items has keys (the software name) and values (the Q reference). Perhaps you could introduce a new Value interface which would be a string normally, but could also be a map or a list (or even an int or other things down the line)? I.e. a Values() []Value signature.
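To make the Value idea concrete, here is a minimal sketch. Everything in it is an assumption for discussion, not existing siegfried code: the `Value` interface, its `Render` method, and the `StringValue`, `ListValue` and `MapValue` types are all hypothetical names. `Render` gives writers that cannot nest (CSV, for example) a flat fallback.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Value is a hypothetical interface for a single result field.
// It is not part of siegfried; it only illustrates the proposal.
type Value interface {
	// Render returns a flat, single-line form of the value.
	Render() string
}

// StringValue is the common case: a plain string.
type StringValue string

func (s StringValue) Render() string { return string(s) }

// ListValue is an ordered list of strings.
type ListValue []string

func (l ListValue) Render() string { return strings.Join(l, "; ") }

// MapValue holds key/value pairs, e.g. software name -> Wikidata URI.
type MapValue map[string]string

func (m MapValue) Render() string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	// Go map iteration order is not stable, so sort for repeatable output.
	sort.Strings(keys)
	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		parts = append(parts, fmt.Sprintf("%s (%s)", k, m[k]))
	}
	return strings.Join(parts, "; ")
}

func main() {
	// A hypothetical Values() []Value result for one match.
	values := []Value{
		StringValue("New Executable"),
		MapValue{
			"Windows 8": "http://www.wikidata.org/entity/Q5046",
			"Windows 7": "http://www.wikidata.org/entity/Q11215",
		},
	}
	for _, v := range values {
		fmt.Println(v.Render())
	}
}
```

A YAML or JSON writer could type-switch on `MapValue` and `ListValue` to emit nested structures, and fall back to `Render()` for anything it doesn't recognise.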

A simplification would just be to say that you can either have a single string or a list of strings. I.e. a Values() [][]string signature. This would be easier to implement. But it would be less expressive in your software case and you'd have to accept:

software :
    - Windows 8 (http://www.wikidata.org/entity/Q5046)
    - Windows 7 (http://www.wikidata.org/entity/Q11215)
    - Windows 98 (http://www.wikidata.org/entity/Q483132)
    - Windows 10 (http://www.wikidata.org/entity/Q18168774)

But I guess if you are compromising like that, you could also just do something like:

software : Windows 8 (http://www.wikidata.org/entity/Q5046); Windows 7 (http://www.wikidata.org/entity/Q11215); Windows 98 (http://www.wikidata.org/entity/Q483132); Windows 10 (http://www.wikidata.org/entity/Q18168774)
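Both compromises can be served by the same writer logic. The sketch below assumes the hypothetical `Values() [][]string` shape, where each field is a slice of zero or more strings. `formatField` and its `asList` flag are illustrative names, not siegfried functions, and the sketch skips YAML quoting entirely.

```go
package main

import (
	"fmt"
	"strings"
)

// formatField renders one field from a hypothetical Values() [][]string
// result. A single value is written inline; multiple values are written
// either as an indented list (asList) or joined with "; " on one line.
func formatField(key string, vals []string, asList bool) string {
	switch {
	case len(vals) == 0:
		return key + " :"
	case len(vals) == 1:
		return key + " : " + vals[0]
	case !asList:
		return key + " : " + strings.Join(vals, "; ")
	}
	var b strings.Builder
	b.WriteString(key + " :")
	for _, v := range vals {
		b.WriteString("\n    - " + v)
	}
	return b.String()
}

func main() {
	software := []string{
		"Windows 8 (http://www.wikidata.org/entity/Q5046)",
		"Windows 7 (http://www.wikidata.org/entity/Q11215)",
	}
	fmt.Println(formatField("software", software, true))  // list form
	fmt.Println(formatField("software", software, false)) // single-line form
}
```

The single-line form has the advantage of needing no change to Values() at all, since it is still just a string; the cost is that consumers have to split on "; " themselves.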