richardlehane / siegfried

signature-based file format identification
http://www.itforarchivists.com/siegfried
Apache License 2.0

Text matcher not allocating hits to the correct identifier when there are multiple identifiers, including mimeinfo #101

Closed (richardlehane closed 7 years ago)

richardlehane commented 7 years ago

This bug was mentioned in #100, but I'm adding a second ticket to track it. I have made a signature file that makes the issue a bit more obvious: I added tika, freedesktop and pronom identifiers to it, in that order. I get these results on a text file:

---
siegfried   : 1.7.2
scandate    : 2017-05-15T11:54:23+10:00
signature   : default.sig
created     : 2017-05-15T11:52:05+10:00
identifiers : 
  - name    : 'tika'
    details : 'tika-mimetypes.xml'
  - name    : 'freedesktop.org'
    details : 'freedesktop.org.xml'
  - name    : 'pronom'
    details : 'DROID_SignatureFile_V88.xml; container-signature-20160927.xml'
---
filename : 'bla.txt'
filesize : 27
modified : 2017-05-15T11:52:37+10:00
errors   : 
matches  :
  - ns      : 'tika'
    id      : 'text/plain'
    format  : 
    mime    : 'text/plain'
    basis   : 'extension match txt; text match ASCII; text match ASCII; text match ASCII'
    warning : 'match on filename and text only; byte/xml signatures for this format did not match'
  - ns      : 'freedesktop.org'
    id      : 'UNKNOWN'
    format  : 
    mime    : 'UNKNOWN'
    basis   : 
    warning : 'no match; possibilities based on filename are text/plain'
  - ns      : 'pronom'
    id      : 'UNKNOWN'
    format  : 
    version : 
    mime    : 
    basis   : 
    warning : 'no match; possibilities based on extension are x-fmt/111'

In this example, the tika identifier "steals" all the text hits from the subsequent identifiers and reports them in its own result (note the three "text match ASCII" entries in its basis). Ross's example showed the pronom and tika identifiers both getting a text match, so this issue is probably in the mimeinfo code (i.e. pronom identifiers are not stealing text hits).
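
To make the symptom concrete, here is a minimal sketch of how one identifier could end up claiming every text hit. It is purely illustrative and is not siegfried's actual code: the `identifier` type, the `claims`/`claimsBuggy` methods, the `allocate` function and the idea that each identifier owns a range of hit indexes are all assumptions made up for this example.

```go
package main

import "fmt"

// identifier is a hypothetical stand-in for one namespace in a signature file.
// It is assumed to own the half-open range [start, start+count) of hit indexes
// reported by a shared text matcher.
type identifier struct {
	name  string
	start int
	count int
}

// claimsBuggy checks only the lower bound, so the first identifier
// (start == 0) claims every hit, including those meant for later identifiers.
func (id identifier) claimsBuggy(idx int) bool {
	return idx >= id.start
}

// claims checks both bounds, so each identifier only takes its own hits.
func (id identifier) claims(idx int) bool {
	return idx >= id.start && idx < id.start+id.count
}

// allocate gives each hit index to the first identifier that claims it
// and returns the number of hits allocated per identifier name.
func allocate(ids []identifier, hits []int, claim func(identifier, int) bool) map[string]int {
	out := make(map[string]int)
	for _, h := range hits {
		for _, id := range ids {
			if claim(id, h) {
				out[id.name]++
				break
			}
		}
	}
	return out
}

func main() {
	// Same identifier order as in the signature file above.
	ids := []identifier{
		{"tika", 0, 1},
		{"freedesktop.org", 1, 1},
		{"pronom", 2, 1},
	}
	hits := []int{0, 1, 2} // one text hit per identifier

	fmt.Println("buggy:", allocate(ids, hits, identifier.claimsBuggy))
	fmt.Println("fixed:", allocate(ids, hits, identifier.claims))
}
```

With the buggy check, all three hits land on tika, which mirrors the three "text match ASCII" entries in the tika basis while the other two identifiers report no match. Whether the real cause looks anything like this is an assumption; the sketch only shows the shape of the misallocation.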

richardlehane commented 7 years ago

Fixed in the 1.7.3 release.