richardlehane / siegfried

signature-based file format identification
http://www.itforarchivists.com/siegfried
Apache License 2.0

Multiple identification unexpected outcomes #146

Closed Dclipsham closed 1 year ago

Dclipsham commented 3 years ago

This could be my misuse/misunderstanding of params...

I need to test for multiple identification outcomes as part of a service we're building. To mimic this test, I have created a pair of invalid files (that is, they don't represent real data that can be decoded by any format renderer) that contain the identification patterns required for multiple formats.

hybrid_jpeg_html_file.jpg contains the identification patterns of both Raw JPEG Stream (fmt/41) and Hypertext Markup Language (fmt/96).

hybrid_jpeg_mov_file.mov contains the identification patterns of both Raw JPEG Stream (fmt/41) and Quicktime (x-fmt/384).

fake_hybrid_files.zip

DROID returns multiple IDs as desired, but Siegfried (out of the box, Windows 10) seems to return only the first match. Using 'roy build -multi 3' and re-running gives the expected multiple identification outcome, but has the unwanted side effect of no longer honouring priority relationships (in the example below, MusicXML (fmt/896) and XML (fmt/101)). 'roy build -multi 2' stops giving me the multiple IDs I'm after.

Outputs below.

Multi 3:

C:\temp\siegfried_1-8-0_win64\win64>roy build -multi 3

C:\temp\siegfried_1-8-0_win64\win64>sf ..\..\hybrid_jpeg_html_file.jpg
---
siegfried   : 1.8.0
scandate    : 2020-08-05T14:21:13+01:00
signature   : default.sig
created     : 2020-08-05T14:20:23+01:00
identifiers :
  - name    : 'pronom'
    details : 'DROID_SignatureFile_V96.xml; container-signature-20200330.xml; multi set to comprehensive (3)'
---
filename : '..\..\hybrid_jpeg_html_file.jpg'
filesize : 60
modified : 2020-08-05T12:12:44+01:00
errors   :
matches  :
  - ns      : 'pronom'
    id      : 'fmt/96'
    format  : 'Hypertext Markup Language'
    version :
    mime    : 'text/html'
    basis   : 'byte match at [[3 5] [51 7]] (signature 1/2)'
    warning : 'extension mismatch'
  - ns      : 'pronom'
    id      : 'fmt/41'
    format  : 'Raw JPEG Stream'
    version :
    mime    : 'image/jpeg'
    basis   : 'extension match jpg; byte match at [[0 3] [58 2]] (signature 1/2)'
    warning :

C:\temp\siegfried_1-8-0_win64\win64>sf ..\..\hybrid_jpeg_mov_file.mov
---
siegfried   : 1.8.0
scandate    : 2020-08-05T14:45:58+01:00
signature   : default.sig
created     : 2020-08-05T14:20:23+01:00
identifiers :
  - name    : 'pronom'
    details : 'DROID_SignatureFile_V96.xml; container-signature-20200330.xml; multi set to comprehensive (3)'
---
filename : '..\..\hybrid_jpeg_mov_file.mov'
filesize : 69
modified : 2020-08-05T12:46:09+01:00
errors   :
matches  :
  - ns      : 'pronom'
    id      : 'fmt/41'
    format  : 'Raw JPEG Stream'
    version :
    mime    : 'image/jpeg'
    basis   : 'byte match at [[0 3] [67 2]] (signature 1/2)'
    warning : 'extension mismatch'
  - ns      : 'pronom'
    id      : 'x-fmt/384'
    format  : 'Quicktime'
    version :
    mime    : 'video/quicktime'
    basis   : 'extension match mov; byte match at 4, 12 (signature 1/8)'
    warning :

Multi 2:

C:\temp\siegfried_1-8-0_win64\win64>roy build -multi 2

C:\temp\siegfried_1-8-0_win64\win64>sf ..\..\hybrid_jpeg_mov_file.mov
---
siegfried   : 1.8.0
scandate    : 2020-08-05T14:46:13+01:00
signature   : default.sig
created     : 2020-08-05T14:46:02+01:00
identifiers :
  - name    : 'pronom'
    details : 'DROID_SignatureFile_V96.xml; container-signature-20200330.xml; multi set to positive (2)'
---
filename : '..\..\hybrid_jpeg_mov_file.mov'
filesize : 69
modified : 2020-08-05T12:46:09+01:00
errors   :
matches  :
  - ns      : 'pronom'
    id      : 'x-fmt/384'
    format  : 'Quicktime'
    version :
    mime    : 'video/quicktime'
    basis   : 'extension match mov; byte match at 4, 12 (signature 1/8)'
    warning :

C:\temp\siegfried_1-8-0_win64\win64>sf ..\..\hybrid_jpeg_html_file.jpg
---
siegfried   : 1.8.0
scandate    : 2020-08-05T14:46:25+01:00
signature   : default.sig
created     : 2020-08-05T14:46:02+01:00
identifiers :
  - name    : 'pronom'
    details : 'DROID_SignatureFile_V96.xml; container-signature-20200330.xml; multi set to positive (2)'
---
filename : '..\..\hybrid_jpeg_html_file.jpg'
filesize : 60
modified : 2020-08-05T12:12:44+01:00
errors   :
matches  :
  - ns      : 'pronom'
    id      : 'fmt/41'
    format  : 'Raw JPEG Stream'
    version :
    mime    : 'image/jpeg'
    basis   : 'extension match jpg; byte match at [[0 3] [58 2]] (signature 1/2)'
    warning :

Multi 3 scan of a legitimate MusicXML file (which I'm unable to share):

C:\temp\siegfried_1-8-0_win64\win64>roy build -multi 3

C:\temp\siegfried_1-8-0_win64\win64>sf ..\..\ActorPreludeSample.musicxml
---
siegfried   : 1.8.0
scandate    : 2020-08-05T15:35:04+01:00
signature   : default.sig
created     : 2020-08-05T15:34:49+01:00
identifiers :
  - name    : 'pronom'
    details : 'DROID_SignatureFile_V96.xml; container-signature-20200330.xml; multi set to comprehensive (3)'
---
filename : '..\..\ActorPreludeSample.musicxml'
filesize : 1225562
modified : 2020-08-05T15:33:10+01:00
errors   :
matches  :
  - ns      : 'pronom'
    id      : 'fmt/101'
    format  : 'Extensible Markup Language'
    version : '1.0'
    mime    : 'application/xml'
    basis   : 'byte match at 0, 19'
    warning : 'extension mismatch'
  - ns      : 'pronom'
    id      : 'fmt/896'
    format  : 'MusicXML'
    version :
    mime    : 'application/vnd.recordare.musicxml+xml'
    basis   : 'extension match musicxml; byte match at 0, 147'
    warning :

richardlehane commented 3 years ago

Hi @Dclipsham, sf's default mode doesn't return the first match; it returns the first match for which there is nothing superior in the priority tree (it applies priorities during scanning, rather than after). In practice, this means sf is much less likely than DROID to return multiple IDs in its default mode, and it generally won't give the results you want for polyglot files.

You're right to use the "-multi" switch to change this behaviour, but unfortunately all the currently available modes dispense with the priorities altogether. I haven't created a multi mode that replicates DROID's behaviour of doing the scan and then using the priorities afterwards to filter the results. It's certainly possible to create such a mode, though, and I think it might be a nice feature to have, so I'll mark this as a feature request.

Hopefully there'll be a new release in the next couple of months and I'll try to get this into it.

All the best,
Richard
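
For readers following along, the "scan first, then filter by priority" behaviour described above could be sketched roughly as follows. This is an illustration only, not siegfried's or DROID's actual code: the function name `filter_after_scan` and the `PRIORITIES` table are hypothetical, only direct (non-transitive) priority relationships are handled, and the one relationship shown (MusicXML over XML) is the one mentioned in this thread.

```python
from typing import Dict, List, Set

# Hypothetical priority table: maps a format ID to the IDs it outranks.
# Only the MusicXML-over-XML relationship is taken from this thread.
PRIORITIES: Dict[str, Set[str]] = {"fmt/896": {"fmt/101"}}


def filter_after_scan(matches: List[str],
                      priorities: Dict[str, Set[str]]) -> List[str]:
    """Keep every match that no other match in the same result outranks."""
    outranked: Set[str] = set()
    for match in matches:
        # Collect everything that this match has priority over.
        outranked |= priorities.get(match, set())
    # Preserve the original order of the surviving matches.
    return [m for m in matches if m not in outranked]


# XML is dropped because MusicXML has priority over it.
print(filter_after_scan(["fmt/101", "fmt/896"], PRIORITIES))  # ['fmt/896']
# No priority relationship between HTML and JPEG, so both survive.
print(filter_after_scan(["fmt/96", "fmt/41"], PRIORITIES))    # ['fmt/96', 'fmt/41']
```

Applied to the "multi 3" outputs earlier in the thread, a filter like this would keep both IDs for the hybrid files while reducing the MusicXML result to fmt/896 alone.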

Dclipsham commented 3 years ago

Thanks Richard, this makes sense and would be a welcome addition.

I'm curious as to why, in the above examples, SF favoured one format over the other where there isn't a PRONOM priority set.

I.e. in the JPG/HTML hybrid it chose JPG, but in the MOV/JPG hybrid it chose MOV. Is it that the positive extension outcome weighted it further?

richardlehane commented 3 years ago

For the JPG/HTML file, it started scanning and got the JPG match before the HTML match. Once it confirmed the JPG match, it would only consider further matches if they had priority over JPG (e.g. more specific flavours of JPG). For the MOV/JPG file, it hit MOV first and would then only consider more specific MOVs, discarding the JPG. I.e. it kind of funnels in on the most specific possible match.
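
The funnelling described above can be approximated with a small sketch. It is a simplification under stated assumptions, not siegfried's implementation: the real matcher applies priorities while scanning bytes, whereas this hypothetical `funnel` function takes the order in which matches were hit as a given input, and the `PRIORITIES` table is an illustrative stand-in containing only the MusicXML-over-XML relationship from this thread.

```python
from typing import Dict, List, Optional, Set

# Hypothetical priority table: maps a format ID to the IDs it outranks.
PRIORITIES: Dict[str, Set[str]] = {"fmt/896": {"fmt/101"}}


def funnel(scan_order: List[str],
           priorities: Dict[str, Set[str]]) -> Optional[str]:
    """Return one ID: the first hit, refined only by hits that outrank it."""
    current: Optional[str] = None
    for candidate in scan_order:
        if current is None:
            # The first confirmed match becomes the working answer.
            current = candidate
        elif current in priorities.get(candidate, set()):
            # A later hit replaces it only if it has priority over it.
            current = candidate
        # Any other later hit is discarded.
    return current


print(funnel(["fmt/41", "fmt/96"], PRIORITIES))      # fmt/41 (JPG hit first)
print(funnel(["x-fmt/384", "fmt/41"], PRIORITIES))   # x-fmt/384 (MOV hit first)
print(funnel(["fmt/101", "fmt/896"], PRIORITIES))    # fmt/896 (outranks XML)
```

With no priority relationship between the two formats in each hybrid file, whichever signature is hit first wins, which matches the JPG and MOV outcomes reported for "multi 2" above.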

richardlehane commented 1 year ago

v1.10.0 has a new "droid" multi mode, available when building with roy.