richardlehane / siegfried

signature-based file format identification
http://www.itforarchivists.com/siegfried
Apache License 2.0

misidentification: x-fmt/45 files identifying as fmt/40 #89

Closed: richardlehane closed this issue 8 years ago

richardlehane commented 8 years ago

Two docs in the govdocs corpus are being misidentified:

govdocs_selected\DOC_41\278791.doc
govdocs_selected\TEXT_47\192367.text

Relevant segments in the signature do seem to be matching:

[screenshot: capture]

I'm not sure why hits are being reported at offsets 1, 2, 4, 6, and 8, since x-fmt/45's signature only has a bitmask at offset 10. Maybe the minimum offset isn't being honoured by the frame matcher?

richardlehane commented 8 years ago

This bug is caused by the repetition of File elements in the container signature file. I will add a new parsing rule that keeps only the last of a set of repeated elements.

[screenshot: capture]
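
A minimal sketch of what a "keep only the last of repeated elements" rule could look like. This is not siegfried's actual parser: the `File` type, its `Path` and `Sig` fields, and the `lastOfRepeated` function are hypothetical names made up for illustration, and the sketch assumes that repeated File elements are the ones sharing the same path.

```go
package main

// File is a hypothetical, simplified stand-in for a File element parsed
// from a container signature file. The field names are assumptions.
type File struct {
	Path string // path inside the container, used here to detect repeats
	Sig  string // placeholder for the element's signature data
}

// lastOfRepeated returns one File per distinct Path. When a Path is
// repeated, the last occurrence wins, and it takes the position of the
// first occurrence so the overall order stays stable.
func lastOfRepeated(files []File) []File {
	idx := make(map[string]int, len(files)) // Path -> index in out
	out := make([]File, 0, len(files))
	for _, f := range files {
		if i, seen := idx[f.Path]; seen {
			out[i] = f // a later repeat overwrites the earlier element
			continue
		}
		idx[f.Path] = len(out)
		out = append(out, f)
	}
	return out
}
```

Given three elements with paths "WordDocument", "CompObj", "WordDocument", the function would return two elements, with the "WordDocument" entry carrying the data of the later repeat.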

richardlehane commented 8 years ago

By the way, all those hits are being reported because the signature is being matched by the ac engine within bytematcher rather than by the frame engine (choices is set to 128 and range to 4096), so that is working as expected.