richardlehane / siegfried

signature-based file format identification
http://www.itforarchivists.com/siegfried
Apache License 2.0

Certain files send Siegfried in a loop #94

Closed workflowsguy closed 7 years ago

workflowsguy commented 7 years ago

I have recently encountered two files (one of which is attached) that siegfried cannot handle. When called with sf FILE.EXT, siegfried never finishes. While it runs, it also appears to consume main memory continuously until all of it is exhausted, bringing down the system.

Data:

siegfried 1.6.7
default.sig (2016-11-22T20:59:52+11:00)
identifiers: 
  - pronom: DROID_SignatureFile_V88.xml; container-signature-20160927.xml

installed on OS X 10.9.5 with brew install richardlehane/digipres/siegfried

[Sheet Music - Score - Piano] Coldplay - Clocks.pdf

richardlehane commented 7 years ago

thanks for this report @workflowsguy

From an initial look, it seems that sf is getting bogged down scanning endlessly for fmt/134 (MPEG 1/2). fmt/134 is a pretty verbose signature, so it may be that sf hits a deluge of false partial matches and runs out of memory trying to satisfy them all.

An interim fix that works for this file is to change roy's segmentation settings with the -range flag. Along with the -distance and -choices flags, this flag controls how roy splits up signatures when building the search tree. For example, if you run roy build -range 1028, the scan doesn't hang. The segmentation settings don't affect accuracy, but they do affect scanning speed.

sf shouldn't hang no matter what signatures are used or what the segmentation settings are, so I will see what I can do for the next release to fix this properly.
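
For anyone scripting around this in the meantime, here is a minimal sketch of the workaround above: build the argument list for roy build -range 1028, and run a command under a time limit so that a hanging scan is killed instead of exhausting memory. The function names and the timeout approach are illustrative assumptions, not part of siegfried; only the sf and roy invocations come from this thread.

```python
import subprocess
import sys


def roy_build_command(max_range=1028):
    """Return the argv for rebuilding the signature file with a custom -range.

    The value 1028 is the one quoted in this thread; the helper name is
    hypothetical.
    """
    return ["roy", "build", "-range", str(max_range)]


def run_with_timeout(argv, timeout_seconds):
    """Run argv and return (finished, stdout).

    finished is False when the time limit was reached. subprocess.run kills
    the child process before raising TimeoutExpired, so no scan is left
    running in the background.
    """
    try:
        result = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout_seconds
        )
    except subprocess.TimeoutExpired:
        return False, ""
    return True, result.stdout
```

A possible usage, assuming roy and sf are on the PATH: call subprocess.run(roy_build_command(), check=True) once to rebuild the signature file, then run_with_timeout(["sf", "FILE.EXT"], 60) for each file and treat a False result as a file to investigate. The 60 second limit is an arbitrary choice.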

richardlehane commented 7 years ago

this is now fixed in sf 1.7.0.

The current fix is to change the signature processing rules to prevent large segments. Digging into this code, I think there are some opportunities to optimise it, so I hope to do a bit more on this issue over the next few months.