richardlehane / siegfried

signature-based file format identification
http://www.itforarchivists.com/siegfried
Apache License 2.0

Match produces 3 MB of basis information #111

Closed nkrabben closed 4 years ago

nkrabben commented 6 years ago

When I run siegfried across a collection, it slows down badly on a group of MP3s (https://www.nationalarchives.gov.uk/pronom/fmt/134).

The basis field for these matches can be up to 3,000,000 characters long with repetitions of similar data such as [795 105] 9,650 and [807 105] 13,068. I'm not sure what's causing this. If useful, I can probably provide a copy of the files causing this bug in the new year.

richardlehane commented 6 years ago

Thanks Nick, this is something that has cropped up before (https://github.com/richardlehane/siegfried/issues/94). My previous fixes have been stop-gaps, but I've been tinkering with a more fundamental fix and it is good to have this as a prompt to work on it.

Basically, this issue occurs for "noisy" signatures with multiple segments that generate lots of partial matches. fmt/134 is probably the worst offender. I'm currently too exhaustive in following up these matches. I'd like to make this bit of the code "lazier", but it has been hard to do so without breaking everything.

richardlehane commented 6 years ago

Hi @nkrabben, I'm working on this issue now. If you have a sample you can share (either here or via email to keep it private), that would be a great help.

richardlehane commented 6 years ago

This is partly fixed (the verbose basis part) in v1.7.9, but I have more work to do on the speed side of this issue, so I'm re-opening it.

richardlehane commented 4 years ago

The second part of this bug (the slowdown for MP3s with lots of matches) is now fixed on the develop branch and will be in the next release. See this issue: https://github.com/richardlehane/siegfried/issues/128