richardlehane / siegfried

signature-based file format identification
http://www.itforarchivists.com/siegfried
Apache License 2.0

file with many FF xx sequences grinds to a halt #128

Closed richardlehane closed 4 years ago

richardlehane commented 5 years ago

A file with many FF xx sequences can generate so many hits against https://nationalarchives.gov.uk/PRONOM/fmt/134 that matching seemingly grinds to a halt. Even a 500-byte file filled with FF xx sequences can take more than 30 seconds to complete.
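
For anyone wanting to reproduce the slowdown, here is a minimal Go sketch that writes a 500-byte file made entirely of FF xx pairs. The contents of the original sample file are not described in this thread, so the choice of xx values (cycling upward from 00) and the output file name `ffxx-sample.bin` are assumptions made purely for illustration.

```go
package main

import (
	"fmt"
	"os"
)

// ffPairs returns n bytes of repeated FF xx pairs. Even offsets hold 0xFF;
// odd offsets hold an xx value that counts upward from 0x00 and wraps.
// The xx pattern is an assumption, not taken from the original sample file.
func ffPairs(n int) []byte {
	buf := make([]byte, n)
	for i := range buf {
		if i%2 == 0 {
			buf[i] = 0xFF
		} else {
			buf[i] = byte(i / 2)
		}
	}
	return buf
}

func main() {
	data := ffPairs(500)
	// "ffxx-sample.bin" is a hypothetical file name.
	if err := os.WriteFile("ffxx-sample.bin", data, 0o644); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("wrote %d bytes\n", len(data))
}
```

Running sf against the resulting file, with and without fmt/134 in the signature file, should give a rough sense of whether a given build is affected. A different xx pattern may well behave differently.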

Can be "solved" (only as a work-around) by building a signature file without fmt/134: roy build -exclude fmt/134

A proper solution means optimising the matching code in the bytematcher.

Report and sample file provided by @fozboz.

jesswhyte commented 4 years ago

I am running another collection from the same donor and coming up against this issue again. Some of the problem files (Sound Designer II audio files, .sd2) also have long FF sequences.

richardlehane commented 4 years ago

Hi Jess, I've not forgotten this bug. It has proven pretty tricky to resolve, but I have been working on it, unfortunately mostly in my head. Thank you for the extra incentive to fix it; I hope to release something in the next few weeks!

richardlehane commented 4 years ago

Just an update on this issue: I've finally got a working solution to this (on the "develop" branch) and it will be in the next sf release. I'll time the release to follow the next PRONOM update (expected next week: https://twitter.com/Britpunk80/status/1207331770301108229).

richardlehane commented 4 years ago

I believe this is now fixed in 1.8.0. Please re-open if you encounter any similar issues.