richardlehane / siegfried

signature-based file format identification
http://www.itforarchivists.com/siegfried
Apache License 2.0
224 stars 30 forks source link

Question re: byte matching algorithm #133

Closed jnik-aarhus closed 4 years ago

jnik-aarhus commented 4 years ago

Hi Richard,

I am currently working as a software developer for the City Archives in Aarhus, Denmark, and we are interested in incorporating PRONOM signatures in our work. To that end, I have been perusing several tools taking advantage of DROID signature files, siegfried included. As I am unfamiliar with go, I have some questions regarding the byte matcher.

I am interested in knowing how you use the PRONOM signatures as given in the DROID signature file. Per their paper, DROID uses Boyer-Moore-Horspool for pattern matching. Does siegfried take the same approach, or is there another algorithm involved? It is my understanding that fido, for instance, translates the byte sequences to regex and uses that for pattern matching instead, but I have seen no indications of regex use in siegfried's source code.

Thank you in advance for taking the time to answer my questions!

-- Nina

richardlehane commented 4 years ago

Hi Nina the place to look for the algorithmic stuff is in this external package I wrote to support sf: https://github.com/richardlehane/match. There are a bunch of my experiments in there (including ongoing ones) but the code currently used is in the "fwac" folder. This is a tweaked version of the Aho Corasick multiple string matching algorithm (https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm). The rationale for using a multiple string matching algorithm is that PRONOM has 1000s of signatures, so it makes sense to search them all at once, if possible.

Re. DROID, I'm not 100% sure what they do, but I suspect they no longer just use Boyer-Moore-Horspool. The package that powers their byte matching is written by Matt Palmer: https://github.com/nishihatapalmer/byteseek & is worth a look. Matt may be able to give you more insights.

You might have seen it already but, if you're looking to compare the three tools, you might find my benchmarks page helpful. It compares both speed and accuracy of the tools with different configurations and against different large corpuses.

Happy to help with any other info you need, Richard

jnik-aarhus commented 4 years ago

Hi Richard, Thank you for a very in-depth answer! I have one more question. In siegfried do you use PRONOM signatures directly, e.g. these, or do you use the preprocessed DROID signature files? Thanks again.

-- Nina

richardlehane commented 4 years ago

sf has its own custom binary signature format. You can use the roy tool to build sf signature files yourself with either the PRONOM xml or DROID xml files as input (or the other signature types that sf supports: mimeinfo & library of congress FDD).

For the default signature file, I use PRONOM xml as input as there are some small errors in the DROID xml (e.g. I think for the flac signature).