richardlehane / siegfried

signature-based file format identification
http://www.itforarchivists.com/siegfried
Apache License 2.0

Identify files by extension only #102

Closed by nkrabben 7 years ago

nkrabben commented 7 years ago

I'd like to use Siegfried for a drive survey tool. Depending on the amount of time available, I'd like to turn various Siegfried functions on or off, for example calculating hashes only when time permits. If possible, I'd like to perform format identification via extensions alone, as a less-certain-but-still-useful quick survey method.

Is it possible to implement a command line flag to do this in Siegfried? If not, would I have to recompile the format signatures without any byte signatures to achieve the same effect?

richardlehane commented 7 years ago

Hi Nick

In terms of just doing ext matching, the closest you can get to this at the moment is something like: roy build -name speedy -bof 1 -noeof -notext -nocontainer speedy.sig

You can then run sf with this sig using: sf -sig speedy.sig FILE | DIR.

If you are connecting to another tool/workflow you might consider using the HTTP API (sf -serve localhost:5138). This API accepts both "sig" and "hash" params, so it could be used selectively depending on file size, the options selected by users in your tool, etc.
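As a rough sketch of how a survey tool might use those two params selectively, the Go snippet below picks a "sig" and "hash" value per file and builds a request URL. Only the server address and the param names "sig" and "hash" come from the comment above. The /identify/ path shape, the "md5" hash value, and the chooseParams and identifyURL helpers are assumptions made for illustration, so check them against the siegfried documentation before relying on them.

```go
package main

import (
	"fmt"
	"net/url"
)

// chooseParams is a hypothetical policy: files larger than limit get the
// reduced signature file and no hash; smaller files use the server's default
// signature and are hashed if the caller asked for it.
func chooseParams(size, limit int64, wantHash bool) (sig, hash string) {
	if size > limit {
		return "speedy.sig", ""
	}
	if wantHash {
		return "", "md5" // assumed hash name
	}
	return "", ""
}

// identifyURL builds a GET URL for a server started with `sf -serve host`.
// Assumption: the endpoint is /identify/<percent-encoded path>.
func identifyURL(host, path, sig, hash string) string {
	q := url.Values{}
	if sig != "" {
		q.Set("sig", sig)
	}
	if hash != "" {
		q.Set("hash", hash)
	}
	u := fmt.Sprintf("http://%s/identify/%s", host, url.PathEscape(path))
	if len(q) > 0 {
		u += "?" + q.Encode()
	}
	return u
}

func main() {
	// A 5 GiB file against a 1 GiB limit: quick signature, no hash.
	sig, hash := chooseParams(5<<30, 1<<30, true)
	fmt.Println(identifyURL("localhost:5138", "/data/big file.iso", sig, hash))
}
```

Keeping the policy in one small function means the time budget or user options can change without touching the request code.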

That custom signature above still does a minimal amount of byte matching (no end of file signatures and just looks at the first few bytes). There is currently no way to switch byte matching off altogether (not a use case I ever envisioned!).

I'll look at adding a -nobyte flag to roy that would work similarly to -notext, -nocontainer, -noxml etc. and would omit the byte matcher when building the signature file. This will be nice to have for consistency in any case.

A couple of caveats:

  1. Even with a -nobyte flag that excludes byte matching, sf will still do a little bit of IO in normal operation. This is because the sf command itself grabs a buffer for the file (https://github.com/richardlehane/siegfried/blob/master/cmd/sf/sf.go#L176) and this results in a peek at the first 8k bytes. It does this because file contents aren't just used by the matching process, they're also necessary for hashing and decompressing archive formats, so it makes sense to do the read once and share the buffer amongst these different processes. If you don't want any IO at all, there are a couple of ways to work around it: if you use the HTTP API you could use the POST option and just send a dummy empty file with every request; or you could write a small go script invoking siegfried as a golang package, giving you full control (easier than it perhaps sounds, I promise!).

  2. Currently, if a file is matched by extension only, and there are multiple results for that extension, then sf reports UNKNOWN and gives the multiple results as possibilities in the warning field. The rationale is that an extension is a weak basis for a result, so it is better for the user to make a manual ID than to be given multiple hits. Unfortunately this will probably happen quite a bit when doing ext only matching, e.g. consider all the PUIDs that have a *.pdf extension. What would your preference be in this scenario? To get multiple results in these cases? Or to get a single UNKNOWN result with a big warning message? If you'd prefer to report multiple results in these cases, there is an additional flag you can add when building your signature file: -multi exhaustive. I.e. currently: roy build -name speedy -bof 1 -noeof -notext -nocontainer -multi exhaustive speedy.sig and, in the next release: roy build -name extonly -nobyte -notext -nocontainer -multi exhaustive extonly.sig
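To make the choice in the second caveat concrete, here is a toy Go model of the two behaviours. It is not siegfried's code: extTable, Result, and identifyByExt are hypothetical names, the table holds only a small sample of PDF PUIDs plus a made-up "foo" entry for the single-match case, and the warning strings are illustrative wording.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// extTable is illustrative sample data, not a real signature file.
// The "foo" entry is made up to show an extension with a single match.
var extTable = map[string][]string{
	"pdf": {"fmt/14", "fmt/15", "fmt/16"},
	"foo": {"example/1"},
}

// Result is a hypothetical, simplified identification result.
type Result struct {
	IDs     []string
	Warning string
}

// identifyByExt looks at the file name only. With exhaustive set to false,
// an ambiguous extension yields UNKNOWN plus the candidates in the warning;
// with exhaustive set to true, every candidate is returned as a result.
func identifyByExt(name string, exhaustive bool) Result {
	ext := strings.ToLower(strings.TrimPrefix(filepath.Ext(name), "."))
	ids := extTable[ext]
	switch {
	case len(ids) == 0:
		return Result{IDs: []string{"UNKNOWN"}, Warning: "no match"}
	case len(ids) == 1 || exhaustive:
		return Result{IDs: ids, Warning: "match on extension only"}
	default:
		return Result{
			IDs:     []string{"UNKNOWN"},
			Warning: fmt.Sprintf("possibilities based on extension: %s", strings.Join(ids, ", ")),
		}
	}
}

func main() {
	fmt.Printf("%+v\n", identifyByExt("report.pdf", false))
	fmt.Printf("%+v\n", identifyByExt("report.pdf", true))
}
```

For a quick survey, the exhaustive variant keeps every candidate in the results, at the cost of more than one row per file in the output.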

richardlehane commented 7 years ago

As discussed, v1.7.4 introduces a -nobyte flag for roy.

Example usage:

roy build -name extonly -nobyte -nocontainer -notext -multi exhaustive extonly.sig

sf -sig extonly.sig DIR