richardlehane / siegfried

signature-based file format identification
http://www.itforarchivists.com/siegfried
Apache License 2.0

siegfried over S3 #169

Closed · Rbn3D closed this 2 years ago

Rbn3D commented 2 years ago

I'm trying to use siegfried to identify files that are not located on the local filesystem, but in cloud storage (Amazon S3 in this case).

I've tried creating a Go project that uses siegfried as a library, so that I can pass the S3 stream of a file directly to siegfried. It worked, but performance is very bad (even though it's using a stream, which should be relatively fast).

It seems like siegfried is reading the whole file in every case I tested, not just the relevant parts.
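
For reference, this is roughly what my code looks like (a simplified sketch: the bucket and key are placeholders, and I'm assuming the documented aws-sdk-go-v2 and siegfried library APIs):

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
	"github.com/richardlehane/siegfried"
)

func main() {
	// Load a signature file (path is a placeholder).
	sf, err := siegfried.Load("default.sig")
	if err != nil {
		log.Fatal(err)
	}

	cfg, err := config.LoadDefaultConfig(context.TODO())
	if err != nil {
		log.Fatal(err)
	}
	client := s3.NewFromConfig(cfg)

	// Fetch the object and pass its body (an io.ReadCloser)
	// straight to siegfried. "my-bucket" and "my-key" are placeholders.
	obj, err := client.GetObject(context.TODO(), &s3.GetObjectInput{
		Bucket: aws.String("my-bucket"),
		Key:    aws.String("my-key"),
	})
	if err != nil {
		log.Fatal(err)
	}
	defer obj.Body.Close()

	// This works, but appears to consume the entire stream.
	ids, err := sf.Identify(obj.Body, "my-key", "")
	if err != nil {
		log.Fatal(err)
	}
	for _, id := range ids {
		fmt.Println(id)
	}
}
```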

Is this use case going to be supported? If not, is there any chance that the identification engine could be modified to read only the necessary parts of the file instead of reading it completely?

Thank you.

richardlehane commented 2 years ago

Assuming you're using PRONOM signatures, the challenges for fast identification of streams are:

a) many of the byte signatures have end-of-file or wildcard patterns, which require a full read of the stream (whereas with access to a file I can do a seek to check the end of the stream);

b) container signatures require unpacking the stream as a zip or OLE file, which probably also requires a full read of the stream.

You could mitigate these issues by using the roy tool to edit your signature file. For example, `roy build -noeof -nocontainer -bof 16000` would build a signature file that only scans the first 16k bytes, drops all end-of-file patterns, and doesn't include container signatures. It would be a lot faster but may of course impact the quality of the identification!
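
If you're calling siegfried as a library, you'd then load that trimmed signature file in place of the default one. A minimal sketch (the `streaming.sig` name is a placeholder, and I'm assuming roy's optional output-filename argument):

```go
package s3sf

import (
	"io"
	"log"

	"github.com/richardlehane/siegfried"
)

// identifyStream runs a trimmed signature file over a stream.
// Build the signature file first with something like:
//   roy build -noeof -nocontainer -bof 16000 streaming.sig
func identifyStream(r io.Reader, name string) error {
	s, err := siegfried.Load("streaming.sig")
	if err != nil {
		return err
	}
	// With no EOF or container matching, and scanning capped at the
	// first 16k bytes, siegfried shouldn't need the full stream.
	ids, err := s.Identify(r, name, "")
	if err != nil {
		return err
	}
	for _, id := range ids {
		log.Println(id)
	}
	return nil
}
```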

Another option would be to use one of the other supported signature types, e.g. Tika MIME types: these have no end-of-file, wildcard, or container signatures to worry about, so they're much better suited to the streaming use case. But you won't get PRONOM IDs.

richardlehane commented 2 years ago

Revisiting this... as Diego Navarro points out, you can do range requests on S3 objects, so there's no need to treat them as plain streams. It should be possible to write a custom reader wrapping the S3 protocol. I'll attempt this as an external package.
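
A rough sketch of the idea (untested; assumes aws-sdk-go-v2, and that the identifying code can take advantage of a seekable reader):

```go
package s3reader

import (
	"context"
	"fmt"
	"io"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

// Object implements io.ReaderAt over an S3 object by issuing HTTP
// range requests, so callers can read just the byte ranges they need
// (e.g. BOF and EOF segments) rather than the whole stream.
type Object struct {
	client      *s3.Client
	bucket, key string
}

func (o *Object) ReadAt(p []byte, off int64) (int, error) {
	// Request only the bytes [off, off+len(p)-1].
	rng := fmt.Sprintf("bytes=%d-%d", off, off+int64(len(p))-1)
	out, err := o.client.GetObject(context.TODO(), &s3.GetObjectInput{
		Bucket: aws.String(o.bucket),
		Key:    aws.String(o.key),
		Range:  aws.String(rng),
	})
	if err != nil {
		return 0, err
	}
	defer out.Body.Close()
	// io.ReadFull returns io.ErrUnexpectedEOF if the object ends
	// before len(p) bytes, which satisfies ReaderAt's contract of
	// returning an error when n < len(p).
	return io.ReadFull(out.Body, p)
}

// ReadSeeker wraps the object in an io.ReadSeeker of known size,
// suitable for handing to an identifier that can seek.
func ReadSeeker(ctx context.Context, client *s3.Client, bucket, key string) (io.ReadSeeker, error) {
	head, err := client.HeadObject(ctx, &s3.HeadObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
	})
	if err != nil {
		return nil, err
	}
	// ContentLength is *int64 in recent aws-sdk-go-v2 releases.
	size := aws.ToInt64(head.ContentLength)
	return io.NewSectionReader(&Object{client: client, bucket: bucket, key: key}, 0, size), nil
}
```

Each ReadAt here is a network round trip, so a real implementation would want to buffer and coalesce reads, but it shows the shape of it.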