richardlehane / siegfried

signature-based file format identification
http://www.itforarchivists.com/siegfried
Apache License 2.0

Perform analysis on stdin? #96

Closed: pm64 closed this issue 7 years ago

pm64 commented 7 years ago

From what I can see, Siegfried is only able to analyze files on disk. Is there any feature planned that would allow analysis of bytes piped in via stdin?

richardlehane commented 7 years ago

Hi @pm64 thanks for this message.

I hadn't planned to add this but it would be a fairly straightforward addition that I'm happy to consider (the underlying API can accept streams or files - https://godoc.org/github.com/richardlehane/siegfried#Siegfried.Identify).

The reason I've never added this before is that, if you use standard PRONOM sigs, it would normally be much more efficient to let sf do the file handling. Lots of PRONOM sigs have end of file as well as beginning of file sequences, and also wildcards that can appear anywhere in a file. This means potentially lots of seeking, and if you are supplying bytes rather than a file then those bytes will all be copied and stored by sf in memory until the match is made. So if you did want to go this route, I'd suggest you'd probably also want to use the roy tool to customise a signature file that has no end of file sequences and has a fixed scan size, e.g. roy -bof 128000 -noeof. Does that make sense and fit with your use case?

The only other hurdle is that I've stupidly already used the - flag (which is traditionally used to say read from STDIN) for reading lists of files to scan from stdin. So adding this feature would also necessitate an API change (& perhaps copying the file command's use of the -f flag for reading lists of files).

pm64 commented 7 years ago

Hi @richardlehane, thank you for your thoughtful reply.

Your suggestion of excluding the EOF sequences from the signature file might help immensely in my use case, even though the file is already in RAM, depending on how I wind up streaming the bytes to stdin.

Either way, I'm pleased to learn this functionality is already supported on the API level. I know the typical use case is to read files from disk, but I think many Siegfried fans will appreciate the ability to read from stdin and the increased flexibility such a feature would provide.

richardlehane commented 7 years ago

Hi @pm64, this is now implemented in sf 1.7.0. Use sf - to scan stdin. Let me know if you hit any issues.

pm64 commented 7 years ago

@richardlehane, I'm testing 1.7.0 for my use case and so far it is working flawlessly. Can't thank you enough for this awesome update!! Will keep you posted.

richardlehane commented 7 years ago

thanks @pm64 that's great to hear