richardlehane / siegfried

signature-based file format identification
http://www.itforarchivists.com/siegfried
Apache License 2.0

Function to return reader of nested member #119

Closed: bruth closed this issue 6 years ago

bruth commented 6 years ago

Hi! I am looking to implement a function that would ideally leverage the recursive unpacking and decompression this package already does. The function signature would look something like this:

func ReadMember(path string, member string) (io.ReadCloser, error)

Where path would be the path to the source file and member would be the name of the member within the file whose byte stream will be returned in the io.ReadCloser. For now, member would be the filename returned by Siegfried that delimits paths by # when denoting nested files.

For example, given a (contrived) archive:

foo.zip
    dir/bar.zip
        baz.csv.gz

Calling ReadMember("foo.zip", "foo.zip#dir/bar.zip#baz.csv.gz#baz.csv") would return an io.ReadCloser that yields the decompressed contents of baz.csv.

The use case is to dynamically read out portions of an archive given the semantics of Siegfried.

A more general alternative would be a function that walks the members of an input, but it would need to be limited to the leaves of the hierarchy.

Do you have a suggestion on how to implement this given the components available in this package?

richardlehane commented 6 years ago

Hi Byron, thanks for the issue. I'll have a think about this. But just to clarify: do you need siegfried's file format ID functionality at all, or are you just trying to replicate some of the ancillary file walk/unpacking functionality from the command line tool? (I.e. if you know the member path ahead of time, do you also know ahead of time which formats you need to unpack?)

bruth commented 6 years ago

Thanks Richard. I should have stated this up front: yes, file format detection that relies specifically on the standards is necessary for my use case. My team and I are building a data catalog and archive for biomedical data. We are currently using Archivematica as a pipeline to prepare archive packages, and it uses PRONOM as the file format standard.

I need to look at the Roy tool more, but an unrelated question is how to add support for "unofficial" or non-registered file formats. We have genomic data files that we are cataloging such as VCF and FASTQ files. My assumption is that I can create a custom signature file that includes a detection mechanism for these formats?

richardlehane commented 6 years ago

I've had a look at this again this morning and it is definitely possible, but unfortunately I think at the moment any solution would be pretty ugly and involve a lot of copy/paste of non-exported bits of the siegfried codebase: specifically the decompress.go file within the cmd/sf package and the internal/siegreader package (which is what you'd need to get an io.Reader). I'm currently working on a new release and will look at either exporting some of this stuff so it can be used externally or creating a helper function for this use case within the top-level siegfried package.

Re. a custom signature file - yes you'd use the roy tool for this. See this wiki page for instructions.

Basically the steps are:

  1. use Ross Spencer's signature development utility to make a DROID compatible signature file;
  2. copy that file into a "custom" folder in your siegfried home e.g. ~/home/siegfried/custom/my_sig.xml;
  3. then use the "-extend" flag with roy build (e.g. roy build -name biomedical -extend my_sig.xml biomedical.sig).

You can invoke sf with custom signatures using the -sig flag. E.g. sf -sig biomedical.sig ....

bruth commented 6 years ago

Thank you. Having looked through the codebase before, I was going to start there anyway. I will trace my way back from the command entrypoint.

re: sig. Great I will try this out. I appreciate it.

richardlehane commented 6 years ago

Hi Byron - v1.7.9, released today, now exports a decompress package that you should be able to use for your purposes. I left the siegreader package internal but exposed a public Reader() method on the siegreader.Buffer type. You can already get Buffers from the main siegfried package, and with this new method you can now create io.Readers from those Buffers.

See this gist for a worked example along the lines of the ReadMember func you proposed.

bruth commented 6 years ago

@richardlehane Thank you, this looks great and I really appreciate you adding support for it. I hope to give it a try tomorrow.