richardlehane / siegfried

signature-based file format identification
http://www.itforarchivists.com/siegfried
Apache License 2.0
217 stars 30 forks source link

container matching doesn't work with directory-only paths: fmt/1196 (SIARD) #123

Closed ross-spencer closed 5 years ago

ross-spencer commented 5 years ago

Hi Richard,

As discussed I've attached what I think to be an accurate skeleton container file for fmt/1196. What do you think? (Also noted that SF might not have this capability to identify either as of yet!)

(Unzipping should provide you a SIARD zip)

fmt-1196-container-signature-id-31020.zip

cc. @Dclipsham. I can't get it to pass in DROID either, but l am reluctant to say the problem is there given the recent proximity of the DROID release. PS. David - If there were public samples that could be shared, that would be a great help!

Dclipsham commented 5 years ago

Hi Ross, My test samples for building the signature came from the SFA GitHub - https://github.com/sfa-siard/SiardCmd/tree/master/testfiles - off hand I cannot see anything 'wrong' with your file, but it isn't IDing for me either. Note that in the SFA files the '2.1' directory is empty but in yours I can see a zero-byte file - possibly related?

ross-spencer commented 5 years ago

@Dclipsham it sounds like there is a good chance of that! Thanks for giving the file a once-over. I created the new logic in a bit of a rush and writing a zip with the zero-byte file was the easiest way forward. I wasn't expecting it to impact the identification, but it could I guess? - Will try without.

richardlehane commented 5 years ago

👍 @ross-spencer - another notch on the skeleton suite's belt! I hadn't envisioned this type of signature (matching on a directory rather than file) and will need to do a bit of a re-work of my container matching. So this will likely be a bug until the next major release. 👍 @Dclipsham for that extra detail, is good to know that these directories may be empty. 👎 siard team ;), what kind of insanity is this: image

marhop commented 5 years ago

Hi guys,

I think this is a tricky issue with the skeleton file, not with Siegfried or DROID or the SIARD specification.

The SIARD ZIP provided by Ross contains just one entry:

Extracting this ZIP file creates the expected directory tree with the empty file at the bottom, yes. But still, inside the ZIP file there is only one entry with this unusual name containing slashes. That's different from the entry 'header/siardversion/2.1/' expected by the PRONOM signature. Although this directory is created when extracting the ZIP file, there is no entry of this name in the ZIP file.

Compare this to the attached ZIP file which is corretly identified as fmt/1196 (by DROID at least). It contains the following four entries, amongst them the one expected by the signature:

The differences between the two ZIP files can best be perceived when feeding them to the zipinfo tool or viewing them in a hex editor. (In the latter case, you may find this commented ZIP file example useful for orientation.)

Anyway, it should be noted that although the added file does not impede identification, it may lead to validation failure since the SIARD specification explicitly demands an empty directory. But I suppose we shouldn't worry about validation of skeleton files, so never mind. ;-)

PS: It stands to reason whether one of the two approaches to zipping nested directories is "more correct" than the other. I couldn't find anything specific in the ZIP format specification except that forward slashes are allowed in a file name (see section 4.4.17.1).

richardlehane commented 5 years ago

thanks very much for this extra detail @marhop. I tested your sample with siegfried and it fails to ID properly, so a problem there too. Also, same issue with fido from my quick check.

marhop commented 5 years ago

It does not? Hm, weird. I used DROID 6.4, signature file 94, container signature file 20180920 with 'open archives' option set to false. Does siegfried correctly identify a "real" SIARD 2.1 file like the one linked by David?

PhillipTommerholt commented 5 years ago

Hi guys,

There are many aspects in this thread that I am not sure about, so please disregard if my comment has no relevance.

First thing I notice in the ZIP is that there is no content-folder. See P_4.2-1 requirement in SIARD2.1

image

Can this help?

richardlehane commented 5 years ago

@marhop good point - I checked the SIARD 2.1 examples referenced and they don't identify either. Is clearly an assumption I built into siegfried's container matching that I'll just need to rework. fido is also failing for these too so it might have a similar issue with its container algo

@PhillipTommerholt - thanks for this extra detail. It shouldn't matter in this context that the content folder is missing as the container signature only includes that header../ path

siard team - sorry for my little dig above, 👍 for building a very useful digipres tool!

ross-spencer commented 5 years ago

Hi @marhop

PS: It stands to reason whether one of the two approaches to zipping nested directories is "more correct" than the other. I couldn't find anything specific in the ZIP format specification except that forward slashes are allowed in a file name (see section 4.4.17.1).

Yeah, I can see the difference when I unar in Linux. It wasn't something I was looking for! So thank you. I am relatively sure I can make this change using Python's ZIP library 🤞.

I'll see if I can remove the empty file too... that was just the first way the library was happy to create the nested structure in an otherwise empty zip... let's see what we can do!

Cheers, Ross

ross-spencer commented 5 years ago

This was a weird one to fix in Python.

I think that shutil was maybe meant to work given these issues (I really can't tell):

But I've modified the code to use ZipFile now and it looks good. Directory entries are being created as we would like:

zipinfo *
Archive:  fmt-1196-container-signature-id-31020.siard
Zip file size: 352 bytes, number of entries: 3
drwxr-xr-x  2.0 unx        0 b- stor 18-Oct-06 14:12 header/
drwxr-xr-x  2.0 unx        0 b- stor 18-Oct-06 14:12 header/siardversion/
drwxr-xr-x  2.0 unx        0 b- stor 18-Oct-06 14:12 header/siardversion/2.1/
3 files, 0 bytes uncompressed, 0 bytes compressed:  0.0%

The branch is here, but I might do some proper maintenance on it before I commit to master: https://github.com/exponential-decay/skeleton-container-test-suite-generator/tree/dev/fix-zip-generation

And the resultant SIARD skeleton file for reference (zipped):

fmt-1196-container-signature-id-31020.zip