richardlehane / siegfried

signature-based file format identification
http://www.itforarchivists.com/siegfried
Apache License 2.0

PPT identified as XLS through SF in latest sig/build #52

Closed ross-spencer closed 9 years ago

ross-spencer commented 9 years ago

N.B. I haven't had time to investigate this myself tonight, otherwise I would have taken a look. In my comparison of SF and DROID outputs, I found that this file:

https://github.com/ross-spencer/opf-format-corpus/blob/master/format-corpus/office-examples/powerpoint4-mac/unc-%20oxford.ppt

is identified by DROID (v82 signature file) as x-fmt/88 (PPT 4) through signature identification.

In SF:


siegfried : 1.3.0
scandate : 2015-08-27T20:27:21+12:00
signature : pronom.sig
created : 2015-08-27T19:32:17+12:00
identifiers :

richardlehane commented 9 years ago

Hmmm not sure what to do here...

The problem is that SF finds the fmt/59 byte signature first and doesn't bother looking further, as there isn't a superior byte match. A fundamental thing about the way SF works is that it tries to stop as soon as it can by doing priority testing as it goes; only if it can't stop early does it scan the whole file. DROID does priority testing at the end, so it will have noted both the Excel and PPT byte sigs, but prefers PPT because of the extension match.
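To make the early-stop behaviour concrete, here is a minimal sketch in Python. It is not siegfried's actual matcher: the function names, the shape of the priority table, and the order in which fmt/39 and x-fmt/88 are hit are all assumptions made for illustration. Only the fact that fmt/59 is found first comes from the discussion above.

```python
from typing import Dict, Iterable, Iterator, List, Set

# Hypothetical priority table: each key outranks every format in its set.
# Nothing here outranks fmt/59, which is why the scan can stop on it.
PRIORITY_OVER: Dict[str, Set[str]] = {
    "x-fmt/88": {"OLE2 Compound Document Format"},
}


def superiors(puid: str, priority_over: Dict[str, Set[str]]) -> Set[str]:
    """Return the formats whose signatures outrank `puid`."""
    return {winner for winner, losers in priority_over.items() if puid in losers}


def scan(hit_order: List[str], log: List[str]) -> Iterator[str]:
    """Yield byte-signature hits lazily, recording each one actually tested."""
    for puid in hit_order:
        log.append(puid)
        yield puid


def early_stop(hits: Iterable[str], priority_over: Dict[str, Set[str]]) -> List[str]:
    """Priority-test as we go: return the first hit that nothing could outrank."""
    pending: List[str] = []
    for puid in hits:
        if not superiors(puid, priority_over):
            return [puid]  # no superior signature exists, so stop scanning
        pending.append(puid)
    return pending  # could not stop early: the whole input was scanned


# Assumed hit order for the file in this issue (fmt/59 first, per the thread).
tested: List[str] = []
result = early_stop(scan(["fmt/59", "fmt/39", "x-fmt/88"], tested), PRIORITY_OVER)
print(result, tested)
```

Because the hits are produced lazily, the later fmt/39 and x-fmt/88 signatures are never even tested once fmt/59 is accepted.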

The crazy thing is if you remove the fmt/59 signature (roy build -exclude fmt/59) you don't get the ppt match ... you get a fmt/39 MS Word match! It matches that signature too! This is some kind of chameleon file - it doesn't open for me but I'd be really interested to see what it actually contains.

If you do roy build -exclude fmt/59,fmt/39 you'll get the ppt match.

The fix may be to remove the fmt/59 and fmt/39 byte signatures altogether: both of these formats have container signatures as well, so the byte signatures may be redundant, and presumably the container sigs are more reliable?

richardlehane commented 9 years ago

Unpacked the file as a compound object, and yes, it seems to contain a bunch of other Microsoft Office objects: [image: capn]

Given that DROID is ultimately only beating SF because of an extension match (!), suggest best way forward is to fix this with changes to signatures. Adding a container signature for x-fmt/88 would be the cleanest fix.

It might also be worth suggesting the removal of all byte signatures from PRONOM where there are container signatures. This would speed up identification and prevent this kind of mistake.

richardlehane commented 9 years ago

Another useful data point: if you turn priority filtering off, you see that four byte signatures match this file, including multiple matches for MS Word (all those extra basis blocks):

[image: capture]

Suggest the best fix for siegfried will be to add a -nodouble flag to roy that will prevent doubling up of container and byte signatures (i.e. if there is both a byte and a container signature for any particular format, exclude the redundant byte signature from the bytematcher).
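The filtering that such a flag implies can be sketched as follows. This is an illustration in Python rather than roy's own code: the function name, the `nodouble` parameter, and the placeholder signature strings are hypothetical. The only facts taken from this thread are that fmt/59 and fmt/39 have both byte and container signatures, while x-fmt/88 has a byte signature only.

```python
from typing import Dict, List, Set


def select_byte_signatures(
    byte_sigs: Dict[str, List[str]],
    container_puids: Set[str],
    nodouble: bool = True,
) -> Dict[str, List[str]]:
    """Pick the byte signatures to load into the byte matcher.

    With nodouble set, any format that also has a container signature
    has its (redundant) byte signatures left out.
    """
    if not nodouble:
        return dict(byte_sigs)
    return {
        puid: sigs
        for puid, sigs in byte_sigs.items()
        if puid not in container_puids
    }


# Placeholder signature bodies: the real byte sequences are not reproduced here.
BYTE_SIGS = {
    "fmt/59": ["<excel byte sig>"],
    "fmt/39": ["<word byte sig 1>", "<word byte sig 2>"],
    "x-fmt/88": ["<ppt4 byte sig>"],
}
CONTAINER_PUIDS = {"fmt/59", "fmt/39"}

selected = select_byte_signatures(BYTE_SIGS, CONTAINER_PUIDS)
print(sorted(selected))
```

With the Excel and Word byte signatures excluded, only the x-fmt/88 byte signature is left to match this file, which is the same outcome as excluding fmt/59 and fmt/39 by hand.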

Ultimately it would also be good to get a PPT 4 container signature in PRONOM as this is a more reliable way to match this kind of file.

ross-spencer commented 9 years ago

I agree about the PPT container signature - that would be the best position. And I agree about removing the traditional byte sequence too if there is a container signature, though that does mean there isn't a fallback position for the file; that's not necessarily a problem. (For a short while back in the day, keeping the byte sequences was partly to support DROID 3 and 4, which were still in use - I do still use 3/4 to bug-fix the skeleton suite.)

No-double in SF sounds like a good idea too.

I'm just wondering if there is something that isn't quite working, either in DROID or SF. DROID doesn't use extensions as a way of firming up an ID, so it wouldn't match both Excel and PPT and then settle on PPT because of the extension.

The priorities are on the first page of the PRONOM report (http://www.nationalarchives.gov.uk/PRONOM/x-fmt/88):

Has priority over Microsoft Powerpoint Presentation (97-2003)
Has priority over OLE2 Compound Document Format
Is previous version of Microsoft Powerpoint Presentation (95)

So I'm wondering if DROID isn't returning a multiple ID where it should (for this chameleon object!). Unless there's a byteseek heuristic I'm not aware of, and that might be one to check on the droid-list.
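The reasoning about a multiple ID can be checked with a small sketch of priority filtering done at the end of a scan. This is a Python illustration only, not DROID's implementation; the function name is hypothetical and formats are keyed by whatever label the thread uses for them (PUID or PRONOM name). The priority entries are the two "has priority over" relations listed for x-fmt/88 above.

```python
from typing import Dict, List, Set

# The two "has priority over" relations listed on the x-fmt/88 PRONOM page.
PRIORITY_OVER: Dict[str, Set[str]] = {
    "x-fmt/88": {
        "Microsoft Powerpoint Presentation (97-2003)",
        "OLE2 Compound Document Format",
    },
}


def apply_priorities(matches: List[str], priority_over: Dict[str, Set[str]]) -> List[str]:
    """Drop every match that is outranked by another match in the same set."""
    matched = set(matches)
    outranked: Set[str] = set()
    for winner in matched:
        outranked |= priority_over.get(winner, set()) & matched
    # dict.fromkeys de-duplicates repeated matches while keeping first-seen order
    return [m for m in dict.fromkeys(matches) if m not in outranked]


# Byte-signature matches reported for this file (Word matched more than once).
matches = ["fmt/59", "fmt/39", "fmt/39", "x-fmt/88"]
survivors = apply_priorities(matches, PRIORITY_OVER)
print(survivors)
```

Since neither listed relation covers fmt/59 or fmt/39, all three formats survive the filter, which is the multiple ID one would expect if priorities alone were deciding the result.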

richardlehane commented 9 years ago

OK, that is weird: all those matches SF is getting definitely are in the file (you can check in a hex editor), so I assumed DROID must just give a little extra weight to PPT because of its additional extension match. But if that isn't it, I'm not sure what it is doing - unless it skips byte sigs where matching container sigs have already been checked?

But I will call this "fixed" when I add the -nodouble flag (which I may make a default)!

richardlehane commented 9 years ago

Hi Ross, I've updated roy so it no longer doubles up on byte and container signatures (unless a -doubleup flag is given, in which case it will). This "fixes" this particular misidentification but, as discussed, the best fix would be to get a new container signature added.