neilharvey / FileSignatures

A small library for detecting the type of a file based on header signature (also known as magic number).
MIT License

"application/vnd.ms-powerpoint" type falsely detected for some msg files #7

Closed Michel20367 closed 6 years ago

Michel20367 commented 6 years ago

If a .msg file is saved from Outlook via drag and drop, the Root Entry (0x52, 0x00, 0x6F, 0x00, 0x6F, 0x00, 0x74, 0x00, 0x20, 0x00, 0x45, 0x00, 0x6E, 0x00, 0x74, 0x00, 0x72, 0x00, 0x79, 0x00) is at offset 0x400 rather than the usual 0x200. At offset 0x200 there is instead the sequence (0xFD, 0xFF, 0xFF, 0xFF), which you use to detect "application/vnd.ms-powerpoint" ("ppt") files. Consequently, such .msg files (see the example in the attachment) are incorrectly detected as "ppt" files.
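
To make the offsets concrete, here is a minimal Python sketch (not the library's code) that decodes the byte sequence quoted above and looks for it at sector-aligned offsets. The helper name `find_root_entry_offset` and the 512-byte step are assumptions made for illustration only.

```python
from typing import Optional

# The UTF-16LE bytes quoted above; they decode to the name "Root Entry".
ROOT_ENTRY = bytes([
    0x52, 0x00, 0x6F, 0x00, 0x6F, 0x00, 0x74, 0x00, 0x20, 0x00,
    0x45, 0x00, 0x6E, 0x00, 0x74, 0x00, 0x72, 0x00, 0x79, 0x00,
])


def find_root_entry_offset(data: bytes, step: int = 0x200) -> Optional[int]:
    """Return the first multiple of `step` at which "Root Entry" starts.

    Hypothetical helper: it assumes 512-byte sectors, so the candidate
    offsets are 0x200, 0x400, 0x600, and so on.
    """
    for offset in range(step, len(data), step):
        if data.startswith(ROOT_ENTRY, offset):
            return offset
    return None
```

For a file like the one described here, this would report 0x400, while a check that only inspects offset 0x200 would see the FD FF FF FF bytes instead.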

Best Regards

Michael Ioshchikhes

Externe Telefonate.zip

neilharvey commented 6 years ago

Hey, thanks for the detailed bug report. I'll do some investigation and implement a fix.

Michel20367 commented 6 years ago

Hi, I have looked a bit further into the PowerPoint file. At offset 0x480 it has the byte sequence "PowerPoint Document" (0x50,0x00,0x6f,0x00,0x77,0x00,0x65,0x00,0x72,0x00,0x50,0x00,0x6f,0x00,0x69,0x00,0x6e,0x00,0x74,0x00,0x20,0x00,0x44,0x00,0x6f,0x00,0x63,0x00,0x75,0x00,0x6d,0x00,0x65,0x00,0x6e,0x00,0x74,0x00). At offset 0x200 the sequence is "ýÿÿÿ§" (0xFD, 0xFF, 0xFF, 0xFF, 0xFF, 0xA7), but I'm not sure whether it is present in all files.

There is another variant of the ppt file: if it was created in one of the old Office versions, it has the sequence ".n.ð" (0x00, 0x6E, 0x1E, 0xF0) at offset 0x200. An example can be found in the attachment.

pptx-test_comp.zip

Good link: https://www.garykessler.net/library/file_sigs.html

I hope this helps.

neilharvey commented 6 years ago

It looks as though there are a few variations for the PowerPoint subheader, which are:

00 6E 1E F0
0F 00 E8 03
A0 46 1D F0
FD FF FF FF 0E 00 00 00
FD FF FF FF 1C 00 00 00
FD FF FF FF 43 00 00 00

I'm catching the last three with FD FF FF FF; I assume the first three are from an older version of PowerPoint. Most likely I'll change the PowerpointLegacy class to catch all the different signatures if I can't determine the individual formats.
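
As a rough illustration of "catching all the different signatures", the variants listed above could be checked against a table. This is a Python sketch rather than the PowerpointLegacy class itself; the names `PPT_SUBHEADERS` and `matches_ppt_subheader` are made up for the example, and the 0x200 offset is the one discussed earlier in this thread.

```python
# Subheader variants from the list above: three 4-byte forms and three
# 8-byte forms that start with FD FF FF FF.
PPT_SUBHEADERS = (
    bytes.fromhex("006E1EF0"),
    bytes.fromhex("0F00E803"),
    bytes.fromhex("A0461DF0"),
    bytes.fromhex("FDFFFFFF0E000000"),
    bytes.fromhex("FDFFFFFF1C000000"),
    bytes.fromhex("FDFFFFFF43000000"),
)


def matches_ppt_subheader(data: bytes, offset: int = 0x200) -> bool:
    """Return True if any listed variant starts at `offset` (hypothetical helper)."""
    return any(data.startswith(sig, offset) for sig in PPT_SUBHEADERS)
```

Matching the full eight bytes is stricter than matching FD FF FF FF alone, but it remains a fixed-offset heuristic.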

neilharvey commented 6 years ago

I've implemented parsing of the Compound File Binary (CFB) format (or at least, the header part), which should solve this issue. The root entry can be located at different positions in the file, and we can determine its position by reading the CFB header and checking the first directory sector location. Once we have the root entry, we can read its object type CLSID, which allows us to determine the type of file.

For some reason, when saving a message from Outlook the first directory sector location is set to 1, but when using drag-and-drop it is set to 2. No idea why Outlook saves the files like that, but checking the CFB header allows for both cases to be handled :)
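
For readers who want to see the idea in code, here is a minimal Python sketch of the approach described above. It is not the library's implementation: the function name `read_root_clsid` is hypothetical, and the field offsets (sector shift at byte 30, first directory sector location at byte 48, CLSID at byte 80 of the 128-byte directory entry) are taken from the published CFB specification rather than from this thread.

```python
import struct
import uuid

# Signature at the start of every Compound File Binary file.
CFB_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def read_root_clsid(data: bytes) -> uuid.UUID:
    """Find the root directory entry via the CFB header and return its CLSID."""
    if not data.startswith(CFB_MAGIC):
        raise ValueError("not a Compound File Binary file")

    # Sector shift (uint16 at offset 30): sector size is 2**shift, often 512.
    (sector_shift,) = struct.unpack_from("<H", data, 30)
    # First directory sector location (uint32 at offset 48).
    (first_dir_sector,) = struct.unpack_from("<I", data, 48)

    # Sector 0 starts one sector-sized slot into the file (after the header),
    # so sector n starts at (n + 1) * sector_size.
    entry_offset = (first_dir_sector + 1) << sector_shift

    # The root entry is the first 128-byte entry in the directory sector.
    # Its name is stored as null-terminated UTF-16LE in the first 64 bytes.
    raw_name = data[entry_offset:entry_offset + 64]
    name = raw_name.decode("utf-16-le", errors="replace").split("\x00", 1)[0]
    if name != "Root Entry":
        raise ValueError(f"unexpected first directory entry: {name!r}")

    # The CLSID occupies 16 bytes at offset 80 of the entry (GUID byte order).
    return uuid.UUID(bytes_le=data[entry_offset + 80:entry_offset + 96])
```

Because the offset is derived from the header, the same code works wherever the first directory sector happens to be. The returned CLSID can then be compared against known values per file type; those values are not listed in this thread.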

I've pushed a prerelease to NuGet, have a look and let me know if it works for you.

Michel20367 commented 6 years ago

Thank you! I will check it this week.

Michel20367 commented 6 years ago

I tested the 2.0 rc with my sample files. Everything is recognized correctly. Good work!

neilharvey commented 6 years ago

Thanks for checking! I'll publish the release version over the weekend.