neilharvey / FileSignatures

A small library for detecting the type of a file based on header signature (also known as magic number).
MIT License

PowerPoint Presentation (97-2003 Presentation format) is not recognized #9

Closed mstavrev closed 5 years ago

mstavrev commented 5 years ago

I am attaching a PPT PowerPoint presentation file that for some reason is not identified by the library.

testPPT_mit.zip

It is a simple presentation, created to test the functionality of the library. I noticed that a presentation with just two slides is identified correctly.

neilharvey commented 5 years ago

It's failing because the code is looking for FD FF FF FF as the subheader (at position 0x200), whereas your sample has 0F 00 E8 03. There appear to be a few possible subheaders for PowerPoint files.

One option is to change the class to look for all the possible identifiers and return a match if any of them is found. I'm also looking into a more thorough implementation of the Compound Binary File format to retrieve the CLSID, which would fix #7 and will hopefully be a better long-term solution.
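For illustration, the "match any identifier" option could be sketched roughly as follows. This is Python rather than the library's own code, the names `PPT_SUBHEADERS` and `has_ppt_subheader` are hypothetical, and only the two byte sequences quoted in this thread are listed; a real check would need the full set of known PowerPoint subheaders.

```python
# Offset of the subheader discussed above (0x200 = 512 bytes into the file).
SUBHEADER_OFFSET = 0x200

# Only the two sequences mentioned in this thread; assumed to be incomplete.
PPT_SUBHEADERS = (
    bytes.fromhex("FDFFFFFF"),
    bytes.fromhex("0F00E803"),
)


def has_ppt_subheader(data: bytes) -> bool:
    """Return True if any known subheader appears at SUBHEADER_OFFSET."""
    window = data[SUBHEADER_OFFSET:SUBHEADER_OFFSET + 4]
    return window in PPT_SUBHEADERS
```

A file shorter than 0x204 bytes yields a short or empty window and is simply reported as no match.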

mstavrev commented 5 years ago

Thanks for the update.

Looks like you've listed (on #7) all of the PPT byte sequences that I can also find on this list: https://www.garykessler.net/library/file_sigs.html

It would be best to first detect the MS-CFB format (the header is at offset 0, and it is long and distinctive enough that a false positive is unlikely), then look for the CLSID of PowerPoint, which, if I understand correctly, should be {64818d10-4f9b-11cf-86ea-00aa00b929e8}. I followed the information available here http://fileformats.archiveteam.org/wiki/Microsoft_Compound_File and I did locate this CLSID, at least in my file.
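A minimal sketch of that two-step idea, in Python and not taken from the library: check the compound file signature at offset 0, then read the CLSID of the root directory entry. The function names are hypothetical, and the sketch assumes the root entry is the first entry of the first directory sector and that the header offsets used (sector shift at byte 30, first directory sector at byte 48, CLSID at byte 80 of a directory entry) match the MS-CFB layout.

```python
import struct
import uuid

# Compound file signature at offset 0.
CFB_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# CLSID quoted in this thread for PowerPoint 97-2003 presentations.
PPT_CLSID = uuid.UUID("64818d10-4f9b-11cf-86ea-00aa00b929e8")


def root_clsid(data: bytes):
    """Return the root storage CLSID of a compound file, or None."""
    if len(data) < 512 or data[:8] != CFB_MAGIC:
        return None
    (sector_shift,) = struct.unpack_from("<H", data, 30)
    (first_dir_sector,) = struct.unpack_from("<I", data, 48)
    if sector_shift not in (9, 12):  # 512-byte or 4096-byte sectors
        return None
    sector_size = 1 << sector_shift
    # The header occupies the first sector, so sector n starts at (n + 1) * size.
    entry = (first_dir_sector + 1) * sector_size
    raw = data[entry + 80:entry + 96]
    if len(raw) != 16:
        return None
    # GUIDs are stored with the first three fields little-endian.
    return uuid.UUID(bytes_le=raw)


def looks_like_ppt(data: bytes) -> bool:
    """True if the root CLSID equals the PowerPoint CLSID from this thread."""
    return root_clsid(data) == PPT_CLSID
```

The sketch does not follow the FAT chain, so it only inspects the first directory sector; that is enough for the root entry under the stated assumption.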

If you follow the link for the PPT on that wiki, you can also get a collection of old PPT files that can be used for additional testing: https://web.archive.org/web/20020313074855/http://ftp.sunet.se/pub/Internet-documents/isoc/charts/presentations/

Cheers

neilharvey commented 5 years ago

Hey, I've implemented the CFB format and rewritten all the legacy Office types to use it instead of the subheader check. It now correctly identifies your attached sample.

I've published a prerelease version to NuGet; if all looks good, I'll push a final release in the next day or so.

mstavrev commented 5 years ago

Thanks for the update. I've updated to 2.0.0-rc and can now see the library correctly identifying the problematic file. I also did a few quick tests with Excel and Word documents saved in the old format, and they work as expected too.