sindresorhus / file-type

Detect the file type of a file, stream, or data
MIT License
3.73k stars 354 forks source link

file-type treat libreoffice saved .pptx as .zip #119

Closed notedit closed 6 years ago

notedit commented 6 years ago

i just check that, it is easy to reproduct

sindresorhus commented 6 years ago

Can you submit a PR with a failing test? Otherwise, it's not much we can do about it.

reviewher commented 6 years ago

The msooxml magic is broken because the regex matches are not pinned to the initial offset. The second rule is:

# make sure the first file is correct
>0x1E       regex       \\[Content_Types\\]\\.xml|_rels/\\.rels

The comment is incorrect. The entire file, starting from the 0x1E offset, is searched for that string.

@sindresorhus In a related issue I included an example that shows how the original magic check is broken.

The "correct" way to do the zip checks is to jump to the central directory and look at the file names.

ping @thejoshwolfe -- any thoughts on the most efficient way to scan a zip file's entry names?

thejoshwolfe commented 6 years ago

from the readme:

It only needs the first 4100 bytes.

This is not adequate for correctly detecting a zip file, because the starting point for the zip file format is near the end of the file, not the beginning.

Here's some discussion from https://en.wikipedia.org/wiki/Zip_(file_format) :

A directory is placed at the end of a .ZIP file. This identifies what files are in the .ZIP and identifies where in the .ZIP that file is located. This allows .ZIP readers to load the list of files without reading the entire .ZIP archive. .ZIP archives can also include extra data that is not related to the .ZIP archive. This allows for a .ZIP archive to be made into a self-extracting archive (application that decompresses its contained data), by prepending the program code to a .ZIP archive and marking the file as executable. Storing the catalog at the end also makes possible hiding a zipped file by appending it to an innocuous file, such as a GIF image file.

So in the case of a self-extracting zip, there are actually two correct answers for "what type of file is this?": both "exe" and "zip".

You may be able to get away with reading near the beginning of the file for file types with a more strictly constrained file type that is based on .zip, but i don't know of any file types that do this. I expect most of them just say "it's zip with files that match these patterns.", e.g. JAR.

If you want to correctly identify if something is a zipfile, here's what you do (reference https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT ):

  1. Scan backwards from offset file_size - 22 to file_size - 0xffff - 22 for the magic number [0x50, 0x4b, 0x05, 0x06]. You gotta scan instead of checking a static offset because there might be a zip file comment after the signature. GitHub project downloads do this, for example (Clone or download -> Download ZIP).

Then if you want to correctly detect if files of a certain name are in the archive (as opposed to just being mentioned in file contents, i.e. in the source code for this project), then decide if you want to support ZIP64 format. I suppose you don't since the api for this project reads from a ram buffer. So with no ZIP64 support:

  1. Read the 32-bit LE integer called "Offset of start of central directory, relative to start of archive" located +16 bytes ahead of the signature found in step 1.
  2. Read the 16-bit LE integer called "Total number of central directory records on this disk" located +10 bytes ahead of the signature found in step 1.
  3. Seek to the offset found in step 2:
  4. If the number of entries you've read is equal to the total number found in step 3, stop.
  5. Read the three 16-bit LE sizes located at +28, +30, and +32 bytes ahead of your cursor. the first is the file name size.
  6. Read the file name starting +46 bytes ahead of your cursor and lasting a number of bytes determined by the file name size found in step 6.
  7. Seek ahead +46 bytes + the three sizes found in step 6.
  8. Goto step 5.

Then to be even more correct, you can check the general purpose bit flag for bit 11, which indicates UTF-8 vs CP437 encoding for the file name. You probably don't need that though, since the magic file names you'll be looking for are probably going to be the same in those two encodings.

Then to be more correct, you may want to check for the Info-ZIP Unicode Path Extra Field which can override the file name, for some reason.

Then maybe you might want to convert \ to / in the file names. I'm currently trying to figure out when this is appropriate over in my software project, so I can't give definitive advice on this one. https://github.com/thejoshwolfe/yauzl/issues/65 https://github.com/thejoshwolfe/yauzl/issues/66

In summary: if a file format is "just zip plus ...", then it's actually an extremely complicated and convoluted format that takes a nearly full-featured parser to detect correctly. Give me any quick-and-dirty detector, and i'll give you an example file that fools it. (Looks like you've already got an example for this issue, but I can provide more if you like.)

You might consider using https://github.com/thejoshwolfe/yauzl (i.g. yauzl.fromBuffer) to parse the file, but this is incompatible with your current api in two ways: 1) it requires the entire file, or at least random access to the entire file, not just the beginning of the file. 2) yauzl's api is asynchronous, and your api is synchronous.

I don't have an easy solution for you. The approach started in this project is fundamentally incapable of correctly detecting zip-based formats. You could change your api, or you could accept that there will be unsolvable bugs like this issue.