h2non / filetype

Fast, dependency-free Go package to infer binary file types based on the magic numbers header signature
https://pkg.go.dev/github.com/h2non/filetype?tab=doc
MIT License

Full file needed for Documents #107

Open onetwopunch opened 2 years ago

onetwopunch commented 2 years ago

The README specifically states:

Only first 262 bytes representing the max file header is required, so you can just pass a slice

I've tried this out and it works fine for all files except MS Office documents such as docx and xlsx. Given only the first 262 bytes, these files are reported as application/zip; given the full file, via either MatchFile or MatchReader, they are detected correctly.

In fact, each file type seems to have a different minimum buffer length for filetype to report accurately. docx seems to require a minimum of only 1750 bytes, while .xlsm requires a minimum of 1855 bytes. For each of these files, a buffer shorter than that minimum is inaccurately reported as application/zip. For my application, accurate detection is very important.

For now I'll have to do the work of determining the minimum buffer size at which MSO files are reported accurately, but if you know this already, please update the docs, or at least add a caveat around the 262 number.

astrocox commented 2 years ago

I'm having a similar issue: I have a valid .xlsx file that is not correctly identified from the first 262 bytes. The first header check succeeds, but the second fails because the second P K 0x03 0x04 signature doesn't show up until byte 996 in my file, which is well past the end of my 262-byte slice. It would be nice to have more documentation on the minimum and maximum number of bytes required to identify each defined type.
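To check where the P K 0x03 0x04 signatures fall in a given file, a raw byte scan is enough. This is a minimal sketch using only the standard library: `signatureOffsets` is a hypothetical helper, and the buffer is synthetic, built to mirror the offsets reported above (a first signature at 0 and a second at 996). It does not reproduce the library's own matching logic.

```go
package main

import (
	"bytes"
	"fmt"
)

// localHeaderSig is the ZIP local file header signature: 'P' 'K' 0x03 0x04.
var localHeaderSig = []byte{'P', 'K', 0x03, 0x04}

// signatureOffsets returns the byte offset of every non-overlapping
// occurrence of sig in buf. sig must not be empty.
func signatureOffsets(buf, sig []byte) []int {
	var offsets []int
	for start := 0; start <= len(buf)-len(sig); {
		i := bytes.Index(buf[start:], sig)
		if i < 0 {
			break
		}
		offsets = append(offsets, start+i)
		start += i + len(sig)
	}
	return offsets
}

func main() {
	// Synthetic buffer: signatures at offsets 0 and 996, zeros elsewhere.
	buf := make([]byte, 1200)
	copy(buf[0:], localHeaderSig)
	copy(buf[996:], localHeaderSig)

	fmt.Println(signatureOffsets(buf, localHeaderSig))       // whole buffer
	fmt.Println(signatureOffsets(buf[:262], localHeaderSig)) // first 262 bytes only
}
```

With only the first 262 bytes, the second signature is simply not in the slice, so any check that depends on it cannot succeed. Note that a raw scan can also hit the same four bytes inside compressed data, so on real files the offsets are a hint rather than a parse of the archive.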

astrocox commented 2 years ago

This is probably a duplicate of #83

onetwopunch commented 2 years ago

This isn't really a duplicate: the 262-byte header size only applies to non-Microsoft files, so the docs are actually wrong. I've since determined that the only way to accurately detect the content type of MSO files is to read in the entire buffer or hard-code a map based on file extension. In case anyone else wants it, this may save you some time:

import (
    "path/filepath"
    "strings"

    "github.com/h2non/filetype"
)

// MicrosoftExtMap maps lowercase MS Office file extensions to MIME types.
var MicrosoftExtMap = map[string]string{
    ".doc":  "application/msword",
    ".dot":  "application/msword",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".dotx": "application/vnd.openxmlformats-officedocument.wordprocessingml.template",
    ".docm": "application/vnd.ms-word.document.macroEnabled.12",
    ".dotm": "application/vnd.ms-word.template.macroEnabled.12",
    ".xls":  "application/vnd.ms-excel",
    ".xlt":  "application/vnd.ms-excel",
    ".xla":  "application/vnd.ms-excel",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".xltx": "application/vnd.openxmlformats-officedocument.spreadsheetml.template",
    ".xlsm": "application/vnd.ms-excel.sheet.macroEnabled.12",
    ".xltm": "application/vnd.ms-excel.template.macroEnabled.12",
    ".xlam": "application/vnd.ms-excel.addin.macroEnabled.12",
    ".xlsb": "application/vnd.ms-excel.sheet.binary.macroEnabled.12",
    ".ppt":  "application/vnd.ms-powerpoint",
    ".pot":  "application/vnd.ms-powerpoint",
    ".pps":  "application/vnd.ms-powerpoint",
    ".ppa":  "application/vnd.ms-powerpoint",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    ".potx": "application/vnd.openxmlformats-officedocument.presentationml.template",
    ".ppsx": "application/vnd.openxmlformats-officedocument.presentationml.slideshow",
    ".ppam": "application/vnd.ms-powerpoint.addin.macroEnabled.12",
    ".pptm": "application/vnd.ms-powerpoint.presentation.macroEnabled.12",
    ".potm": "application/vnd.ms-powerpoint.template.macroEnabled.12",
    ".ppsm": "application/vnd.ms-powerpoint.slideshow.macroEnabled.12",
}

// MicrosoftContentType looks up the MIME type by file extension.
// The extension is lowercased first so that "REPORT.DOCX" also matches.
func MicrosoftContentType(filename string) (string, bool) {
    ext := strings.ToLower(filepath.Ext(filename))
    if contentType, ok := MicrosoftExtMap[ext]; ok {
        return contentType, true
    }
    return "", false
}

// ContentType prefers the extension map for MS Office files and falls back
// to magic-number detection on the header bytes for everything else.
func ContentType(filename string, header []byte) string {
    if contentType, ok := MicrosoftContentType(filename); ok {
        return contentType
    }
    // The error is ignored; an unmatched buffer yields an empty MIME value.
    kind, _ := filetype.Match(header)
    return kind.MIME.Value
}
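
For the other option mentioned above, reading the entire file, one possible approach that does not trust the extension is to open the data as a ZIP archive and look at the entry names. This is only a sketch using the standard library's archive/zip: `ooxmlKind` and `buildZip` are hypothetical helpers, and the assumption that Word, Excel, and PowerPoint packages keep their parts under word/, xl/, and ppt/ is a common convention rather than something this thread confirms.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"strings"
)

// ooxmlKind opens data as a complete ZIP archive and guesses the Office
// family from the entry names. It returns "word", "excel", "powerpoint",
// or "" when the archive contains none of the expected directories.
func ooxmlKind(data []byte) (string, error) {
	zr, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
	if err != nil {
		return "", err // not a readable ZIP archive (or truncated)
	}
	for _, f := range zr.File {
		switch {
		case strings.HasPrefix(f.Name, "word/"):
			return "word", nil
		case strings.HasPrefix(f.Name, "xl/"):
			return "excel", nil
		case strings.HasPrefix(f.Name, "ppt/"):
			return "powerpoint", nil
		}
	}
	return "", nil
}

// buildZip creates a small in-memory archive with the given entry names.
// It exists only so the demo runs without a real document on disk.
func buildZip(names ...string) ([]byte, error) {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	for _, name := range names {
		w, err := zw.Create(name)
		if err != nil {
			return nil, err
		}
		if _, err := w.Write([]byte("x")); err != nil {
			return nil, err
		}
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	data, err := buildZip("[Content_Types].xml", "word/document.xml")
	if err != nil {
		panic(err)
	}
	kind, err := ooxmlKind(data)
	fmt.Println(kind, err)
}
```

This needs the whole file, because the ZIP directory sits at the end of the archive, so it does not help when only a header slice is available. It also only identifies the family: telling .docx from .docm would likely require reading [Content_Types].xml, and the legacy .doc, .xls, and .ppt formats are not ZIP archives at all, so they would return an error here.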