h2non / filetype.py

Small, dependency-free, fast Python package to infer binary file types checking the magic numbers signature
https://h2non.github.io/filetype.py
MIT License

Docx detected as Zip due to trash files #171

Open lucasgadams opened 7 months ago

lucasgadams commented 7 months ago

A few specific files that are valid docx files are being detected as zip. I looked into it, and the current code checks for a matching mime type identifier at the beginning of the buffer, which means it only inspects the first entry in the zip archive. However, as recently pointed out in the magic library (here), it is possible and valid to have trash entries/bytes anywhere in the zip archive, including as the first entry. The fix, as noted in that link, is to skip over these trash bytes. Could we get that fix ported to this library?
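To make the request concrete, here is a rough sketch of what "skipping over trash" could look like. It is not filetype.py's actual matcher and not the exact logic of the magic fix. The names `iter_zip_entry_names`, `looks_like_docx` and `build_zip` are made up for illustration, and treating any entry under `word/` as the docx marker is a simplifying assumption. The idea is to walk every ZIP local file header found in the buffer instead of only the first one:

```python
import io
import struct
import zipfile

# Signature that starts every ZIP local file header.
LOCAL_HEADER_SIG = b"PK\x03\x04"


def iter_zip_entry_names(buf, max_entries=16):
    """Yield entry names from the local file headers found in buf.

    Hypothetical helper: scans for the header signature rather than
    following compressed sizes, so it also works on a truncated buffer.
    """
    pos = buf.find(LOCAL_HEADER_SIG)
    count = 0
    while pos != -1 and count < max_entries:
        if pos + 30 > len(buf):
            break  # fixed-size part of the header is cut off
        # The file name length is a little-endian uint16 at offset 26.
        (name_len,) = struct.unpack_from("<H", buf, pos + 26)
        name = buf[pos + 30 : pos + 30 + name_len]
        yield name.decode("utf-8", errors="replace")
        count += 1
        # Jump to the next header, skipping whatever lies in between.
        pos = buf.find(LOCAL_HEADER_SIG, pos + 30 + name_len)


def looks_like_docx(buf):
    """Assumption: any entry under 'word/' marks the archive as docx."""
    return any(name.startswith("word/") for name in iter_zip_entry_names(buf))


def build_zip(names):
    """Demo only: build an in-memory zip with one tiny entry per name."""
    out = io.BytesIO()
    with zipfile.ZipFile(out, "w") as zf:
        for name in names:
            zf.writestr(name, b"x")
    return out.getvalue()


with_trash = build_zip(["trash.bin", "[Content_Types].xml", "word/document.xml"])
plain_zip = build_zip(["a.txt", "b.txt"])
print(looks_like_docx(with_trash), looks_like_docx(plain_zip))
```

A signature scan like this can be fooled by entry data that happens to contain the same four bytes, so a real fix would probably want to be stricter; `max_entries` is only there to bound the work per buffer.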

lucasgadams commented 7 months ago

This was specifically fixed in the Linux file command last year in this commit.