h2non / filetype.py

Small, dependency-free, fast Python package to infer binary file types checking the magic numbers signature
https://h2non.github.io/filetype.py
MIT License

Incorrect handling of CR2 files #69

Open kostrub opened 4 years ago

kostrub commented 4 years ago

Hello. I have a problem when trying to process CR2 files: filetype recognizes them as both the tiff and the cr2 type. That is not surprising, since CR2 is based on TIFF.

Filetype version: 1.0.7

Sample code:

from filetype.types.image import Tiff, Cr2
from filetype import match
match("Path to cr2 file", matchers=[Cr2()])
match("Path to cr2 file", matchers=[Tiff()])

Result is:

>>> match("D:\Downloads\RAW_CANON_1DM2.CR2", matchers=[Cr2()])
<filetype.types.image.Cr2 object at 0x0000029F0EF2C9E8>
>>> match("D:\Downloads\RAW_CANON_1DM2.CR2", matchers=[Tiff()])
<filetype.types.image.Tiff object at 0x0000029F0EF2CA20>

Should be:

>>> match("D:\Downloads\RAW_CANON_1DM2.CR2", matchers=[Cr2()])
<filetype.types.image.Cr2 object at 0x0000029F0EF2C9E8>
>>> match("D:\Downloads\RAW_CANON_1DM2.CR2", matchers=[Tiff()])

You can take a sample CR2 file here.

I think that to solve this problem we need to add something like and not (buf[8] == 0x43 and buf[9] == 0x52) here, to make sure that there is no CR2 magic word in the buffer.
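A minimal sketch of that idea, written as standalone functions rather than as a patch to the library. The names has_tiff_header, is_cr2 and is_plain_tiff are hypothetical and are not part of filetype; the only facts relied on are the two TIFF byte-order headers and the "CR" bytes (0x43, 0x52) at offsets 8 and 9 mentioned above.

```python
# Hypothetical helpers, not filetype's API: they only illustrate the
# "TIFF header, but no CR2 marker" check proposed in this issue.

TIFF_LE = b"II*\x00"  # little-endian TIFF header (0x49 0x49 0x2A 0x00)
TIFF_BE = b"MM\x00*"  # big-endian TIFF header (0x4D 0x4D 0x00 0x2A)
CR2_MARK = b"CR"      # 0x43 0x52 at offsets 8 and 9 of a CR2 file


def has_tiff_header(buf):
    """True if the buffer starts with either TIFF byte-order header."""
    return bytes(buf[:4]) in (TIFF_LE, TIFF_BE)


def is_cr2(buf):
    """True if the buffer has a TIFF header followed by the CR2 marker."""
    return has_tiff_header(buf) and bytes(buf[8:10]) == CR2_MARK


def is_plain_tiff(buf):
    """True for a TIFF header that is not also a CR2 header."""
    return has_tiff_header(buf) and not is_cr2(buf)
```

With a check of this shape, a CR2 buffer satisfies is_cr2 only, while an ordinary TIFF buffer satisfies is_plain_tiff only. Slicing is used instead of indexing, so buffers shorter than ten bytes simply fail the CR2 check instead of raising IndexError.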

h2non commented 4 years ago

Happy to merge a PR with the fix.

dosas commented 3 years ago

This problem could never have been caught with a simple test. Because of this return statement (https://github.com/h2non/filetype.py/blob/master/filetype/match.py#L33), the matches (and probably also the performance) depend on the position of the Type class in the list. Shouldn't the matcher iterate over all possible types and raise an error if more than one match is found?
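A rough sketch of what such a strict matcher could look like. It assumes only that each matcher object exposes a match(buf) method returning a truthy value on a hit; match_strict, AmbiguousMatch and the StartsWith stand-in are hypothetical names made up for this illustration and do not exist in filetype.

```python
# Hypothetical sketch: check every matcher instead of returning on the
# first hit, and fail loudly when the result is ambiguous.


class AmbiguousMatch(Exception):
    """Raised when more than one matcher accepts the same buffer."""


class StartsWith:
    """Stand-in matcher for the example: matches a fixed byte prefix."""

    def __init__(self, prefix):
        self.prefix = prefix

    def match(self, buf):
        return bytes(buf).startswith(self.prefix)


def match_strict(buf, matchers):
    """Return the single matcher that accepts buf, or None if none does."""
    hits = [m for m in matchers if m.match(buf)]
    if len(hits) > 1:
        names = ", ".join(type(m).__name__ for m in hits)
        raise AmbiguousMatch("buffer matched more than one type: " + names)
    return hits[0] if hits else None
```

The result no longer depends on the order of the list, at the cost of always running every matcher. Whether raising is acceptable for formats that are deliberately layered on one another, as CR2 is on TIFF, is a separate design question.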