h2non / filetype.py

Small, dependency-free, fast Python package to infer binary file types checking the magic numbers signature
https://h2non.github.io/filetype.py
MIT License

Use a file signatures table to speed up the file type recognition #13

Open vuolter opened 7 years ago

vuolter commented 7 years ago

I think that pre-building a dict containing all the magic signatures for the file header lookup would be more time-efficient than calling each type object in turn to find the matching file header.
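To illustrate the idea, here is a minimal sketch of such a table. It is not the library's code: `SIGNATURES`, `LENGTHS` and `guess_mime` are hypothetical names, and it only handles signatures that start at offset 0.

```python
# Hypothetical signature table: leading magic bytes -> MIME type.
# Only a handful of well-known signatures are listed for illustration.
SIGNATURES = {
    b"\xff\xd8\xff": "image/jpeg",
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"GIF87a": "image/gif",
    b"GIF89a": "image/gif",
    b"%PDF": "application/pdf",
    b"PK\x03\x04": "application/zip",
}

# Distinct signature lengths, longest first, so the most specific
# prefix is tried before shorter ones.
LENGTHS = sorted({len(sig) for sig in SIGNATURES}, reverse=True)


def guess_mime(header):
    """Return the MIME type whose signature prefixes `header`, or None."""
    for n in LENGTHS:
        # One dict lookup per distinct signature length, instead of
        # one matcher call per supported type.
        mime = SIGNATURES.get(bytes(header[:n]))
        if mime is not None:
            return mime
    return None
```

With this layout the cost depends on the number of distinct signature lengths rather than on the number of supported types. Formats whose signature sits at a non-zero offset would still need separate handling.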

h2non commented 7 years ago

That might be true, but without metrics we don't know. Also, IMO the library is already fast enough for 99% of use cases. Why do you need top performance here? What is currently impacting you?

vuolter commented 7 years ago

I tried it on a set of several thousand files and the performance wasn't great but, I admit, mine was very much an edge case. :p

h2non commented 7 years ago

Interesting... my impression is that this is a CPython limitation more than an implementation performance issue, but we can try improving things. If you can lead this by preparing some performance test scenarios that I can easily reproduce, that would be great.

vuolter commented 7 years ago

Hi, preparing a general performance test suite is a bit difficult here because of the nature of the physical medium on which the test would be performed. If we try to process that many files in parallel from a single HDD, its I/O limit will be reached very quickly, but if the files were split across several SSDs, the result should be more or less limited by drive performance.

h2non commented 7 years ago

I would suggest that a performance test for this scenario should not involve any I/O at all. Including I/O would make the measurement inaccurate, and therefore irrelevant.

Instead, the performance suite should only cover the actual code logic being measured. In this context, that means passing a binary buffer representing the file signature, up to 256 bytes. That's all you need, with no disk I/O involved.
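A rough sketch of what such an I/O-free benchmark could look like, using only `timeit` from the standard library. The names `make_buffer`, `BUFFERS`, `naive_guess` and `bench` are hypothetical, and `naive_guess` is just a stand-in for the function under test; with the package installed, `filetype.guess`, which accepts a bytes buffer, could be passed in its place.

```python
import timeit


def make_buffer(signature, size=256):
    """Pad a signature with zero bytes to a fixed-size in-memory header."""
    return signature + b"\x00" * (size - len(signature))


# In-memory headers only: no file is opened or read during the benchmark.
BUFFERS = [
    make_buffer(b"\xff\xd8\xff"),        # JPEG
    make_buffer(b"\x89PNG\r\n\x1a\n"),   # PNG
    make_buffer(b"%PDF"),                # PDF
    make_buffer(b""),                    # unknown type: worst case
]


def naive_guess(header):
    """Stand-in matcher (assumption): tries each signature in turn."""
    for sig, name in (
        (b"\xff\xd8\xff", "jpg"),
        (b"\x89PNG\r\n\x1a\n", "png"),
        (b"%PDF", "pdf"),
    ):
        if header.startswith(sig):
            return name
    return None


def bench(func, buffers=BUFFERS, number=10_000):
    """Return the average time in seconds of one `func(buffer)` call."""
    total = timeit.timeit(lambda: [func(b) for b in buffers], number=number)
    return total / (number * len(buffers))
```

Calling `bench(naive_guess)` and `bench(other_implementation)` on the same buffers gives two comparable per-call figures that do not depend on the storage medium.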

vuolter commented 6 years ago

OK, I'll try to prepare a draft of the new code and make a PR. ;)

ghost commented 5 years ago

Magic bytes don't work for complex container types like ISO-BMFF (MP4, MOV, HEIF/HEIC) and Matroska (MKV, WEBM). The headers need to be parsed to determine the format.
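As an illustration of why these containers need parsing, here is a minimal sketch that reads the `ftyp` box of an ISO-BMFF file from a header buffer. It assumes the `ftyp` box is the first box in the file, which is common but not something this sketch verifies beyond the type check; `read_ftyp` is a hypothetical helper, not part of the library.

```python
import struct


def read_ftyp(header):
    """Return (major_brand, compatible_brands) from a leading ftyp box.

    Returns None if the buffer does not start with an ftyp box.
    Box layout: 4-byte big-endian size, 4-byte type, then for ftyp a
    4-byte major brand, a 4-byte minor version and a list of 4-byte
    compatible brands.
    """
    if len(header) < 16:
        return None
    size, box_type = struct.unpack(">I4s", header[:8])
    if box_type != b"ftyp":
        return None
    # Sizes 0 (box runs to end of file) and 1 (64-bit size follows)
    # are not handled in this sketch.
    if size < 16:
        return None
    major = bytes(header[8:12])
    end = min(size, len(header))
    # Compatible brands start after the minor version, at offset 16.
    compatible = [bytes(header[i:i + 4]) for i in range(16, end - 3, 4)]
    return major, compatible
```

The same first eight bytes can therefore introduce an MP4, a MOV or a HEIC file; only the brands inside the box tell them apart, for example `heic` or `mif1` for HEIF images and `isom` or `mp42` for MP4 video.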