trailofbits / polyfile

A pure Python cleanroom implementation of libmagic, with instrumented parsing from Kaitai struct and an interactive hex viewer
Apache License 2.0

Differential Testing Against `file` #3373

Closed by ESultanik 2 years ago

ESultanik commented 2 years ago

This PR adds auto-generated differential tests that compare the output of PolyFile against `file`/libmagic using 900+ files from Ange Albertini's Corkami corpus.

The tests revealed several bugs in both PolyFile and libmagic. The PolyFile bugs are fixed in this PR; in particular, PolyFile's handling of libmagic regular expressions was faulty.

This PR also includes several improvements to PolyFile's interactive debugger, implemented in order to investigate the differentials.

Of the 942 files in the Corkami corpus, PolyFile now matches at least as many MIME types as `file` for all but two. One of those discrepancies is due to an incorrect classification by `file`, and the other is due to an incorrect classification by PolyFile.
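For readers unfamiliar with the approach, here is a minimal sketch of what a differential check of this kind can look like. It is not the test code added in this PR. The names `Classifier`, `file_mime_types`, and `find_differentials` are hypothetical, and the sketch assumes that "at least as many MIME types" means every MIME type reported by the reference is also reported by the candidate. The PolyFile side is left as an injected callable rather than guessed at.

```python
import subprocess
from pathlib import Path
from typing import Callable, Iterable, List, Set, Tuple

# Hypothetical type alias: a classifier maps a path to a set of MIME types.
Classifier = Callable[[Path], Set[str]]


def file_mime_types(path: Path) -> Set[str]:
    """Reference classifier: ask the `file` CLI for a single MIME type.

    Assumes `file` is installed and on PATH.
    """
    result = subprocess.run(
        ["file", "--brief", "--mime-type", str(path)],
        capture_output=True,
        text=True,
        check=True,
    )
    mime = result.stdout.strip()
    return {mime} if mime else set()


def find_differentials(
    paths: Iterable[Path],
    candidate: Classifier,
    reference: Classifier,
) -> List[Tuple[Path, Set[str]]]:
    """Return (path, missing types) for every file where the candidate
    fails to report a MIME type that the reference reports.

    Assumption: the pass criterion is "reference types are a subset of
    candidate types"; extra candidate types are not treated as failures.
    """
    differentials = []
    for path in paths:
        missing = reference(path) - candidate(path)
        if missing:
            differentials.append((path, missing))
    return differentials
```

In a real run, `reference` would be `file_mime_types`, `candidate` would wrap PolyFile's matcher, and `paths` would enumerate the corpus directory. Each returned entry is a differential to investigate, and as the results above show, the fault may lie with either tool.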