neilharvey / FileSignatures

A small library for detecting the type of a file based on header signature (also known as magic number).
MIT License
250 stars 41 forks source link

use file format specific reader #41

Closed markacola closed 11 months ago

markacola commented 2 years ago

I ran into an issue where the IDisposable coming through to IsMatch was from a different IFileFormatReader.

All tests are currently passing but this does change the behaviour of potentially reading the stream multiple times based on the implementation of the IFileFormatReaders Read method.

neilharvey commented 2 years ago

My concern here is that reading the entire stream could have a negative impact on performance.

Here are differences in the benchmarks before and after the change:

Before

Method FileName Mean Error StdDev Median Gen 0 Allocated
DetermineFileFormat dragndrop.msg 509.1 us 10.08 us 22.55 us 509.4 us 26.3672 83 KB
DetermineFileFormat saved.msg 457.2 us 8.99 us 10.70 us 457.5 us 26.3672 82 KB
DetermineFileFormat test.doc 397.3 us 5.78 us 5.94 us 398.0 us 17.5781 55 KB
DetermineFileFormat test.docx 425.8 us 8.10 us 8.66 us 426.2 us 6.3477 19 KB
DetermineFileFormat test.odp 591.7 us 13.71 us 40.42 us 594.7 us 8.7891 27 KB
DetermineFileFormat test.ods 543.9 us 10.85 us 19.57 us 541.3 us 7.8125 25 KB
DetermineFileFormat test.odt 478.4 us 8.84 us 9.82 us 475.2 us 5.3711 18 KB
DetermineFileFormat test.ppt 451.9 us 8.30 us 7.76 us 451.1 us 18.0664 56 KB
DetermineFileFormat test.pptx 727.7 us 14.20 us 24.87 us 735.6 us 12.6953 39 KB
DetermineFileFormat test.vsd 404.1 us 9.28 us 27.37 us 392.1 us 17.5781 55 KB
DetermineFileFormat test.vsdx 451.2 us 6.49 us 6.07 us 451.1 us 7.8125 25 KB
DetermineFileFormat test.xls 373.4 us 2.81 us 2.63 us 373.5 us 16.1133 51 KB
DetermineFileFormat test.xlsx 401.6 us 3.34 us 3.12 us 401.7 us 5.8594 18 KB
DetermineFileFormat test.xps 406.4 us 4.58 us 4.29 us 405.9 us 5.8594 19 KB
DetermineFileFormat test.zip 362.0 us 3.05 us 2.85 us 361.5 us 3.9063 13 KB
DetermineFileFormat test2.ppt 380.4 us 2.61 us 2.44 us 379.1 us 17.5781 55 KB

After

Method FileName Mean Error StdDev Median Gen 0 Allocated
DetermineFileFormat dragndrop.msg 962.0 us 5.10 us 4.53 us 961.4 us 117.1875 364 KB
DetermineFileFormat saved.msg 1,104.8 us 19.97 us 17.71 us 1,112.2 us 115.2344 356 KB
DetermineFileFormat test.doc 668.9 us 29.47 us 86.89 us 641.2 us 71.2891 222 KB
DetermineFileFormat test.docx 872.5 us 28.28 us 83.39 us 904.6 us 16.6016 52 KB
DetermineFileFormat test.odp 1,169.4 us 28.45 us 83.87 us 1,196.1 us 29.2969 90 KB
DetermineFileFormat test.ods 966.9 us 4.04 us 7.48 us 965.8 us 25.3906 80 KB
DetermineFileFormat test.odt 817.8 us 27.00 us 79.60 us 827.0 us 13.6719 44 KB
DetermineFileFormat test.ppt 775.4 us 19.59 us 57.77 us 802.4 us 73.2422 226 KB
DetermineFileFormat test.pptx 1,579.8 us 31.33 us 59.61 us 1,597.3 us 48.8281 150 KB
DetermineFileFormat test.vsd 746.7 us 14.68 us 32.22 us 758.7 us 71.2891 221 KB
DetermineFileFormat test.vsdx 1,128.3 us 21.61 us 25.73 us 1,137.2 us 25.3906 81 KB
DetermineFileFormat test.xls 686.8 us 13.46 us 24.95 us 693.0 us 65.4297 204 KB
DetermineFileFormat test.xlsx 894.5 us 12.84 us 12.01 us 896.7 us 14.6484 46 KB
DetermineFileFormat test.xps 899.9 us 17.81 us 27.20 us 911.8 us 14.6484 47 KB
DetermineFileFormat test.zip 622.1 us 12.42 us 35.22 us 627.5 us 4.8828 17 KB
DetermineFileFormat test2.ppt 763.1 us 14.82 us 22.63 us 770.3 us 71.2891 222 KB

The memory allocation for Office documents is much higher and runs slower due to the multiple reads.

My original assumption here was that any matched IFileFormatReader would have the same signature, so for example if the matched readers were all zip-derived formats then we could read the zip file (once) which would be compatible with all of the matched readers.

Perhaps what we need to do instead is group the matched readers by some key. That way I could group all my zip formats together, read once, then find the matching format. In your use case you could just set different keys on your IFileFormatReader implementations so that you could entire a fresh read for each format.

Implementation could look something like this:

var groups = candidates
    .OfType<IFileFormatReader>()
    .GroupBy(x => x.Key)
    .ToList();

foreach(var group in groups)
{
    using var file = group[0].Read(stream);
    foreach(reader in group)
    {
         if (!reader.IsMatch(file))
        {
            candidates.Remove((FileFormat)reader);
        }       
    }
}

What are your thoughts?

JerryJian commented 11 months ago

I hope to merge this request as it may cause memory overflow and program crash.

markacola commented 11 months ago

From memory, the original reason for opening this was due to a matched file format needing to read the whole stream to confirm the type. I am no longer working in C# much however, unless there is an outstanding need for this change, I will close this PR.