use file format specific reader

markacola commented 2 years ago

I ran into an issue where the IDisposable coming through to IsMatch was from a different IFileFormatReader.

All tests are currently passing but this does change the behaviour of potentially reading the stream multiple times based on the implementation of the IFileFormatReaders Read method.

neilharvey commented 2 years ago

My concern here is that reading the entire stream could have a negative impact on performance.

Here are differences in the benchmarks before and after the change:

Before

Method	FileName	Mean	Error	StdDev	Median	Gen 0	Allocated
DetermineFileFormat	dragndrop.msg	509.1 us	10.08 us	22.55 us	509.4 us	26.3672	83 KB
DetermineFileFormat	saved.msg	457.2 us	8.99 us	10.70 us	457.5 us	26.3672	82 KB
DetermineFileFormat	test.doc	397.3 us	5.78 us	5.94 us	398.0 us	17.5781	55 KB
DetermineFileFormat	test.docx	425.8 us	8.10 us	8.66 us	426.2 us	6.3477	19 KB
DetermineFileFormat	test.odp	591.7 us	13.71 us	40.42 us	594.7 us	8.7891	27 KB
DetermineFileFormat	test.ods	543.9 us	10.85 us	19.57 us	541.3 us	7.8125	25 KB
DetermineFileFormat	test.odt	478.4 us	8.84 us	9.82 us	475.2 us	5.3711	18 KB
DetermineFileFormat	test.ppt	451.9 us	8.30 us	7.76 us	451.1 us	18.0664	56 KB
DetermineFileFormat	test.pptx	727.7 us	14.20 us	24.87 us	735.6 us	12.6953	39 KB
DetermineFileFormat	test.vsd	404.1 us	9.28 us	27.37 us	392.1 us	17.5781	55 KB
DetermineFileFormat	test.vsdx	451.2 us	6.49 us	6.07 us	451.1 us	7.8125	25 KB
DetermineFileFormat	test.xls	373.4 us	2.81 us	2.63 us	373.5 us	16.1133	51 KB
DetermineFileFormat	test.xlsx	401.6 us	3.34 us	3.12 us	401.7 us	5.8594	18 KB
DetermineFileFormat	test.xps	406.4 us	4.58 us	4.29 us	405.9 us	5.8594	19 KB
DetermineFileFormat	test.zip	362.0 us	3.05 us	2.85 us	361.5 us	3.9063	13 KB
DetermineFileFormat	test2.ppt	380.4 us	2.61 us	2.44 us	379.1 us	17.5781	55 KB

After

Method	FileName	Mean	Error	StdDev	Median	Gen 0	Allocated
DetermineFileFormat	dragndrop.msg	962.0 us	5.10 us	4.53 us	961.4 us	117.1875	364 KB
DetermineFileFormat	saved.msg	1,104.8 us	19.97 us	17.71 us	1,112.2 us	115.2344	356 KB
DetermineFileFormat	test.doc	668.9 us	29.47 us	86.89 us	641.2 us	71.2891	222 KB
DetermineFileFormat	test.docx	872.5 us	28.28 us	83.39 us	904.6 us	16.6016	52 KB
DetermineFileFormat	test.odp	1,169.4 us	28.45 us	83.87 us	1,196.1 us	29.2969	90 KB
DetermineFileFormat	test.ods	966.9 us	4.04 us	7.48 us	965.8 us	25.3906	80 KB
DetermineFileFormat	test.odt	817.8 us	27.00 us	79.60 us	827.0 us	13.6719	44 KB
DetermineFileFormat	test.ppt	775.4 us	19.59 us	57.77 us	802.4 us	73.2422	226 KB
DetermineFileFormat	test.pptx	1,579.8 us	31.33 us	59.61 us	1,597.3 us	48.8281	150 KB
DetermineFileFormat	test.vsd	746.7 us	14.68 us	32.22 us	758.7 us	71.2891	221 KB
DetermineFileFormat	test.vsdx	1,128.3 us	21.61 us	25.73 us	1,137.2 us	25.3906	81 KB
DetermineFileFormat	test.xls	686.8 us	13.46 us	24.95 us	693.0 us	65.4297	204 KB
DetermineFileFormat	test.xlsx	894.5 us	12.84 us	12.01 us	896.7 us	14.6484	46 KB
DetermineFileFormat	test.xps	899.9 us	17.81 us	27.20 us	911.8 us	14.6484	47 KB
DetermineFileFormat	test.zip	622.1 us	12.42 us	35.22 us	627.5 us	4.8828	17 KB
DetermineFileFormat	test2.ppt	763.1 us	14.82 us	22.63 us	770.3 us	71.2891	222 KB

The memory allocation for Office documents is much higher and runs slower due to the multiple reads.

My original assumption here was that any matched IFileFormatReader would have the same signature, so for example if the matched readers were all zip-derived formats then we could read the zip file (once) which would be compatible with all of the matched readers.

Perhaps what we need to do instead is group the matched readers by some key. That way I could group all my zip formats together, read once, then find the matching format. In your use case you could just set different keys on your IFileFormatReader implementations so that you could entire a fresh read for each format.

Implementation could look something like this:

var groups = candidates
    .OfType<IFileFormatReader>()
    .GroupBy(x => x.Key)
    .ToList();

foreach(var group in groups)
{
    using var file = group[0].Read(stream);
    foreach(reader in group)
    {
         if (!reader.IsMatch(file))
        {
            candidates.Remove((FileFormat)reader);
        }       
    }
}

What are your thoughts?

JerryJian commented 11 months ago

I hope to merge this request as it may cause memory overflow and program crash.

markacola commented 11 months ago

From memory, the original reason for opening this was due to a matched file format needing to read the whole stream to confirm the type. I am no longer working in C# much however, unless there is an outstanding need for this change, I will close this PR.

neilharvey / FileSignatures

use file format specific reader #41

Before

After