Closed markacola closed 11 months ago
My concern here is that reading the entire stream could have a negative impact on performance.
Here are differences in the benchmarks before and after the change:
Method | FileName | Mean | Error | StdDev | Median | Gen 0 | Allocated |
---|---|---|---|---|---|---|---|
DetermineFileFormat | dragndrop.msg | 509.1 us | 10.08 us | 22.55 us | 509.4 us | 26.3672 | 83 KB |
DetermineFileFormat | saved.msg | 457.2 us | 8.99 us | 10.70 us | 457.5 us | 26.3672 | 82 KB |
DetermineFileFormat | test.doc | 397.3 us | 5.78 us | 5.94 us | 398.0 us | 17.5781 | 55 KB |
DetermineFileFormat | test.docx | 425.8 us | 8.10 us | 8.66 us | 426.2 us | 6.3477 | 19 KB |
DetermineFileFormat | test.odp | 591.7 us | 13.71 us | 40.42 us | 594.7 us | 8.7891 | 27 KB |
DetermineFileFormat | test.ods | 543.9 us | 10.85 us | 19.57 us | 541.3 us | 7.8125 | 25 KB |
DetermineFileFormat | test.odt | 478.4 us | 8.84 us | 9.82 us | 475.2 us | 5.3711 | 18 KB |
DetermineFileFormat | test.ppt | 451.9 us | 8.30 us | 7.76 us | 451.1 us | 18.0664 | 56 KB |
DetermineFileFormat | test.pptx | 727.7 us | 14.20 us | 24.87 us | 735.6 us | 12.6953 | 39 KB |
DetermineFileFormat | test.vsd | 404.1 us | 9.28 us | 27.37 us | 392.1 us | 17.5781 | 55 KB |
DetermineFileFormat | test.vsdx | 451.2 us | 6.49 us | 6.07 us | 451.1 us | 7.8125 | 25 KB |
DetermineFileFormat | test.xls | 373.4 us | 2.81 us | 2.63 us | 373.5 us | 16.1133 | 51 KB |
DetermineFileFormat | test.xlsx | 401.6 us | 3.34 us | 3.12 us | 401.7 us | 5.8594 | 18 KB |
DetermineFileFormat | test.xps | 406.4 us | 4.58 us | 4.29 us | 405.9 us | 5.8594 | 19 KB |
DetermineFileFormat | test.zip | 362.0 us | 3.05 us | 2.85 us | 361.5 us | 3.9063 | 13 KB |
DetermineFileFormat | test2.ppt | 380.4 us | 2.61 us | 2.44 us | 379.1 us | 17.5781 | 55 KB |
Method | FileName | Mean | Error | StdDev | Median | Gen 0 | Allocated |
---|---|---|---|---|---|---|---|
DetermineFileFormat | dragndrop.msg | 962.0 us | 5.10 us | 4.53 us | 961.4 us | 117.1875 | 364 KB |
DetermineFileFormat | saved.msg | 1,104.8 us | 19.97 us | 17.71 us | 1,112.2 us | 115.2344 | 356 KB |
DetermineFileFormat | test.doc | 668.9 us | 29.47 us | 86.89 us | 641.2 us | 71.2891 | 222 KB |
DetermineFileFormat | test.docx | 872.5 us | 28.28 us | 83.39 us | 904.6 us | 16.6016 | 52 KB |
DetermineFileFormat | test.odp | 1,169.4 us | 28.45 us | 83.87 us | 1,196.1 us | 29.2969 | 90 KB |
DetermineFileFormat | test.ods | 966.9 us | 4.04 us | 7.48 us | 965.8 us | 25.3906 | 80 KB |
DetermineFileFormat | test.odt | 817.8 us | 27.00 us | 79.60 us | 827.0 us | 13.6719 | 44 KB |
DetermineFileFormat | test.ppt | 775.4 us | 19.59 us | 57.77 us | 802.4 us | 73.2422 | 226 KB |
DetermineFileFormat | test.pptx | 1,579.8 us | 31.33 us | 59.61 us | 1,597.3 us | 48.8281 | 150 KB |
DetermineFileFormat | test.vsd | 746.7 us | 14.68 us | 32.22 us | 758.7 us | 71.2891 | 221 KB |
DetermineFileFormat | test.vsdx | 1,128.3 us | 21.61 us | 25.73 us | 1,137.2 us | 25.3906 | 81 KB |
DetermineFileFormat | test.xls | 686.8 us | 13.46 us | 24.95 us | 693.0 us | 65.4297 | 204 KB |
DetermineFileFormat | test.xlsx | 894.5 us | 12.84 us | 12.01 us | 896.7 us | 14.6484 | 46 KB |
DetermineFileFormat | test.xps | 899.9 us | 17.81 us | 27.20 us | 911.8 us | 14.6484 | 47 KB |
DetermineFileFormat | test.zip | 622.1 us | 12.42 us | 35.22 us | 627.5 us | 4.8828 | 17 KB |
DetermineFileFormat | test2.ppt | 763.1 us | 14.82 us | 22.63 us | 770.3 us | 71.2891 | 222 KB |
The memory allocation for Office documents is much higher and runs slower due to the multiple reads.
My original assumption here was that any matched IFileFormatReader
would have the same signature, so for example if the matched readers were all zip-derived formats then we could read the zip file (once) which would be compatible with all of the matched readers.
Perhaps what we need to do instead is group the matched readers by some key. That way I could group all my zip formats together, read once, then find the matching format. In your use case you could just set different keys on your IFileFormatReader
implementations so that you could entire a fresh read for each format.
Implementation could look something like this:
var groups = candidates
.OfType<IFileFormatReader>()
.GroupBy(x => x.Key)
.ToList();
foreach(var group in groups)
{
using var file = group[0].Read(stream);
foreach(reader in group)
{
if (!reader.IsMatch(file))
{
candidates.Remove((FileFormat)reader);
}
}
}
What are your thoughts?
I hope to merge this request as it may cause memory overflow and program crash.
From memory, the original reason for opening this was due to a matched file format needing to read the whole stream to confirm the type. I am no longer working in C# much however, unless there is an outstanding need for this change, I will close this PR.
I ran into an issue where the
IDisposable
coming through toIsMatch
was from a differentIFileFormatReader
.All tests are currently passing but this does change the behaviour of potentially reading the stream multiple times based on the implementation of the
IFileFormatReader
sRead
method.