Closed agerchev closed 4 years ago
I've set up some benchmark tests, and it looks like the culprit may be the CompoundFileBinary
format, which is used to read legacy Office formats. It looks like when we instantiate CompoundFile
it causes a spike in memory usage. I need to do more diagnostics to get a clearer picture, then go from there.
@neilharvey we are using it in a service for uploading a lot of files and the profiler shows excessive memory allocations in the method. we can simulate it with code like this one:
FileFormatInspector inspector = new FileFormatInspector();
for (int i = 0; i < 10000; i++)
{
using (var stream = File.OpenRead(@"C:\Tmp\test.pdf"))
{
FileFormat detectedUploadedFileType = inspector.DetermineFileFormat(stream);
}
}
here is the profiling info:
I saw that you are allocating buffer for the header to read, but the length of the buffer actually the stream length
bufferLength = stream.Length;
why don't you get that maximum headerlength from the known formats ?
It did used to work that way - mostly likely changed to allow for certain file types (e.g. CFB-derived formats) to be able to be able to read the entire stream. Can probably tweak the logic there to optimise.
Still leaves an issue with CompoundFileBinary
format, but are those an issue for you?
Yes, we have to check them too. I've checked that CompoundFile is not so aggressive in memory allocations. Is there a way that the stream itself is supplied to the method and not a byte array that is loaded and wrapped in a memoryStream ?
using (var fileData = new MemoryStream(header))
{
CompoundFile cf = new CompoundFile(fileData);
return cf.RootStorage.TryGetStream(Storage) != null ? true : false;
}
Just to explain why I have my suspicions about the CompoundFileBinary format being the issue here, if I run some initial benchmarks, here is the memory usage.
Method | File | Size | Gen 0 | Gen 1 | Gen 2 | Allocated |
---|---|---|---|---|---|---|
DetermineFileFormat | msg | 20 KB | 125.0000 | - | - | 386.77 KB |
DetermineFileFormat | msg | 21 KB | 121.0938 | - | - | 378.33 KB |
DetermineFileFormat | ppt | 96 KB | 92.7734 | 30.2734 | 30.2734 | 320.03 KB |
DetermineFileFormat | ppt | 84 KB | 80.0781 | 26.3672 | 26.3672 | 310.38 KB |
DetermineFileFormat | odp | 196 KB | 61.5234 | 61.5234 | 61.5234 | 289.74 KB |
DetermineFileFormat | vsd | 42 KB | 85.4492 | - | - | 264.7 KB |
DetermineFileFormat | doc | 22 KB | 79.1016 | - | - | 244.83 KB |
DetermineFileFormat | xls | 25 KB | 74.7070 | - | - | 230.35 KB |
DetermineFileFormat | xps | 141 KB | 45.4102 | 45.4102 | 45.4102 | 190.66 KB |
DetermineFileFormat | pptx | 29 KB | 59.5703 | - | - | 184.97 KB |
DetermineFileFormat | 138 KB | 43.4570 | 43.4570 | 43.4570 | 149.31 KB | |
DetermineFileFormat | vsdx | 30 KB | 36.6211 | - | - | 113.73 KB |
DetermineFileFormat | ods | 19 KB | 33.2031 | - | - | 102.08 KB |
DetermineFileFormat | docx | 12 KB | 20.9961 | - | - | 65.59 KB |
DetermineFileFormat | bmp | 44 KB | 17.8223 | - | - | 55.22 KB |
DetermineFileFormat | jpg | 33 KB | 14.4043 | - | - | 44.53 KB |
DetermineFileFormat | png | 11 KB | 7.0801 | - | - | 22.16 KB |
DetermineFileFormat | rtf | 38 KB | 16.1133 | - | - | 49.78 KB |
Notice how much memory is being used relative to the file size for the CompoundFileBinary derived types (msg ppt, vsd, doc, xls, etc.):
If I then comment out the creation of CompoundFile
, leaving the memory stream in place:
using (var fileData = new MemoryStream(header))
{
// var cf = new CompoundFile(fileData);
// return cf.RootStorage.TryGetStream(Storage) != null ? true : false;
return false;
}
The relative memory usage of the CFB types drops hugely:
Method | File | Size | Gen 0 | Gen 1 | Gen 2 | Allocated |
---|---|---|---|---|---|---|
DetermineFileFormat | odp | 196 KB | 43.4570 | 61.5234 | 61.5234 | 289.74 KB |
DetermineFileFormat | xps | 141 KB | 30.7617 | 45.4102 | 45.4102 | 190.66 KB |
DetermineFileFormat | pptx | 29 KB | 16.1133 | - | - | 184.97 KB |
DetermineFileFormat | 138 KB | 26.8555 | 43.4570 | 43.4570 | 149.31 KB | |
DetermineFileFormat | vsdx | 30 KB | 11.7188 | - | - | 113.73 KB |
DetermineFileFormat | ppt | 96 KB | 10.9863 | 30.7617 | 30.7617 | 108.37 KB |
DetermineFileFormat | ods | 19 KB | 7.0801 | - | - | 102.08 KB |
DetermineFileFormat | ppt | 84 KB | 17.8223 | 26.8555 | 26.8555 | 95.82 KB |
DetermineFileFormat | docx | 12 KB | 61.5234 | - | - | 65.59 KB |
DetermineFileFormat | bmp | 44 KB | 20.9961 | - | - | 55.22 KB |
DetermineFileFormat | vsd | 42 KB | 36.6211 | - | - | 54.19 KB |
DetermineFileFormat | rtf | 38 KB | 17.3340 | - | - | 49.78 KB |
DetermineFileFormat | jpg | 33 KB | 33.2031 | - | - | 44.53 KB |
DetermineFileFormat | xls | 25 KB | 45.4102 | - | - | 36.62 KB |
DetermineFileFormat | doc | 22 KB | 14.4043 | - | - | 34.11 KB |
DetermineFileFormat | msg | 21 KB | 10.4980 | - | - | 32.61 KB |
DetermineFileFormat | msg | 20 KB | 10.0098 | - | - | 31.6 KB |
DetermineFileFormat | png | 11 KB | 59.5703 | - | - | 22.16 KB |
That said, I agree that the allocation of the header byte array is a problem and could be improved, probably by passing the stream into IsMatch
, as you suggested.
My concern is that doing that won't actually solve the issue, which is why I'll take a closer look at the CompoundFile
usage at the same time.
Hi, when we call DetermineFileFormat(Stream) the methods allocates a lot of memory. I suppose that the problem is in FindMatchingFormats where a byte array is allocated.
Is there a more optimal usage ?