neilharvey / FileSignatures

A small library for detecting the type of a file based on header signature (also known as magic number).
MIT License
250 stars 41 forks source link

High memory usage in DetermineFileFormat #22

Closed agerchev closed 4 years ago

agerchev commented 4 years ago

Hi, when we call DetermineFileFormat(Stream) the methods allocates a lot of memory. I suppose that the problem is in FindMatchingFormats where a byte array is allocated.

private List<FileFormat> FindMatchingFormats(Stream stream)
        {
            var bufferLength = stream.Length;
            var bytesRead = 0;
            var header = new byte[bufferLength];
            var candidates = knownFormats
                .OrderBy(t => t.HeaderLength)
                .ToList();

Is there a more optimal usage ?

neilharvey commented 4 years ago

I've set up some benchmark tests, and it looks like the culprit may be the CompoundFileBinary format, which is used to read legacy Office formats. It looks like when we instantiate CompoundFile it causes a spike in memory usage. I need to do more diagnostics to get a clearer picture, then go from there.

agerchev commented 4 years ago

@neilharvey we are using it in a service for uploading a lot of files and the profiler shows excessive memory allocations in the method. we can simulate it with code like this one:

FileFormatInspector inspector = new FileFormatInspector();

                for (int i = 0; i < 10000; i++)
                {
                    using (var stream = File.OpenRead(@"C:\Tmp\test.pdf"))
                    {
                        FileFormat detectedUploadedFileType = inspector.DetermineFileFormat(stream);
                    }
                }

here is the profiling info:

FileFormatInspectorMemoryUsage

I saw that you are allocating buffer for the header to read, but the length of the buffer actually the stream length bufferLength = stream.Length; why don't you get that maximum headerlength from the known formats ?

neilharvey commented 4 years ago

It did used to work that way - mostly likely changed to allow for certain file types (e.g. CFB-derived formats) to be able to be able to read the entire stream. Can probably tweak the logic there to optimise.

Still leaves an issue with CompoundFileBinary format, but are those an issue for you?

agerchev commented 4 years ago

Yes, we have to check them too. I've checked that CompoundFile is not so aggressive in memory allocations. Is there a way that the stream itself is supplied to the method and not a byte array that is loaded and wrapped in a memoryStream ?

using (var fileData = new MemoryStream(header))
                {
                    CompoundFile cf = new CompoundFile(fileData);
                    return cf.RootStorage.TryGetStream(Storage) != null ? true : false;
                }
neilharvey commented 4 years ago

Just to explain why I have my suspicions about the CompoundFileBinary format being the issue here, if I run some initial benchmarks, here is the memory usage.

Method File Size Gen 0 Gen 1 Gen 2 Allocated
DetermineFileFormat msg 20 KB 125.0000 - - 386.77 KB
DetermineFileFormat msg 21 KB 121.0938 - - 378.33 KB
DetermineFileFormat ppt 96 KB 92.7734 30.2734 30.2734 320.03 KB
DetermineFileFormat ppt 84 KB 80.0781 26.3672 26.3672 310.38 KB
DetermineFileFormat odp 196 KB 61.5234 61.5234 61.5234 289.74 KB
DetermineFileFormat vsd 42 KB 85.4492 - - 264.7 KB
DetermineFileFormat doc 22 KB 79.1016 - - 244.83 KB
DetermineFileFormat xls 25 KB 74.7070 - - 230.35 KB
DetermineFileFormat xps 141 KB 45.4102 45.4102 45.4102 190.66 KB
DetermineFileFormat pptx 29 KB 59.5703 - - 184.97 KB
DetermineFileFormat pdf 138 KB 43.4570 43.4570 43.4570 149.31 KB
DetermineFileFormat vsdx 30 KB 36.6211 - - 113.73 KB
DetermineFileFormat ods 19 KB 33.2031 - - 102.08 KB
DetermineFileFormat docx 12 KB 20.9961 - - 65.59 KB
DetermineFileFormat bmp 44 KB 17.8223 - - 55.22 KB
DetermineFileFormat jpg 33 KB 14.4043 - - 44.53 KB
DetermineFileFormat png 11 KB 7.0801 - - 22.16 KB
DetermineFileFormat rtf 38 KB 16.1133 - - 49.78 KB

Notice how much memory is being used relative to the file size for the CompoundFileBinary derived types (msg ppt, vsd, doc, xls, etc.):

If I then comment out the creation of CompoundFile, leaving the memory stream in place:

using (var fileData = new MemoryStream(header))
{
    // var cf = new CompoundFile(fileData);
    // return cf.RootStorage.TryGetStream(Storage) != null ? true : false;
    return false;
}

The relative memory usage of the CFB types drops hugely:

Method File Size Gen 0 Gen 1 Gen 2 Allocated
DetermineFileFormat odp 196 KB 43.4570 61.5234 61.5234 289.74 KB
DetermineFileFormat xps 141 KB 30.7617 45.4102 45.4102 190.66 KB
DetermineFileFormat pptx 29 KB 16.1133 - - 184.97 KB
DetermineFileFormat pdf 138 KB 26.8555 43.4570 43.4570 149.31 KB
DetermineFileFormat vsdx 30 KB 11.7188 - - 113.73 KB
DetermineFileFormat ppt 96 KB 10.9863 30.7617 30.7617 108.37 KB
DetermineFileFormat ods 19 KB 7.0801 - - 102.08 KB
DetermineFileFormat ppt 84 KB 17.8223 26.8555 26.8555 95.82 KB
DetermineFileFormat docx 12 KB 61.5234 - - 65.59 KB
DetermineFileFormat bmp 44 KB 20.9961 - - 55.22 KB
DetermineFileFormat vsd 42 KB 36.6211 - - 54.19 KB
DetermineFileFormat rtf 38 KB 17.3340 - - 49.78 KB
DetermineFileFormat jpg 33 KB 33.2031 - - 44.53 KB
DetermineFileFormat xls 25 KB 45.4102 - - 36.62 KB
DetermineFileFormat doc 22 KB 14.4043 - - 34.11 KB
DetermineFileFormat msg 21 KB 10.4980 - - 32.61 KB
DetermineFileFormat msg 20 KB 10.0098 - - 31.6 KB
DetermineFileFormat png 11 KB 59.5703 - - 22.16 KB

That said, I agree that the allocation of the header byte array is a problem and could be improved, probably by passing the stream into IsMatch, as you suggested.

My concern is that doing that won't actually solve the issue, which is why I'll take a closer look at the CompoundFile usage at the same time.