neilharvey / FileSignatures

A small library for detecting the type of a file based on header signature (also known as magic number).
MIT License

Extracting the file signature without reading the entire file #34

Closed rebeccapowell closed 3 years ago

rebeccapowell commented 3 years ago

Based on the code, it looks like the entire file is read into memory, rather than just the file header.

https://github.com/neilharvey/FileSignatures/blob/be7656778addb83042a585fec05e8e6254b9bc72/src/FileSignatures/FileFormatInspector.cs#L89

Is that correct?

Since you know the max length of all headers:

https://github.com/neilharvey/FileSignatures/blob/be7656778addb83042a585fec05e8e6254b9bc72/src/FileSignatures/FileFormatInspector.cs#L71

Would it be possible to simply only read the part of the file that is required for the identification?

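// Sketch: read only as many bytes as the longest known header needs.
// `stream` is the input Stream being inspected.
var maxHeaderLength = _formats.Max(t => t.HeaderLength);

var headerData = new byte[maxHeaderLength];
var bytesRead = 0;
while (bytesRead < maxHeaderLength)
{
    var read = await stream.ReadAsync(headerData.AsMemory(bytesRead));
    if (read == 0)
        break; // end of stream: the file is shorter than the longest header
    bytesRead += read;
}

// Walk backwards so removing an item does not shift the ones still to check.
for (int i = candidates.Count - 1; i >= 0; i--)
{
    if (!candidates[i].IsMatch(headerData))
        candidates.RemoveAt(i);
}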

For really large files, reading the entire file when you only need the first few bytes is a bit painful.

neilharvey commented 3 years ago

The file should only be read in its entirety in certain circumstances. The IFileFormatReader interface is used by a couple of formats, specifically OfficeOpenXml and CompoundFileBinary.

OfficeOpenXml (Office) files are based on the Zip format, and detection works by looking through the list of entries in the zip file for a particular item that identifies the type. Unfortunately this requires reading the entire zip into memory: if I recall correctly, the content table is stored at the tail of the zip, so it is unavoidable. For CompoundFileBinary (legacy Office files) we pass the stream off to another library, which I believe needs to read it to the end, but I would need to check.
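
To make the zip-entry check concrete, here is a minimal sketch using System.IO.Compression.ZipArchive. It is not the library's actual code: the helper name ContainsEntry is hypothetical, and the entry name used in the usage note is an assumption about what identifies a Word document.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;

public static class ZipEntryCheck
{
    // Hypothetical helper: returns true if the zip archive in `stream`
    // contains an entry whose full name matches `entryName`.
    public static bool ContainsEntry(Stream stream, string entryName)
    {
        // leaveOpen: true so the caller keeps ownership of the stream.
        using var archive = new ZipArchive(stream, ZipArchiveMode.Read, leaveOpen: true);

        // Entries is populated from the archive's table of contents,
        // so the entry data itself is not decompressed here.
        return archive.Entries.Any(e =>
            string.Equals(e.FullName, entryName, StringComparison.OrdinalIgnoreCase));
    }
}
```

A call might look like ZipEntryCheck.ContainsEntry(stream, "word/document.xml"), assuming that entry is the one used to recognise a Word document. Note that ZipArchive needs to reach the end of the archive to find the entry table, which is why a forward-only stream has to be consumed in full.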

For other formats it shouldn't read the entire file. It should iterate through all the formats and discard the ones which do not match (many of these will be discarded after reading the first byte or two) until it finds a match.
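
The discard-as-you-go idea described above can be sketched roughly as follows. This is an illustration only, not the library's implementation: the Signature record and the HeaderMatcher class are hypothetical stand-ins for the real format types.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical stand-in for a format definition: a name plus its magic bytes.
public sealed record Signature(string Name, byte[] Header);

public static class HeaderMatcher
{
    // Reads the stream one byte at a time and drops every candidate whose
    // header disagrees with the byte at the current position.
    public static List<Signature> Match(Stream stream, IEnumerable<Signature> formats)
    {
        var candidates = formats.ToList();
        var position = 0;

        // Keep reading only while some candidate still has unchecked header bytes.
        while (candidates.Any(f => f.Header.Length > position))
        {
            var next = stream.ReadByte(); // -1 signals end of stream

            // Candidates whose header is already fully matched are left alone.
            candidates.RemoveAll(f =>
                f.Header.Length > position &&
                (next < 0 || f.Header[position] != (byte)next));

            position++;
        }

        return candidates;
    }
}
```

With this approach the number of bytes read never exceeds the length of the longest header still in play, so a mismatch on the first byte or two removes most candidates before anything else is read.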

neilharvey commented 3 years ago

Closing - answer as per above.